Contains a history of changes made to the Agent and Control Plane.
- 1. Fixed a bug in the Fetch() code path that failed to set the correct topic ID in error responses, which caused some Kafka clients to emit warning logs.
- 2. Fixed a bug in the "roles" feature that caused Agents with the "produce" role to still participate in the distributed file cache. Now only Agents with the "consume" role participate in the file cache, as expected.
- 1. Circuit breakers will now return example errors for clarity.
- 2. The Fetch() code path will now handle failures more gracefully by returning incremental results in more scenarios, which improves the system's ability to recover under load.
- 3. Fixed a memory leak in the in-memory file cache implementation.
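As a rough illustration of what "returning incremental results" can mean, the sketch below returns whatever was read before a retryable error instead of failing the whole request. The names `RetryableError`, `fetch`, and `read_partition` are hypothetical and chosen for illustration; this is not the Agent's actual implementation.

```python
class RetryableError(Exception):
    """Hypothetical marker for errors that are safe to retry later."""


def fetch(partitions, read_partition):
    """Read each partition in order; on a retryable error, return partial results.

    `read_partition` is a caller-supplied function (an assumption of this
    sketch) that returns the records for one partition or raises.
    """
    results = {}
    for partition in partitions:
        try:
            results[partition] = read_partition(partition)
        except RetryableError:
            if results:
                # Some progress was made: hand it back so the consumer can
                # advance, and let the next Fetch pick up the remainder.
                break
            # Nothing was read yet, so there is nothing incremental to return.
            raise
    return results
```

A consumer that receives a partial response simply issues its next fetch from the offsets it has reached, so progress continues even while some reads keep failing.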
- 1. Docker images are now multi-arch. Our documentation and official Kubernetes charts have been updated accordingly.
- 2. Introduced circuit breakers around object storage access.
- 3. Finer control over Agent roles: it is now possible to split between the `proxy-produce` roles. Our documentation has been updated as well.
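For readers unfamiliar with the pattern, here is a minimal circuit breaker sketch of the kind that could wrap object storage calls. The class names, the failure threshold, and the cool-down period are all assumptions made for illustration; the Agent's real breaker and its tuning are not described in these notes.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after_s = reset_after_s          # cool-down before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed
        self.last_error = None  # kept so rejections can include an example error

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError(f"circuit open; example error: {self.last_error!r}")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception as err:
            self.failures += 1
            self.last_error = err
            # A failed trial call re-opens immediately; otherwise open at the threshold.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a storage read would look like `breaker.call(get_object, key)`, where `get_object` stands in for whatever client function performs the request. While the circuit is open, callers fail fast instead of piling more requests onto a struggling backend.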
This release is the first phase of a two-phase upgrade to WarpStream's internal file format. This release adds support for reading the upgraded file format. You MUST upgrade all Agents to this version before moving from any version < v520 to any version > v520.
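The upgrade rule can be expressed as a small check. The sketch below assumes that "this version" is v520 and that versions can be compared as plain integers; the function names are hypothetical and only illustrate the ordering constraint.

```python
PIVOT = 520  # assumed to be the release described in the note above


def upgrade_is_safe(current: int, target: int) -> bool:
    """Return True if going directly from `current` to `target` respects the rule.

    A direct jump is unsafe only when it starts below the pivot and ends above it.
    """
    return not (current < PIVOT < target)


def plan_upgrade(current: int, target: int) -> list:
    """Return the ordered list of versions to roll out to all Agents."""
    if upgrade_is_safe(current, target):
        return [target]
    # Stop at the pivot first so every Agent can read the upgraded file format.
    return [PIVOT, target]
```

For example, `plan_upgrade(515, 530)` yields a two-step rollout through 520, while an Agent fleet already on 520 can move straight to 530.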
- 1. Added support for Kafka headers: if you produce messages containing Kafka headers, they will now be automatically persisted to your cloud object storage and returned when fetching.
- 2. Revisited the flags and configuration knobs that control how the Agents advertise themselves in WarpStream service discovery. Our documentation has been updated accordingly.
- 1. Fully support the Kafka `ListOffsets` protocol: you can now look up partition offsets based on timestamps.
- 1. Fixed a bug related to the handling of empty (but not null) values in records in the Fetch implementation.
- 1. The Agent will now report a sample of its error logs back to the WarpStream control plane. This should ease troubleshooting and help us identify issues earlier. It can be disabled with the flag `disableLogsCollection` or the environment variable
- 1. Added batching to the metadata calls made during Kafka `Fetch`, improving memory usage along the way.
- 1. Added support for Kafka's `InitProducerID` protocol message, and the idempotent producer functionality in general. Requires upgrading to a version of the Agents that is >=
- 2. Added support for Kafka `ListOffsets` with positive timestamp values (until now, only negative values for special cases were supported).
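To make the distinction between positive and negative timestamps concrete: in the Kafka protocol, `-1` asks for the latest offset and `-2` for the earliest, while a positive timestamp asks for the earliest offset whose record timestamp is greater than or equal to it. The sketch below illustrates that lookup over a hypothetical in-memory index; it assumes timestamps are non-decreasing with offset and says nothing about how the Agents actually store this information.

```python
from bisect import bisect_left

LATEST = -1    # Kafka's special timestamp meaning "latest offset"
EARLIEST = -2  # Kafka's special timestamp meaning "earliest offset"


def list_offset(index, timestamp, log_end_offset):
    """Resolve a ListOffsets-style query.

    `index` is a list of (timestamp_ms, offset) pairs sorted by timestamp
    (a simplifying assumption of this sketch). Returns None when no record
    has a timestamp >= the requested one.
    """
    if timestamp == EARLIEST:
        return index[0][1] if index else log_end_offset
    if timestamp == LATEST:
        return log_end_offset
    # (timestamp,) sorts before any (timestamp, offset), so bisect_left lands
    # on the first entry whose timestamp is >= the requested one.
    i = bisect_left(index, (timestamp,))
    return index[i][1] if i < len(index) else None
```

A consumer could use such a lookup to rewind to "everything since 09:00" by passing the corresponding epoch-millisecond value as the timestamp.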
- 1. The Agents will now report the lag / max offsets for every active consumer group as standard metrics. The metrics can be found as
- 2. The Agents will now report the number of files at each compaction level so that users can monitor whether they are experiencing compaction lag. These metrics can be found as `warpstream_files_count` and the level is tagged with the name
- 1. The file cache is now partitioned by `<file_id, 16MiB extent>` instead of just `<file_id>`. This spreads the load for fetching data for large files more evenly amongst all the Agents.
- 2. Added some logic in the file cache to detect when certain parts of the cache are experiencing high churn and reduce the default IO size for paging in data from object storage. This helps avoid filling the cache with data that won't be read.
- 3. Fixed a bug in the file cache that was causing it to significantly *over* fetch data in some scenarios. This did not cause any correctness problems, but it wasted network bandwidth and CPU cycles.
- 4. Modified the implementation of the Kafka `Fetch` method to return incremental results when it experiences a retryable error mid-fetch. This makes the Agents much better at recovering from disruption and catching up on consumer lag incrementally.
- 5. Added some pre-fetching logic to the Kafka `Fetch` method so that when data for a single partition is spread amongst many files, the Agent doesn't get bottlenecked making many single-threaded RPCs. This mostly helps increase the speed at which individual partitions can be "caught up" when lagging.
- 6. Increased the default maximum file size created at ingestion time from 4MiB to 8MiB. This improves performance for extremely high volume workloads.
- 7. Added replication to the Agent file cache so that if an error occurs while loading data from the file cache on the Agent node that is "responsible" for a chunk of data, the client can retry on a different node. This helps minimize disruption when Agents shut down ungracefully.
- 8. Agents now report their CPU utilization to the control plane. We will use this information in the future to improve load balancing decisions. CPU utilization can now be viewed in the WarpStream Admin console as well.
- 9. Improved the performance of deserializing file footers.
- 10. Standardized Prometheus metric name prefixes.
- 11. Added a lot more metrics and instrumentation, especially around the blob storage library and file cache.
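The extent-based partitioning and cache replication described in this list can be sketched as follows. The 16 MiB extent size comes from the notes above; everything else is an assumption for illustration, in particular the use of rendezvous (highest-random-weight) hashing to pick Agents and the default of two replicas. The notes do not say how the Agents actually assign extents.

```python
import hashlib

EXTENT_SIZE = 16 * 1024 * 1024  # 16 MiB, as stated in the release note


def extents_for_range(file_id, offset, length):
    """Return the <file_id, extent index> cache keys covering a byte range."""
    if length <= 0:
        return []
    first = offset // EXTENT_SIZE
    last = (offset + length - 1) // EXTENT_SIZE
    return [(file_id, i) for i in range(first, last + 1)]


def _score(agent, cache_key):
    # Hash the (agent, file_id, extent) triple to a comparable integer.
    file_id, extent = cache_key
    digest = hashlib.sha256(f"{agent}|{file_id}|{extent}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def replicas(cache_key, agents, n=2):
    """Rank Agents for a cache key; the first is the primary, the rest are fallbacks.

    Rendezvous hashing is a stand-in chosen for this sketch, not the
    documented placement scheme.
    """
    ranked = sorted(agents, key=lambda a: _score(a, cache_key), reverse=True)
    return ranked[:n]
```

Because each extent of a large file hashes independently, reads of one big file fan out across many Agents instead of landing on a single node, and a client that fails against the first entry returned by `replicas` can retry against the next.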
- 1. Added support for the Kafka protocol message
- 1. We added intelligent throttling / scheduling to the deadscanner scheduler. This scheduler is responsible for scheduling jobs that run in the Agent to scan for "dead files" in object storage and delete them. Previously, these jobs could run at a frequency and rate high enough to interfere with live workloads. They could also result in very high object storage API request costs due to excessive numbers of `LIST` requests. The new implementation automatically tunes the frequency to avoid disrupting the live workload and incurring high API request fees.