Change Log

Contains a history of changes made to the Agent.

Release v578

August 30th, 2024

  • Add "ratelimited" outcome for S3-backed blob storage metrics warpstream_blob_store_*

  • Increased maximum allowed ingestion file size from 8MiB to 16MiB

  • Improve buckets for distribution metrics for warpstream_agent_segment_batcher_flush_file_size_uncompressed_bytes and warpstream_agent_segment_batcher_flush_file_size_compressed_bytes

  • Fixed a bug where we returned the wrong error code when a produced record was larger than the maximum allowed size (32MiB)

Release v577

August 25th, 2024

  • Fix bug in backpressure system that would cause Agents to get stuck in backpressure state.

Release v576

August 22nd, 2024

  • Fix bug affecting librdkafka library's cooperative rebalance, when adding more clients to a consumer group.

  • Add new startup flags to enable mutex and block profiling:

    • -enableSetMutexProfileFraction/WARPSTREAM_ENABLE_SET_MUTEX_PROFILE_FRACTION: enable this flag to call 'runtime.SetMutexProfileFraction' with the value passed along -mutexProfileFraction.

    • -mutexProfileFraction/WARPSTREAM_MUTEX_PROFILE_FRACTION: tune the value passed to call 'runtime.SetMutexProfileFraction

    • -enableSetBlockProfileRate/WARPSTREAM_ENABLE_SET_BLOCK_PROFILE_RATE: enable this flag to call 'runtime.SetBlockProfileRate' with the value passed along -blockProfileRate

    • -blockProfileRate/WARPSTREAM_BLOCK_PROFILE_RATE: tune the value passed to call 'runtime.SetBlockProfileRate'

Release v575

August 18th, 2024

  • Report CPU usage using a moving average instead of point-in-time to improve accuracy.

  • Improve performance when a cluster has many clients polling partitions that are written to infrequently.

Release v574

August 15th, 2024

  • Increase default limits for backpressuring produce requests by 400%

Release v573

August 14th, 2024

  • Disable automatic retries in underlying blob storage clients (AWS / GCP) so we can control blob storage retries at the application layer.

  • Switch to C++ library for ZSTD instead of pure Go (2-3x faster).

Release v572

August 6th, 2024

  • Improved the Agents ability to backpressure Produce requests.

  • Added back-pressure support for Fetch requests in addition to Produce requests.

Release v571

August 2nd, 2024

  • Improved Agent backpressuring for Produce requests so that Agents will begin throttling Produce request and individual connections when they have too much producer data buffered in memory instead of OOMing.

  • Agents now return KafkaStorageError instead of RequestTimedOut for some timeout-related errors in the Fetch path. This improves behavior with Java consumer clients which treat KafkaStorageError as retriable, but not RequestedTimedOut.

  • Fixed a bug in demo / playground mode where the requested hostname strategy was not being used.

  • Tagged all Agent metrics with the agent_roles , agent_group , and virtual_cluster_id tags to make debugging advanced deloyments easier.

Release v570

July 17th, 2024

  • Fixed a bug that would cause idempotent producer out of sequence errors when idempotent producer was enabled on some clients. Data was never committed out of order, but the previous implementation would force the client to retry more than necessary, resulting in high latency or reduced throughput

  • Added Schema Validation Support:

Release v569

July 5th, 2024

  • Add a new warpstream_max_offset metric tagged by topic/partition (that can be controlled with the existing disableConsumerGroupsMetricsTags flag).

  • The existing warpstream_consumer_group_max_offset metric is deprecated as it shares the same value across consumer groups and it can be replaced with the new metric mentioned above.

  • Add the topic label/tag to warpstream_agent_kafka_produce_uncompressed_bytes_counter that was missing it.

Release v568

July 3rd, 2024

  • Various small performance improvements (batching, allocations, etc).

  • Batch Metadata and FindCoordinator request RPCs between the Agents and the control plane. Helpful for workloads with a large number of consumer/producer clients.

Release v567

June 19th, 2024

  • Allow individual fetch requests to fetch up to 128MiB of uncompressed data per topic-partition, per fetch request, increased from 32 MiB.

  • Add -kafkaMaxFetchRequestBytesUncompressedOverride flag and WARPSTREAM_KAFKA_MAX_FETCH_REQUEST_BYTES_UNCOMPRESSED_OVERRIDE env variable to override maximum number of uncompressed bytes that can be fetched in a single fetch request.

  • Add -kafkaMaxFetchPartitionBytesUncompressedOverride and WARPSTREAM_KAFKA_MAX_FETCH_PARTITION_BYTES_UNCOMPRESSED_OVERRIDE env variable to override maximum number of uncompressed bytes that can be fetched for a single topic-partition in a single fetch request.

  • Introduced the metric warpstream_consumer_group_generation_id with the tags consumer_group. This metric indicates the generation number of the consumer group, incrementing by one with each rebalance. It serves as an effective indicator for detecting occurrences of rebalances.

  • Added metric warpstream_consumer_group_estimated_lag_very_coarse_do_not_use_to_measure_e2e_seconds (tags: consumer_group, topic, partition). Provides an estimated lag in seconds (equivalent to warpstream_consumer_group_lag but in time units instead of offsets), calculated via interpolation/extrapolation. Caution: Very coarse; unsuitable for precise end-to-end latency measurement.

  • Renamed Flags and Environment Variables:

    • benthosBucketURL (WARPSTREAM_BENTHOS_BUCKET_URL) is now bentoBucketURL (WARPSTREAM_BENTO_BUCKET_URL): Bucket URL to use when fetching the Bento configuration.

    • benthosConfigPath (WARPSTREAM_BENTHOS_CONFIG_PATH) is now bentoConfigPath (WARPSTREAM_BENTO_CONFIG_PATH): Path in the bucket to fetch the Bento configuration.

Release v566

June 5th, 2024

  • Fix broken formatting of logs that were sometimes logfmt when they should've been JSON.

Release v565

June 5th, 2024

  • Improve the Agent's ability to handle certain pathological compactions where an input stream is not read from for so long that the object store closes the connection. The Agent will now detect this scenario and issue a new GET request to resume the compaction instead of failing it entirely and never making progress.

Release v564

May 26th, 2024

  • Improve performance of file cache for high throughput workloads by improving how batching is handled.

  • Add support for more Benthos components: [awk, jsonpath, lang, msgpack, parquet, protobuf, xml, zstd].

  • Fix rare memory leak.

  • Fix rare bug that would cause circuit breakeres to get stuck in open / half-open state forever.

Release v563

May 22nd, 2024

  • Fixed a bug in the Agent that prevented users from starting playgrounds or demos.

Release v562

May 19th, 2024

Agent

  • Make Agent batch size configurable using -batchMaxSizeBytes flag and WARPSTREAM_BATCH_MAX_SIZE_BYTES environment variable.

  • Metrics

    • Add new Agent metric (gauge) that indicates state of consumer group: warpstream_consumer_group_state. Metric has two tags: consumer_group and group_state.

    • Add new Agent metric (gauge) that indicates the number of members in each consumer group: warpstream_consumer_group_num_members. Metric has one primary tag: consumer_group.

    • Add new Agent metric (gauge) that indicates the number of topics in each consumer group: warpstream_consumer_group_num_topics. Metric has one primary tag: consumer_group.

    • Add new Agent metric (gauge) that indicates the number of partitions in each consumer group: warpstream_consumer_group_num_partitions. Metric has one primary tag: consumer_group.

    • Add new Agent metric (gauge) that indicates the total number of topics in the cluster: warpstream_topic_count.

    • Add new Agent metric (gauge) that indicates the limit for the total number of topics in the cluster: warpstream_topic_count_limit.

    • Add new Agent metric (gauge) that indicates the total number of partitions in the cluster: warpstream_partition_count.

    • Add new Agent metric (gauge) that indicates the limit for the total number of partitions in the cluster: warpstream_partition_count_limit.

Release v559

May 13th, 2024

Agent

  • Add support for WarpStream Managed Data Pipelines

  • Change the default number of uncompressed bytes that will be be buffered before flushing a file to object storage from 8MiB to 4MiB

Release v558

May 10th, 2024

Agent

  • Support adopting AWS roles for providing access to S3 via environment variables instead of a query argument in the bucket URL.

  • Tuned circuit breakers to be less aggressive by default, and configured the retry mechanisms to avoid retrying circuit breaker errors.

  • New Metrics

    • warpstream_circuit_breaker_open: Records the number of times the circuit breaker is opened, tagged by the circuit breaker name to identify the associated operations. - tags: name

    • warpstream_circuit_breaker_close: Records the number of times the circuit breaker is closed, tagged by the circuit breaker name to identify the associated operations.

      • tags: name

    • warpstream_circuit_breaker_halfopen: Records the number of times the circuit breaker is half-opened, tagged by the circuit breaker name to identify the associated operations.

      • tags: name

    • warpstream_circuit_breaker_error: Records the number of times the circuit breaker blocks an operation, tagged by the circuit breaker name to identify the associated operations.

      • tags: name

Release v557

May 9th, 2024

Agent

  • Enhanced Buckets: Refined the buckets for warpstream_agent_kafka_fetch_uncompressed_bytes, warpstream_agent_kafka_fetch_compressed_bytes, warpstream_agent_kafka_produce_compressed_bytes and warpstream_agent_kafka_produce_uncompressed_bytes to ensure they are more precise. The previous bucket configuration was too sparse.

  • Fixed Prometheus Counters: Resolved an issue with Prometheus counters previously were not appropriately incrementing the value.

Release v556

May 9th, 2024

Agent

  • Tune circuit breakers to be less aggressive, and don't retry requests that failed due to a circuit breaker error.

  • New metrics:

    • warpstream_circuit_breaker_open: Records the number of times the circuit breaker is opened, tagged by the circuit breaker name to identify the associated operations.

      • tags: name

    • warpstream_circuit_breaker_close: Records the number of times the circuit breaker is closed, tagged by the circuit breaker name to identify the associated operations.

      • tags: name

    • warpstream_circuit_breaker_halfopen: Records the number of times the circuit breaker is half-opened, tagged by the circuit breaker name to identify the associated operations.

      • tags: name

    • warpstream_circuit_breaker_error: Records the number of times the circuit breaker blocks an operation, tagged by the circuit breaker name to identify the associated operations.

      • tags: name

Release v555

May 8th, 2024

Agent

  • New metrics:

    • warpstream_agent_kafka_fetch_uncompressed_bytes_counter: Tracks the count of uncompressed bytes fetched. Although a histogram version (warpstream_agent_kafka_fetch_uncompressed_bytes) already exists, it may not provide an accurate count in some metrics systems. That's why we've made this data directly available as a counter.

      • tags: topic (requires enabling high-cardinality metrics)

    • agent_kafka_produce_uncompressed_bytes_counter: Tracks the count of uncompressed bytes produced. While the histogram version (warpstream_agent_kafka_produce_uncompressed_bytes) is available, some metrics systems might not accurately count it. So, we're also offering this data directly as a counter.

      • tags: topic (requires enabling high-cardinality metrics)

Release v554

April 30th, 2024

Agent

  • Replaces all instances of Ristretto cache with a simple LRU cache, dramatically reduces the number of caches misses in topic metadata related caches.

Release v553

April 30th, 2024

Agent

  • Rename metrics:

    • warpstream_agent_kafka_inflight_conn has been renamed to warpstream_agent_kafka_inflight_connections

    • warpstream_agent_kafka_inflight_request has been renamed to warpstream_agent_kafka_inflight_requests

Release v552

April 29th, 2024

Agent

  • Dramatically improve the performance of fetch requests that query 100s or 1000s of partitions in a single fetch request.

  • Add speculative retries for blob storage reads in the fetch path (in addition to existing speculative retries for blob storage writes in the write path).

Release v551

April 26th, 2024

Agent

  • Convert an error log (that did not actually represent an error) to an info log to reduce error spam.

Release v550

April 23nd, 2024

Agent

  • High-Cardinality Metrics: Introduced a new agent configuration for enabling high-cardinality metrics. To activate this feature, use the command-line flag -kafkaHighCardinalityMetrics or set the environment variable WARPSTREAM_KAFKA_HIGH_CARDINALITY_METRICS=true.

  • Deprecated metrics:

    • warpstream_agent_kafka_fetch_bytes_sent

    • warpstream_agent_segment_batcher_flush_file_size_counter_type

    • warpstream_agent_segment_batcher_flush_file_size

    • blob_store_list_latency

    • blob_store_list_count

    • blob_store_delete_latency

    • blob_store_delete_count

    • blob_store_put_bytes_latency

    • blob_store_put_bytes_count

    • blob_store_put_stream_latency

    • blob_store_put_stream_count

    • blob_store_get_bytes_latency

    • blob_store_get_bytes_count

    • blob_store_get_stream_latency

    • blob_store_get_stream_count

    • blob_store_get_bytes_range_latency

    • blob_store_get_bytes_range_count

    • blob_store_get_stream_range_latency

    • blob_store_get_stream_range_count

  • New metrics:

    • warpstream_agent_kafka_fetch_uncompressed_bytes: Tracks the total uncompressed bytes fetched, replacing warpstream_agent_kafka_fetch_bytes_sent.

      • tags: topic (requires enabling high-cardinality metrics)

    • warpstream_agent_kafka_produce_uncompressed_bytes: Tracks the number of uncompressed bytes produced.

      • tags: topic (requires enabling high-cardinality metrics)

    • warpstream_agent_segment_batcher_flush_file_size_uncompressed_bytes: Tracks the uncompressed size of files stored after batching, serving as a replacement for warpstream_agent_segment_batcher_flush_file_size.

    • warpstream_blob_store_operation_latency: Tracks the latency and count of individual object storage operations. It's a replacement for all deprecated blob_store_ metrics. Now all the different operations are tagged within the same metric.

      • tags: operation,outcome

Release v549

April 22nd, 2024

Agent

  1. Adds support for authenticating Kafka clients using mTLS. A -requireMTLSAuthentication flag is added, and the previous -tlsVerifyClientCert flag has been deprecated. A new -requireSASLAuthentication flag is added, and the previous -requireAuthentication flag is deprecated. When authenticating using mTLS, the Distinguished Name(DN) from the client certificate is used as the Kafka Principal. The -tlsPrincipalMappingRule flag can be used to specify a Regex to extract a principal from the DN. For example, the rule CN=([^,]+) will extract the Common Name(CN) from the DN, and use that as the ACL principal.

Release v548

April 17th, 2024

Agent

  1. Fix a rare bug where some compactions could fail if you had tombstone expiration enabled and very little data to compact.

Release v547

April 11nd, 2024

Agent

  1. Fix playground mode against latest control plane signup constraints

Release v545

April 4nd, 2024

Agent

  1. Added support to new regions. There is a new "-region" (or WARPSTREAM_REGION env variable) that you can leverage to pick among the supported regions - you can use the console to see the current list.

Release v544

April 2nd, 2024

Agent

  1. Added support for TLS/mTLS for Warpstream Agent and kafka client connections. The -kafkaTLS(env WARPSTREAM_TLS_ENABLED=true), -tlsServerCertFile(env WARPSTREAM_TLS_SERVER_CERT_FILE=<filepath>), tlsServerPrivateKeyFile(env WARPSTREAM_TLS_SERVER_PRIVATE_KEY_FILE=<filepath>), -tlsVerifyClientCert(env WARPSTREAM_TLS_VERIFY_CLIENT_CERT=true), -tlsClientCACertFile(env WARPSTREAM_TLS_CLIENT_CA_CERT_FILE=<filepath>) flags were added which respectively are used to enable tls, pass the TLS server certificate to the Agent, pass the TLS server private key to the Agent, optionally enable client certificate verification, and to optionally pass the TLS root certificate authority certificate file to the server.

Release v543

March 29th, 2024

Agent

  1. Adding a new "availabilityZoneRequired" (or env variable "WARPSTREAM_AVAILABILITY_ZONE_REQUIRED") flag. When enabled, the agent will synchronously try to resolve its availability zone during startup for 1 min, and will not start serving its /v1/status health check until it succeeds. The process will exit early if it did not manage to resolve the availability zone.

Release v542

March 26th, 2024

Agent

  1. Fixed a race condition during startup that could make some agents advertise their availability zone as "WARPSTREAM_UNSET_AZ"

Release v541

March 25th, 2024

Agent

  1. Added beta support for benthos in the WarpStream Agents

Release v539

March 15th, 2024

Agent

  1. Adding a new "agentGroup" parameter to define the name of the 'group' that the Agent belongs to. This feature is used to isolate groups of Agents that belong to the same logical cluster, but should not communicate with each other because they're deployed in separate cloud accounts, vpcs, or regions. By default the agent belongs to the default group.

Release v538

March 6th, 2024

Agent

  1. Fixed a rare panic in the Agents caught by Antithesis

Release v537

March 1st, 2024

Agent

  1. Kafka "compacted topics" are generally available.

Release v536

February 28th, 2024

Agent

  1. Fix a bug with the warpstream playground command, introduced in v535.

Release v535

February 26th, 2024

Agent

  1. Make agent pool name argument completely optional, even when using non-default clusters.

  2. Fixed a memory leak that would cause some workloads to use an excessive amount of memory over time.

  3. Add support for S3 express and separating the "ingestion" bucket from the "compaction" bucket so data can be landed into low-latency storage and then immediately compacted into lower cost storage.

  4. Add speculative retries to file flushing which dramatically reduces outlier latency for Produce requests.

Release v534

Agent

Bug fixes

  1. Fixed another bug in the fetch logic that would result in the Agent returning empty batches in some scenarios when transient network errors occurred. This did not cause any correctness issues, but would make librdkafka refuse to proceed in some scenarios and block consumption of some partitions.

Release v533

Agent

Bug fixes

  1. Fixed a bug in the fetch logic that would result in the Agent returning empty batches in some scenarios depending on the client configuration. This did not cause any correctness issues, but would make librdkafka refuse to proceed in some scenarios and block consumption of some partitions.

Release v532

Agent

New Features

  1. Improves roles reporting to Warpstream control plane so that they can be properly rendered in our console.

Release v531

Agent

New Features

  1. Supports automatic availability zone detection in kubernetes reading the node zone label (requires version 0.10.0 of our helm charts).

Release v530

Agent

New Features

  1. Compatibility with CreatePartitions Kafka API for updating partitions count.

Release v529

Agent

New Features

  1. Treat grade nat as private IP addresses, and allow advertising it.

Performance improvements

  1. Improve caching in the agent.

  2. Improve batching of requests to the warpstream metadata backend.

  3. Other general performance improvements.

Release v526

Agent

New Features

  1. Full-support for ACLs using SASL credentials.

Release v525

Agent

Bug Fixes and performance Improvements

  1. Improved object pooling in RPC layer to reduce memory usage.

  2. Add transparent batching to the file cache RPCs to dramatically reduce the number of inter-agent RPCs for high partition workloads.

  3. Enhanced Apache Kafka compatibility by refining the handling of watermarks in Fetch responses.

Release v524

Agent

Dynamic Consumer Group Rebalance Timeout

  1. Partially handle consumer group requests (JoinGroup and SyncGroup) in the agent, to use the clients' rebalance timeout, instead of the previous 10s default. This enhancement will minimize unnecessary rebalance attempts in larger consumer groups with numerous members, due to short timeouts.

Release v523

Agent

Bug Fixes and performance Improvements

  1. Fixed a bug in the Fetch() code that was not setting the correct topic ID in error responses which made some Kafka clients emit warning logs when this happened.

  2. Fixed a bug in the "roles" feature that was causing Agents with the "produce" role to still participate in the distributed file cache. Now only Agents with the "consume" role will participate in the file cache, as expected.

Release v522

Agent

Bug Fixes and performance Improvements

  1. Circuit breakers will now return example errors for clarity.

  2. Fetch() code path will now handle failures more gracefully by returning incremental results in more scenarios which improves the system's ability to recover under load.

  3. Fix a memory leak in the in-memory file cache implementation.

Release v521

Agent

New Features

  1. Docker images are now multi-arch, our documentation and official kubernetes charts has been updated accordingly.

  2. Introduced circuit breakers around object storage access.

  3. Finer control over agent roles: it is now possible to split between the proxy-consume and proxy-produce roles, our documentation has been updated as well.

Release v520

Agent

This release is the first phase of a two-phase upgrade to WarpStream's internal file format. This release adds support for reading the upgraded file format. You MUST upgrade all Agents to this version before moving from any version < v520 to any version > than v520.

Release v518

Agent

New Features

  1. Support kafka Headers: if you produce messages containing Kafka headers, they will now be automatically persisted to your cloud object storage, and will be read when fetching.

  2. Revisit the flags and configuration knobs to choose how the agents advertise themselves in Warpstream service discovery. Our documentation has been updated accordingly.

  3. Agent nodes can now be configured to run dedicated roles - see splitting roles documentation.

Control Plane

New Features

  1. Fully support kafka ListOffsets protocol: you can now look for partition offsets based on timestamps.

Release v517

Agent

Bug Fixes and performance Improvements

  1. Fixed a bug related to the handling of empty (but not null) values in records in the Fetch implementation.

Release v516

Agent

New Features

  1. The agent will now report a sample of its error logs back to Warpstream control plane. It should ease troubleshooting and help us identify issues earlier. This can be disabled with the flag disableLogsCollection or the environment variable WARPSTREAM_DISABLE_LOGS_COLLECTION.

Bug Fixes and performance Improvements

  1. Added batching in the metadata calls made during Kafka Fetch, improving memory usage along the way.

Control Plane

New Features

  1. Added support for Kafka's InitProducerID protocol message, and the idempotent producer functionality in general. Requires upgrading to a version of the Agents that is >= v515.

  2. Added support for Kafka ListOffsets with positive timestamps value (until now only negative values for special cases were supported)

Release v515

Agent

New Features

  1. The Agents will now report the lag / max offsets for every active consumer group as standard metrics. The metrics can be found as warpstream_consumer_group_lag and warpstream_consumer_group_max_offset .

  2. The Agents will now report the number of files at each compaction level so that user's can monitor whether they are experiencing compaction lag. These metrics can be found as warpstream_files_count and the level is tagged with the name compaction_level.

Bug Fixes and performance Improvements

  1. File cache is now partitioned by <file_id, 16MiB extent> instead of just <file_id>. This spreads the load for fetching data for large files more evenly amongst all the Agents.

  2. Added some logic in the file cache to detect when certain parts of the cache are experiencing high churn and reduce the default IO size for paging in data from object storage. This helps avoid filling the cache with data that won't be read.

  3. Fixed a bug in the file cache that was causing it to significantly *over* fetch data in some scenarios. This did not cause any correctness problems, but it wasted network bandwidth and CPU cycles.

  4. Modified the implementation of the Kafka Fetch method to return incremental results when it experiences a retryable error mid-fetch. This makes the Agents much better at recovering from disruption and catching consumer lag incrementally.

  5. Added some pre-fetching logic into the Kafka Fetch method so that when data for a single partition is spread amongst many files the Agent doesn't get bottlenecked making many single-threaded RPCs. This mostly helps increase the speed at which individual partitions can be "caught up" when lagging.

  6. Increased the default maximum file size created at ingestion time from 4MiB to 8MiB. This improves performance for extremely high volume workloads.

  7. Added replication to the Agent file cache so that if an error is experienced trying to load data from the file cache on the Agent node that is "responsible" for a chunk of data, the client can retry on a different node. This helps minimize disruption when Agents shutdown ungracefully.

  8. Agents now report their CPU utilization to the control plane. We will use this information in the future to improve load balancing decisions. CPU utilization can be view in the WarpStream Admin console now as well.

  9. Improved the performance of deserializing file footers.

  10. Standardized prometheus metric names prefixes.

  11. Added a lot more metrics and instrumentation, especially around the blob storage library and file cache.

Control Plane

New Features

  1. Added support for the Kafka protocol message DeleteTopics .

Bug Fixes and Performance Improvements

  1. We added intelligent throttling / scheduling to the deadscanner scheduler. This scheduler is responsible for scheduling jobs that run in the Agent to scan for "dead files" in object storage and delete them. Previously these jobs could run with high frequency and rates which would interfere with live workloads. In addition, they could also result in very high object storage API requests costs due to excessive amounts of LIST requests. The new implementation is much more intelligent and automatically tunes the frequency to avoid disrupting the live workload and incurring high API request fees.

Last updated