Compression
Overview of how compression works in WarpStream and how to configure it.
Compression in open source Apache Kafka is (mostly) controlled by the producer clients. Producers batch records together to create compressed batches, send those compressed batches to the Kafka brokers, and then the Kafka brokers write them to disk unchanged.
Similarly, when a Consumer client fetches records from the Kafka Broker, the broker will transmit the records "as is" over the wire without modification (using the sendfile syscall).
This means that in practice with Apache Kafka, the compression of batches both at rest and over the wire is almost completely controlled by the Producer clients.
Unlike Apache Kafka, WarpStream's storage engine is not tightly coupled with the record-batch format. As a result, when batches are sent to the WarpStream Agents by producer clients, they're decompressed, encoded into WarpStream's file format, and then recompressed.
While this may sound expensive, in practice it's quite cheap, especially when you consider the fact that in Kafka every record-batch has to be processed by at least three different Kafka brokers (due to replication), whereas in WarpStream it will only be processed by a single Agent. This approach has a number of other benefits as well, but we won't delve into them right now.
Similarly, when a Consumer client fetches records from a WarpStream Agent, the records are decoded/decompressed out of WarpStream's file format, and then re-encoded/compressed into Kafka's record-batch format. The practical implication of this is that users can control which compression algorithm is used to compress the record-batches that are sent to the consumer clients independently from the compression that was used by the storage engine and producers.
For example, in WarpStream it's possible to configure your producer clients to compress records using LZ4 over the wire, then have the WarpStream Agents store them using ZSTD in the object storage bucket, and finally transmit them as GZIP from the Agents to the consumer.
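A minimal sketch of the producer side of that example, using the confluent-kafka Python client (the Agent address and topic name are illustrative), looks like this; the storage and fetch compression are configured on the Agents and topic as described later in this page:

```python
# Sketch: a producer that compresses batches with LZ4 on the wire.
# The WarpStream Agents will independently re-compress the data for
# object storage (-storageCompression) and for fetches
# (-kafkaFetchCompression / warpstream.compression.type.fetch).
from confluent_kafka import Producer

producer = Producer({
    # Illustrative address; point this at your WarpStream Agents.
    "bootstrap.servers": "warpstream-agent:9092",
    # Producer-controlled wire compression (none, gzip, snappy, lz4, zstd).
    "compression.type": "lz4",
})

producer.produce("logs", value=b"hello warpstream")
producer.flush()
```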
The table below compares Open Source Kafka with WarpStream in terms of which component controls the compression algorithm for each operation.
Operation | Kafka | WarpStream |
---|---|---|
Producer --> Broker/Agent | Producer Client | Producer Client |
Storage (at rest) | Producer Client | Agent Configuration |
Broker/Agent --> Consumer | Producer Client | Agent Configuration |
There are two Agent level flags that control compression:
- `-storageCompression` (`WARPSTREAM_STORAGE_COMPRESSION`)
  - Default: `lz4` (this default will be changed to `zstd` in the future)
  - Valid options: `lz4`, `zstd`
  - Topic level configuration override: N/A
- `-kafkaFetchCompression` (`WARPSTREAM_KAFKA_FETCH_COMPRESSION`)
  - Default: `lz4`
  - Valid options: `lz4`, `snappy`, `gzip`, `zstd`, `none`
  - Topic level configuration override: `warpstream.compression.type.fetch`
Setting this value on the topic's configuration will override which compression algorithm is used to return compressed batches to consumers for this topic, regardless of which setting is configured in the Agents.
For example, setting `warpstream.compression.type.fetch=zstd` on topic `logs` would cause all consumers of the `logs` topic to receive record-batches compressed using `zstd`, even if the value of `kafkaFetchCompression` was set to `lz4`.
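As a sketch of applying that topic-level override, the snippet below uses the confluent-kafka Python AdminClient to set `warpstream.compression.type.fetch` on a hypothetical `logs` topic; the standard `kafka-configs.sh` tool can set the same property.

```python
# Sketch: apply the per-topic fetch compression override described above.
# The Agent address and topic name are illustrative.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "warpstream-agent:9092"})

# Ask the Agents to return batches for the "logs" topic compressed with zstd,
# regardless of the Agent-level -kafkaFetchCompression setting.
resource = ConfigResource(ConfigResource.Type.TOPIC, "logs")
resource.set_config("warpstream.compression.type.fetch", "zstd")

# Note: alter_configs replaces the topic's dynamic config set; if the topic has
# other overrides, prefer incremental_alter_configs on newer client versions.
for res, future in admin.alter_configs([resource]).items():
    future.result()  # raises on failure
```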
WarpStream eliminates all inter-zone networking fees, and replaces all local disks / EBS volumes with only object storage. As a result, it would be easy to conclude that compression is not nearly as important for WarpStream clusters as it is for Apache Kafka clusters. After all, networking is free and object storage is ~24x cheaper than EBS per GiB-stored.
However, compression is still important for WarpStream clusters for two reasons:
First, workloads with long retention will still see significant cost savings from improved compression ratios, even when using object storage as the only storage tier.
Second, even though networking is free, the networking capacity of the cloud VMs that the Agents are running on is not unlimited. Between producers, consumers, and background compactions, the WarpStream Agents are very network intensive. In general, we recommend running on network-optimized instances like m6in
in AWS, but even so, the best way to prevent the Agents from exceeding the network capacity of their VMs is to use a strong compression algorithm like ZSTD.
There is one additional difference between WarpStream and Kafka with regards to compression and Fetch requests. When Kafka clients issue Fetch requests, they specify a limit on the amount of data to be returned (in aggregate, and per-partition) by the Broker. In Apache Kafka, the brokers interpret the limit in terms of compressed bytes, but in WarpStream the Agents interpret the limit in terms of uncompressed bytes.
WarpStream intentionally breaks from the behavior of Apache Kafka here primarily for safety reasons. A common issue with Apache Kafka is that something changes in the workload which dramatically improves the compression ratio and then causes the downstream consumers to begin OOMing. This is particularly problematic because the change could be something as benign as the upstream producers changing compression algorithms, or sending a burst of highly repetitive and easily compressible data. This problem would have been exacerbated even further in WarpStream because the background compactions that the Agents perform improve data locality, and as a result, data compression.
WarpStream avoids all of these issues altogether by interpreting the limits as uncompressed bytes so that the amount of raw data returned in each Fetch remains the same regardless of fluctuations in the underlying compression ratio.
One side effect of this decision is that for some workloads, you will need to tune your Kafka consumer clients to request more data per individual Fetch request to achieve the same throughput with WarpStream. We have detailed recommended settings for a variety of different clients in our tuning for performance documentation.
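As a hedged illustration of that tuning with the confluent-kafka Python client (the values and address below are placeholders, not official recommendations; see the tuning for performance documentation for per-client guidance):

```python
# Sketch: a consumer that requests more data per Fetch request. With WarpStream,
# these limits are interpreted as uncompressed bytes. Values are placeholders.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "warpstream-agent:9092",  # illustrative Agent address
    "group.id": "example-group",
    # Maximum data returned per Fetch response, across all partitions.
    "fetch.max.bytes": 64 * 1024 * 1024,
    # Maximum data returned per partition per Fetch response.
    "max.partition.fetch.bytes": 8 * 1024 * 1024,
})

consumer.subscribe(["logs"])
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(f"received {len(msg.value())} bytes")
consumer.close()
```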