Compression
Overview of how compression works in WarpStream and how to configure it.
How It Works in Kafka
Compression in open source Apache Kafka is (mostly) controlled by the producer clients. Producers batch records together to create compressed batches, send those compressed batches to the Kafka brokers, and then the Kafka brokers write them to disk unchanged.
Similarly, when a Consumer client fetches records from the Kafka Broker, the broker transmits the records "as is" over the wire without modification (using the sendfile syscall).
This means that in practice with Apache Kafka, the compression of batches both at rest and over the wire is almost completely controlled by the Producer clients.
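For illustration, here is a minimal sketch using the confluent-kafka Python client; the broker address and topic name are placeholders. Unless a topic's own compression.type setting overrides it, the producer's choice determines the compression used both over the wire and at rest:

```python
from confluent_kafka import Producer

# The producer compresses each record batch client-side; a vanilla Kafka
# broker then writes that compressed batch to disk unchanged.
producer = Producer({
    "bootstrap.servers": "broker:9092",  # placeholder address
    "compression.type": "zstd",          # none, gzip, snappy, lz4, or zstd
})

producer.produce("logs", value=b"hello, world")  # placeholder topic
producer.flush()
```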
How It Works in WarpStream
Unlike Apache Kafka, WarpStream's storage engine is not tightly coupled with the record-batch format. As a result, when batches are sent to the WarpStream Agents by producer clients, they're decompressed, encoded into WarpStream's file format, and then recompressed.
Similarly, when a Consumer client fetches records from a WarpStream Agent, the records are decoded/decompressed out of WarpStream's file format, and then re-encoded/compressed into Kafka's record-batch format. The practical implication of this is that users can control which compression algorithm is used to compress the record-batches that are sent to the consumer clients independently from the compression that was used by the storage engine and producers.
For example, in WarpStream it's possible to configure your producer clients to compress records using LZ4 over the wire, then have the WarpStream Agents store them using ZSTD in the object storage bucket, and finally transmit them as GZIP from the Agents to the consumer.
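A sketch of that setup, with the producer side shown via the confluent-kafka Python client and the Agent side shown as the environment variables documented below (addresses and values are illustrative):

```python
from confluent_kafka import Producer

# Producer --> Agent: the wire compression is chosen by the producer client.
producer = Producer({
    "bootstrap.servers": "warpstream-agent:9092",  # placeholder address
    "compression.type": "lz4",
})

# Storage (at rest): chosen by the Agent, independently of the producer:
#   WARPSTREAM_STORAGE_COMPRESSION=zstd
#
# Agent --> Consumer: also chosen by the Agent (or a per-topic override):
#   WARPSTREAM_KAFKA_FETCH_COMPRESSION=gzip
```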
The table below compares Open Source Kafka with WarpStream in terms of who is in control of the compression algorithm for every operation.

| Operation | Apache Kafka | WarpStream |
| --- | --- | --- |
| Producer --> Broker/Agent | Producer Client | Producer Client |
| Storage (at rest) | Producer Client or Topic Configuration | Agent Configuration |
| Broker/Agent --> Consumer | Producer Client or Topic Configuration | Agent Configuration |
Configuring Compression in WarpStream
Agent Level Configuration
There are two Agent-level flags that control compression:

`-storageCompression` (`WARPSTREAM_STORAGE_COMPRESSION`)
- Default: `lz4` (this will be changed to `zstd` in the future)
- Valid options: `lz4`, `zstd`
- Topic level configuration override: N/A

`-kafkaFetchCompression` (`WARPSTREAM_KAFKA_FETCH_COMPRESSION`)
- Default: `lz4`
- Valid options: `lz4`, `snappy`, `gzip`, `zstd`, `none`
- Topic level configuration override: `warpstream.compression.type.fetch`
Setting this value in a topic's configuration overrides which compression algorithm is used to return compressed batches to consumers of that topic, regardless of which setting is configured on the Agents.

For example, setting `warpstream.compression.type.fetch=zstd` on topic `logs` would cause all consumers of the `logs` topic to receive record-batches compressed using `zstd`, even if the value of `-kafkaFetchCompression` was set to `lz4`.
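Because WarpStream speaks the Kafka protocol, this override can be applied with standard Kafka admin tooling (for example kafka-configs.sh). Below is a sketch using the confluent-kafka Python client; the bootstrap address is a placeholder, and note that the legacy AlterConfigs API used here replaces the topic's whole dynamic configuration, so incremental alternatives may be preferable in production:

```python
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "warpstream-agent:9092"})  # placeholder

# Override fetch compression for the "logs" topic only.
resource = ConfigResource(ConfigResource.Type.TOPIC, "logs")
resource.set_config("warpstream.compression.type.fetch", "zstd")

futures = admin.alter_configs([resource])
futures[resource].result()  # raises if the update failed
```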
Why Compression Still Matters in WarpStream
WarpStream's architecture eliminates most of the networking costs that dominate Apache Kafka deployments, since data flows through object storage rather than being replicated between brokers. However, compression is still important for WarpStream clusters for two reasons:
First, workloads with long retention still see significant storage cost savings from improved compression ratios, even when object storage is the only storage tier.
Second, even though networking is free, the networking capacity of the cloud VMs that the Agents run on is not unlimited. Between producers, consumers, and background compactions, the WarpStream Agents are very network intensive. In general, we recommend running on network-optimized instances like `m6in` in AWS, but even so, the best way to prevent the Agents from exceeding the network capacity of their VMs is to use a strong compression algorithm like ZSTD.
Difference with Kafka for Fetch Requests
There is one additional difference between WarpStream and Kafka with regards to compression and Fetch requests. When Kafka clients issue Fetch requests, they specify a limit on the amount of data to be returned (in aggregate, and per-partition) by the Broker. In Apache Kafka, the brokers interpret the limit in terms of compressed bytes, but in WarpStream the Agents interpret the limit in terms of uncompressed bytes.
WarpStream intentionally breaks from the behavior of Apache Kafka here, primarily for safety reasons. A common issue with Apache Kafka is that something changes in the workload that dramatically improves the compression ratio, causing downstream consumers to begin OOMing: because the Fetch limits are counted in compressed bytes, a better compression ratio packs more uncompressed data into each response, all of which must be buffered in consumer memory. This is particularly problematic because the change could be something as benign as the upstream producers changing compression algorithms, or sending a burst of highly repetitive and easily compressible data. This problem would be exacerbated even further in WarpStream because the background compactions that the Agents perform improve data locality and, as a result, data compression.
WarpStream avoids these issues altogether by interpreting the limits as uncompressed bytes, so that the amount of raw data returned by each Fetch remains the same regardless of fluctuations in the underlying compression ratio.
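For instance, with the confluent-kafka Python client, the standard fetch limits behave predictably against WarpStream Agents because they cap uncompressed bytes (the bootstrap address, group, and topic are placeholders):

```python
from confluent_kafka import Consumer

# Both limits below cap the size of Fetch responses. Against WarpStream
# Agents they count *uncompressed* bytes, so consumer memory usage stays
# stable even if the workload's compression ratio suddenly improves.
consumer = Consumer({
    "bootstrap.servers": "warpstream-agent:9092",  # placeholder address
    "group.id": "example-group",                   # placeholder group
    "fetch.max.bytes": 50 * 1024 * 1024,       # ~50 MiB per Fetch response
    "max.partition.fetch.bytes": 1024 * 1024,  # ~1 MiB per partition
})
consumer.subscribe(["logs"])  # placeholder topic
```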