Parquet
Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It forms the backbone of many data lake and table format systems.
A video walkthrough can be found below:
WarpStream is Apache Kafka compatible, and Apache Parquet is a column-oriented data file format that is very popular in the modern data stack. There is no direct connection between Kafka and Parquet, so writing topic data out as Parquet files requires a third-party tool; several exist, some more complex and bloated than others. For this illustration, we will use the open-source pipeline tool Bento. Bento is written in Go and is simple to install and use as a single binary. Its pipeline scripts are written in YAML, which we will cover below.
Bento installed (covered below).
A WarpStream account - get access to WarpStream by registering here.
WarpStream credentials.
A WarpStream cluster up and running with a populated topic.
Bento is the open-source pipeline tool we will use to read from WarpStream and write Parquet files. It can be installed from source, as a binary, or as a Docker container. Visit GitHub for the best instructions for your situation.
Parquet does not need to be installed; it is a supported file format in Bento.
This script expects that you have exported your WarpStream credentials as environment variables. WarpStream will provide this command when you create a credential:
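As a placeholder sketch (the actual variable names and values come from the command WarpStream generates when you create the credential):

```bash
# Placeholder values: copy the real export command from the WarpStream console
export SASL_USERNAME="<your-credential-username>"
export SASL_PASSWORD="<your-credential-password>"
```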
The Bento script performs the following actions (a sketch of a full pipeline file follows this list):
Configure the connection information to WarpStream.
Set the consumer group to `bento_parquet`.
Read from the topic `products`.
The `output` segment is where we will set up everything about the Parquet output.
The `batching` segment gives you granular control over how many records go into each Parquet file and how quickly files are written. Because Parquet is columnar, you can't append to it, so we're telling Bento to create Parquet files with 1,000 records each. Learn more about batching here.
Under `processors` we have `parquet_encode`. We'll do a quick run-through of the settings here, but you can learn more in the Bento docs here.
The `schema` segment defines the table layout we will get from the JSON messages in the topic. We have an exact match in this example, which makes it simple. More complex examples and mutations can also be done in Bento. Refer to the Bento docs for specifics.
The final `outputs` section gives us flexibility on the file naming and the location where the Parquet files will be written. Remember, one file per 1,000 records.
Once run, the script will continue until stopped with `ctrl+c`.
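For reference, here is a minimal sketch of what such a pipeline file could look like. It assumes the credentials were exported as SASL_USERNAME and SASL_PASSWORD, uses a hypothetical BOOTSTRAP_BROKER variable for your cluster's bootstrap URL, and invents a three-column schema (id, name, price) for the products topic; adjust all of these to match your cluster and data. The file output is wrapped in a broker output purely so a batching policy of 1,000 records can be applied.

```yaml
input:
  kafka_franz:
    seed_brokers: [ "${BOOTSTRAP_BROKER}" ]   # illustrative variable holding your WarpStream bootstrap URL
    topics: [ "products" ]
    consumer_group: "bento_parquet"
    tls:
      enabled: true
    sasl:
      - mechanism: PLAIN
        username: "${SASL_USERNAME}"          # exported WarpStream credentials
        password: "${SASL_PASSWORD}"

output:
  # Wrapping the file output in a broker lets us attach a batching policy,
  # so each Parquet file is built from 1,000 records.
  broker:
    outputs:
      - file:
          path: 'warp/${!timestamp_unix()}-${!uuid_v4()}.parquet'
          codec: all-bytes
    batching:
      count: 1000
      processors:
        - parquet_encode:
            default_compression: zstd
            schema:                           # hypothetical columns matching the JSON in the topic
              - { name: id, type: INT64 }
              - { name: name, type: UTF8 }
              - { name: price, type: DOUBLE }
```

Save the file as myscript.yaml, the name used in the run command below.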
From the command line, Bento can be run as follows:
`bento -c myscript.yaml`
This will start generating Parquet files in a directory called `warp`, each with a timestamp and UUID in the file name.
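A generated file name might look like this (hypothetical; the exact timestamp and UUID will differ):

```
warp/1718822400-5f2b9c84-90e3-4c1e-bb1d-7a3d0e9d6c42.parquet
```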
You could then use DuckDB to query one of those as follows:
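A minimal example, assuming DuckDB is installed and using the hypothetical file name from above:

```sql
-- Query one of the generated Parquet files directly (file name is hypothetical)
SELECT *
FROM read_parquet('warp/1718822400-5f2b9c84-90e3-4c1e-bb1d-7a3d0e9d6c42.parquet')
LIMIT 10;
```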
Congratulations! You now know how to use Bento to create powerful pipelines from a WarpStream cluster to the Parquet file format.