Parquet

Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It forms the backbone of many data lake and table format systems.

Introduction

WarpStream is Apache Kafka compatible, and Apache Parquet is a column-oriented data file format that is very popular in the modern data stack. While there is no direct connection between Kafka and Parquet, the task can be accomplished with several third-party tools, some more complex and bloated than others. For this illustration, we will be using the open-source pipeline tool Bento. Bento is written in Go and is simple to install and use as a single binary. The pipeline scripts are written in YAML, which we will cover.

Prerequisites

  1. Have Bento installed (covered below).

  2. WarpStream account - get access to WarpStream by registering here.

  3. WarpStream credentials.

  4. A WarpStream cluster is up and running with a populated topic.

Step 1: Software Installation

Bento is the open-source pipeline tool we will use to read from WarpStream and write Parquet files. It can be installed from source, binary, or as a docker container. Visit GitHub for the best instructions for your situation.

Parquet does not need to be installed; it is a supported file format in Bento.
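
If you want a quick way to confirm Bento is available before moving on, the sketch below shows two common checks. The Docker image name is an assumption taken from the warpstreamlabs/bento GitHub project; confirm it against the project README.

# Check for an existing local install of the Bento binary.
bento --version

# Or pull the container image and run Bento from Docker instead.
# (Image name assumed from the warpstreamlabs/bento GitHub project.)
docker pull ghcr.io/warpstreamlabs/bento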

Step 2: The Bento Script

This script expects that you have exported your WarpStream credentials as environment variables. WarpStream will provide this command when you create a credential:

export BOOTSTRAP_HOST=<YOUR_BOOTSTRAP_BROKER> \
SASL_USERNAME=<YOUR_SASL_USERNAME> \
SASL_PASSWORD=<YOUR_SASL_PASSWORD>;
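
The Bento schema further down expects JSON records with productId, name, price, and inventoryCount fields. If your products topic is still empty, you can seed it with any Kafka-compatible producer; the sketch below uses kcat purely as an illustration (kcat is not part of the WarpStream or Bento tooling) and reuses the credentials exported above.

# Produce one sample JSON record to the products topic with kcat,
# authenticating the same way the Bento pipeline will.
echo '{"productId":"p-1001","name":"widget","price":9.99,"inventoryCount":42}' | \
  kcat -b "$BOOTSTRAP_HOST:9092" -t products -P \
    -X security.protocol=SASL_SSL \
    -X sasl.mechanisms=PLAIN \
    -X "sasl.username=$SASL_USERNAME" \
    -X "sasl.password=$SASL_PASSWORD"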

The Bento script below will perform the following actions:

  • Configure the connection information to WarpStream.

  • Set the consumer group to bento_parquet.

  • Read from the topic products.

  • The output segment is where we will set up everything about the Parquet output.

  • The batching segment gives you granular control over how many and how quickly you write the Parquet files. Because Parquet is columnar, you can't append to it, so we're telling Bento to create Parquet files with 1,000 records each. Learn more about batching in the Bento documentation.

  • Under processors we have parquet_encode. We'll do a quick run-through of the settings here, but you can learn more in the Bento documentation.

  • The schema segment defines the table layout we will get from the JSON records in the topic. We have an exact match in this example, which makes it simple. More complex examples and mutations can also be done in Bento. Refer to the Bento docs for specifics.

  • The final outputs section gives us flexibility on the file naming and location where the Parquet files will be written. Remember, one file per 1,000 records.

  • Once run, the script will continue until stopped with ctrl+c.

input:
  kafka_franz:
    consumer_group: bento_parquet
    commit_period: 1s
    start_from_oldest: true
    tls:
      enabled: true
    sasl:
      - mechanism: PLAIN
        username: "${SASL_USERNAME}"
        password: "${SASL_PASSWORD}"
    seed_brokers:
      - <YOUR_WARPSTREAM_BOOTSTRAP_URL>:9092
    topics:
      - products

output:
  broker:
    pattern: fan_out
    batching:
      count: 1000
      processors:
        - parquet_encode:
            default_compression: zstd
            default_encoding: PLAIN
            schema:
              - name: productId
                type: BYTE_ARRAY
              - name: name
                type: BYTE_ARRAY
              - name: price
                type: DOUBLE
              - name: inventoryCount
                type: INT64
    outputs:
      - file:
          path: 'warp/${! timestamp_unix() }-${! uuid_v4() }.parquet'
          codec: all-bytes
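
Before running the pipeline end to end, you can optionally sanity-check the config file. This assumes the Bento CLI kept the lint subcommand from upstream Benthos; verify with bento --help if unsure.

# Lint the pipeline definition before running it (subcommand assumed to
# match upstream Benthos).
bento lint myscript.yaml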

Step 3: Run the app

From the command line, Bento can be run as follows:

bento -c myscript.yaml

This will start generating Parquet files in a directory called warp with a timestamp and UUID in the file name, like:

1724970577-71fde68c-7a04-45c1-8b4f-6a7461060ef7.parquet 
1724970589-3da21d72-a2e0-4c4d-95b0-329bd0912621.parquet
1724970577-97967e71-cc23-44b0-9f61-c2873f0b1c98.parquet 
1724970589-71ec2ec2-23ba-4b43-b5b6-8ee14aed3df2.parquet

You could then use DuckDB to query one of those as follows:

SELECT count(*) FROM '1724970589-71ec2ec2-23ba-4b43-b5b6-8ee14aed3df2.parquet';
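
DuckDB can also scan every Parquet file the pipeline has written so far in a single query by globbing the warp directory. A quick shell one-liner, assuming the duckdb CLI is installed:

# Count rows across all Parquet files in the output directory using
# DuckDB's glob support.
duckdb -c "SELECT count(*) FROM 'warp/*.parquet';"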

Next Steps

Congratulations! You now know how to use Bento to create powerful pipelines from a WarpStream cluster to the Parquet file format.
