DuckDB

DuckDB is an open-source, column-oriented, relational database management system (RDBMS) designed for analytical processing and interactive querying.

Introduction

DuckDB has no built-in way to connect to Apache Kafka-compatible services. However, Kwack, a command-line tool that loads Kafka topics into an embedded DuckDB database, provides this ability. This guide explains how to connect the two systems so that you can perform analytics on your WarpStream-managed topics.

Prerequisites

  1. Have DuckDB installed.

  2. Have Kwack installed (requires Java 11 or higher) - a quick check for the tooling is shown after this list.

  3. A WarpStream account - get access to WarpStream by registering here.

  4. A Serverless WarpStream cluster up and running with a populated topic.
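
Before moving on, you can sanity-check the local tooling. A minimal check, assuming duckdb and java are on your PATH:

# Confirm DuckDB is installed
duckdb --version

# Confirm a Java 11+ runtime is available for Kwack
java -version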

Step 1: Get your WarpStream credentials

Obtain the Bootstrap Broker from the WarpStream console by navigating to your cluster and clicking the Connect tab. If you don't have SASL credentials, you can also create a set of credentials from the console.

Save these values for the next step.
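
Optionally, stash the values in shell variables so they are close at hand when editing the configuration in the next step. The variable names below are placeholders of our choosing, not anything WarpStream or Kwack requires:

# Placeholder variable names; substitute the values from the console
export WARPSTREAM_BOOTSTRAP="<YOUR_BOOTSTRAP_BROKER>:<YOUR_PORT>"
export WARPSTREAM_SASL_USERNAME="<YOUR_SASL_USERNAME>"
export WARPSTREAM_SASL_PASSWORD="<YOUR_SASL_PASSWORD>"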

Step 2: Prepare your Kwack parameters

Kwack can accept all of the connection information, and even SQL queries, on the command line via various switches. A more easily reproducible method, however, is to use a properties file, such as the one below:

# Topics to manage
topics=topic1

# Key serdes (default is binary)
key.serdes=topic1=string

# Value serdes (default is latest)
value.serdes=topic1=json:@/mypath/topic1_schema.json

# The bootstrap servers for your Kafka cluster
bootstrap.servers=<YOUR_BOOTSTRAP_BROKER>:<YOUR_PORT>
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<YOUR_SASL_USERNAME>" password="<YOUR_SASL_PASSWORD>";

Your data can be described in various formats, either via a schema registry or a local file. For this example, we use a local schema definition in JSON format. Assuming a simple "customers" layout, the JSON schema would look something like the following:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "customerId": {
      "type": "string"
    },
    "name": {
      "type": "string"
    },
    "zone": {
      "type": "string"
    },
    "address": {
      "type": "string"
    },
    "membership": {
      "type": "string"
    }
  }
}

If you have more than one topic to connect to, separate the values with commas, as follows:

# Topics to manage
topics=topic1,topic2

# Key serdes (default is binary)
key.serdes=topic1=string,topic2=string

# Value serdes (default is latest)
value.serdes=topic1=json:@/mypath/topic1_schema.json,topic2=json:@/mypath/topic2_schema.json

# The bootstrap servers for your Kafka cluster
bootstrap.servers=<YOUR_BOOTSTRAP_BROKER>:<YOUR_PORT>
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<YOUR_SASL_USERNAME>" password="<YOUR_SASL_PASSWORD>";

Step 3: Consuming with Kwack

Kwack accepts a mixture of run-time switches and a properties file. To launch it with the properties file from Step 2, use the -F switch:

kwack -F myconfig.properties

At this point, you can run SQL queries against the active Kafka topics in WarpStream, including joins across multiple topics for analytics. The topics can also be persisted to a DuckDB database file with the -d switch:

kwack -F myconfig.properties -d mydb.duckdb
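
As a sketch of what such a query might look like: Kwack exposes each topic as a table of the same name, and a one-off query can be passed on the command line with the -q switch. The join below assumes, purely for illustration, that topic2 holds order records with a customerId column matching the customers in topic1:

# Count orders per customer zone across two topics (topic2's layout is
# hypothetical and only serves to illustrate a join)
kwack -F myconfig.properties \
  -q "SELECT c.zone, COUNT(*) AS orders
      FROM topic2 o JOIN topic1 c ON o.customerId = c.customerId
      GROUP BY c.zone"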

Next Steps

Congratulations! You can now read your WarpStream topics directly with Kwack and optionally save them as a DuckDB database. Kwack can also export your topics as Parquet files, among many other useful features.
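
For example, once mydb.duckdb exists, the DuckDB CLI can query it directly, and DuckDB's own COPY command (plain DuckDB syntax, independent of Kwack) can write a topic's table out to Parquet:

# Query the saved database with the DuckDB CLI
duckdb mydb.duckdb "SELECT COUNT(*) FROM topic1;"

# Export the topic's table to a Parquet file
duckdb mydb.duckdb "COPY topic1 TO 'topic1.parquet' (FORMAT PARQUET);"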
