Databricks Unity

This page describes how to integrate Tableflow with Databricks so that you can query Iceberg tables created by WarpStream in Databricks.

Integration Context & Limitations

To query Iceberg tables managed by Tableflow with Databricks, catalog federation is used. With this approach, Unity Catalog populates a foreign catalog by crawling the external catalog. This allows Unity Catalog to act as a governance layer while the actual metadata remains managed by an external provider. Currently, federation in Unity Catalog is limited to a small set of catalogs, such as AWS Glue, Hive Metastore, and Snowflake Horizon, and does not support generic Iceberg REST endpoints. This guide therefore walks through setting up WarpStream Tableflow for Databricks Unity Catalog with AWS Glue as the integration path.

To enable access to your data, we use AWS Glue as an intermediate catalog. Your table metadata is synced to AWS Glue, which is then used to populate a foreign catalog in Databricks.


If your architecture requires connecting via a different supported catalog (e.g., syncing Tableflow to Snowflake and connecting that to Databricks), please reach out to us for assistance.

Schema Limitations & Workarounds

Unity Catalog has a known limitation regarding Iceberg schemas: it does not support NOT NULL constraints nested within arrays or maps.

If your schema contains these fields, queries may fail with the error:

[DELTA_NESTED_NOT_NULL_CONSTRAINT] Delta does not support NOT NULL constraints nested within arrays or maps.

Workarounds:

  • Option A: Modify Schema

    Update your Iceberg schema to make nested fields optional (nullable). This allows the table to be queried using standard Databricks "SQL Warehouse" compute.

  • Option B: Use Cluster Compute

    If you cannot modify the schema, you must use "Cluster Compute" (SQL Warehouses are not supported for this config) and enable the following Spark configuration to suppress the error:

spark.databricks.delta.constraints.allowUnenforcedNotNull.enabled = true

Prerequisites

Before starting, make sure you have:

  • Completed the AWS Glue integration setup so your WarpStream Tableflow tables are available in AWS Glue.

  • A Databricks workspace with Unity Catalog enabled.

  • Databricks Runtime 16.2 or above for Iceberg table support (currently in Public Preview).

  • A Pro or Serverless SQL Warehouse (other warehouse types are not supported).

  • The following privileges on the Unity Catalog metastore (metastore admins have these by default):

    • CREATE SERVICE CREDENTIAL

    • CREATE CONNECTION

    • CREATE EXTERNAL LOCATION

    • CREATE CATALOG

Integrate via AWS Glue

0. Set Up AWS Glue Integration

Before setting up the Databricks integration, you must complete the AWS Glue integration steps so that your WarpStream Tableflow tables are available in AWS Glue.

1. Create a Service Credential for AWS Glue Access

Databricks needs access to the AWS Glue API to crawl catalog metadata. This requires creating an IAM role with Glue-specific permissions and registering it as a Service Credential in Databricks.

Create the IAM Role

Create an IAM role that Databricks can assume, with the following policy. Scope the permissions to your specific Tableflow Glue database to avoid federating your entire Glue catalog:
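A sketch of such a policy follows. The placeholders are explained below; the exact set of Glue actions Databricks requires is an assumption here and should be verified against the Databricks catalog federation documentation:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions"
      ],
      "Resource": [
        "arn:aws:glue:<region>:<account-id>:catalog",
        "arn:aws:glue:<region>:<account-id>:database/default",
        "arn:aws:glue:<region>:<account-id>:database/<your-tableflow-database>",
        "arn:aws:glue:<region>:<account-id>:table/<your-tableflow-database>/*"
      ]
    }
  ]
}
```

The role's trust policy must additionally allow Databricks to assume the role and must include the sts:AssumeRole self-reference to <this-role-name>.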

Replace the placeholders:

  • <region> — the AWS region where your Glue catalog lives (e.g. us-east-1).

  • <account-id> — your AWS account ID.

  • <your-tableflow-database> — the database_name configured in your WarpStream Tableflow AWS Glue configuration.

  • <this-role-name> — the name of this IAM role (required for the sts:AssumeRole self-reference).

Key details about resource scoping:

  • The default database ARN must be included or Databricks will return an error.

  • Use /* wildcard for tables if you have multiple tables in the database.


Register the Service Credential in Databricks

Once the IAM role is created, register it as a service credential in Databricks.

2. Create the Glue Connection

Create a connection object within Databricks to link to your AWS Glue environment. When creating the connection, use the following values:

  • Connection type: Hive Metastore

  • Metastore type: AWS Glue

  • Credential: Select the service credential created in Step 1

  • AWS Account ID: The AWS account ID where your Glue catalog is (same account used in the Glue setup)

  • AWS Region: The region where your Glue catalog is
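If you prefer SQL over the Catalog Explorer UI, the connection can be sketched roughly as follows. The connection name is hypothetical, and the option names are assumptions mirroring the UI fields above, so verify them against the Databricks catalog federation documentation:

```sql
CREATE CONNECTION IF NOT EXISTS tableflow_glue_connection TYPE hive_metastore
OPTIONS (
  -- Option names below are assumptions mirroring the UI fields
  aws_account_id '<account-id>',
  aws_region '<region>',
  credential '<service-credential-name>'
);
```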


If you are using the Databricks Catalog Explorer UI, the connection wizard can also create the foreign catalog (Step 4) in the same flow, combining the two steps into one.

3. Create a Storage Credential and External Location for S3 Access

Databricks also needs access to S3 to read the actual Iceberg data files. This requires a separate IAM role (distinct from the service credential in Step 1) with S3 read permissions, registered as a Storage Credential in Databricks. You then create an External Location that points to the S3 bucket where Tableflow stores its data.
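Once the storage credential is registered, the external location can be created with standard Databricks SQL. The credential and location names below are hypothetical:

```sql
CREATE EXTERNAL LOCATION IF NOT EXISTS tableflow_data
URL 's3://your-tableflow-bucket'
WITH (STORAGE CREDENTIAL tableflow_storage_credential);
```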

The IAM role for the storage credential needs the following S3 permissions on your Tableflow bucket:
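A minimal read-only policy sketch, assuming your Tableflow bucket is named <your-tableflow-bucket>:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::<your-tableflow-bucket>",
        "arn:aws:s3:::<your-tableflow-bucket>/*"
      ]
    }
  ]
}
```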

4. Create the Foreign Catalog


Create the foreign catalog to mount the Glue database. When creating the catalog:

  • Set the Storage location to the S3 path where Tableflow stores data (e.g. s3://your-tableflow-bucket).

  • Set the Authorized paths to match your S3 bucket path. Tables outside these paths won't be queryable. This should be the same bucket path used in the external location from Step 3.
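As a sketch, the foreign catalog can also be created in SQL. The catalog name is hypothetical, <connection-name> is the connection from Step 2, and the authorized_paths option name is an assumption to verify against the Databricks federation documentation:

```sql
CREATE FOREIGN CATALOG IF NOT EXISTS tableflow_catalog
USING CONNECTION <connection-name>
OPTIONS (
  authorized_paths 's3://your-tableflow-bucket'
);
```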

5. Query the Data

Once the catalog is created, your WarpStream tables will automatically appear in the Databricks UI (Catalog Explorer). You can now query them using standard SQL, Notebooks, or BI tools just like any other native table.

To reference a table in your queries, use the full three-level namespace:
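For example, assuming a hypothetical foreign catalog named tableflow_catalog, with the database and table names from your Tableflow configuration:

```sql
SELECT * FROM tableflow_catalog.<your-tableflow-database>.<table-name> LIMIT 10;
```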
