# Databricks Unity Catalog

## Integration Context & Limitations

To query Iceberg tables managed by Tableflow with Databricks, [catalog federation](https://docs.databricks.com/aws/en/query-federation/catalog-federation) is used. With this approach, Unity Catalog populates a foreign catalog by crawling the external catalog. This allows Unity Catalog to act as a governance layer while the actual metadata remains managed by an external provider. Currently, federation in Unity is limited to a small set of catalogs, such as AWS Glue, Hive Metastore, and Snowflake Horizon, and does not support generic Iceberg REST endpoints. As such, this guide walks through setting up WarpStream Tableflow for Databricks Unity Catalog with AWS Glue as the integration path.

To enable access to your data, we use AWS Glue as an intermediate catalog. Your table metadata is synced to AWS Glue, which is then used to populate a foreign catalog in Databricks.

{% hint style="info" %}
If your architecture requires connecting via a different supported catalog (e.g., syncing Tableflow to Snowflake and connecting that to Databricks), please reach out to us for assistance.
{% endhint %}

## Schema Limitations & Workarounds

Unity Catalog has a known limitation regarding Iceberg schemas: it does not support `NOT NULL` constraints nested within arrays or maps.

If your schema contains these fields, queries may fail with the error:

`[DELTA_NESTED_NOT_NULL_CONSTRAINT] Delta does not support NOT NULL constraints nested within arrays or maps.`

Workarounds:

* **Option A: Modify Schema**

  Update your Iceberg schema to make nested fields optional (nullable). This allows the table to be queried using standard Databricks "SQL Warehouse" compute.
* **Option B: Use Cluster Compute**

  If you cannot modify the schema, you must use "Cluster Compute" (SQL Warehouses are not supported for this config) and enable the following Spark configuration to suppress the error:

```ini
spark.databricks.delta.constraints.allowUnenforcedNotNull.enabled = true
```
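On cluster compute, the same configuration can also be applied per session rather than at the cluster level; a minimal sketch in Spark SQL, using the config key from above:

```sql
-- Suppress the nested NOT NULL constraint error for the current session only
SET spark.databricks.delta.constraints.allowUnenforcedNotNull.enabled = true;
```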

## Prerequisites

Before starting, make sure you have:

* Completed the [AWS Glue integration setup](https://docs.warpstream.com/warpstream/tableflow/catalogs-and-query-engines/aws-glue) so your WarpStream Tableflow tables are available in AWS Glue.
* A Databricks workspace with **Unity Catalog enabled**.
* **Databricks Runtime 16.2 or above** for Iceberg table support (currently in Public Preview).
* SQL Warehouses must be **Pro** or **Serverless**.
* The following privileges on the Unity Catalog metastore (metastore admins have these by default):
  * `CREATE SERVICE CREDENTIAL`
  * `CREATE CONNECTION`
  * `CREATE EXTERNAL LOCATION`
  * `CREATE CATALOG`
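If you are not a metastore admin, an admin can grant these privileges directly. A hedged sketch in Databricks SQL (the principal `user@example.com` is a placeholder):

```sql
GRANT CREATE SERVICE CREDENTIAL ON METASTORE TO `user@example.com`;
GRANT CREATE CONNECTION ON METASTORE TO `user@example.com`;
GRANT CREATE EXTERNAL LOCATION ON METASTORE TO `user@example.com`;
GRANT CREATE CATALOG ON METASTORE TO `user@example.com`;
```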

## Integrate via AWS Glue

### 0. Set Up AWS Glue Integration

Before setting up the Databricks integration, you must follow the steps at [aws-glue](https://docs.warpstream.com/warpstream/tableflow/catalogs-and-query-engines/aws-glue "mention") to have your WarpStream Tableflow tables available in AWS Glue.

### 1. Create a Service Credential for AWS Glue Access

Databricks needs access to the **AWS Glue API** to crawl catalog metadata. This requires creating an IAM role with Glue-specific permissions and registering it as a **Service Credential** in Databricks.

**Create the IAM Role**

Create an IAM role that Databricks can assume, with the following policy. Scope the permissions to your specific Tableflow Glue database to avoid federating your entire Glue catalog:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartitions"
      ],
      "Resource": [
        "arn:aws:glue:<region>:<account-id>:catalog",
        "arn:aws:glue:<region>:<account-id>:database/default",
        "arn:aws:glue:<region>:<account-id>:database/<your-tableflow-database>",
        "arn:aws:glue:<region>:<account-id>:table/<your-tableflow-database>/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "sts:AssumeRole"
      ],
      "Resource": [
        "arn:aws:iam::<account-id>:role/<this-role-name>"
      ]
    }
  ]
}
```

Replace the placeholders:

* `<region>` — the AWS region where your Glue catalog lives (e.g. `us-east-1`).
* `<account-id>` — your AWS account ID.
* `<your-tableflow-database>` — the `database_name` configured in your WarpStream Tableflow AWS Glue configuration.
* `<this-role-name>` — the name of this IAM role (required for the `sts:AssumeRole` self-reference).

Key details about resource scoping:

* The `default` database ARN **must** be included or Databricks will return an error.
* Use `/*` wildcard for tables if you have multiple tables in the database.
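The placeholder substitution above is mechanical but easy to get wrong by hand. A minimal Python sketch that renders and validates the policy (the region, account ID, database, and role name used in the example call are hypothetical):

```python
import json
from string import Template

# IAM policy template with $-placeholders mirroring the ones above.
POLICY_TEMPLATE = Template(json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase", "glue:GetDatabases",
                "glue:GetTable", "glue:GetTables", "glue:GetPartitions",
            ],
            "Resource": [
                "arn:aws:glue:${region}:${account_id}:catalog",
                "arn:aws:glue:${region}:${account_id}:database/default",
                "arn:aws:glue:${region}:${account_id}:database/${database}",
                "arn:aws:glue:${region}:${account_id}:table/${database}/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["sts:AssumeRole"],
            "Resource": ["arn:aws:iam::${account_id}:role/${role_name}"],
        },
    ],
}))

def render_policy(region: str, account_id: str, database: str, role_name: str) -> dict:
    """Fill in the placeholders and return the policy as a dict."""
    rendered = POLICY_TEMPLATE.substitute(
        region=region, account_id=account_id,
        database=database, role_name=role_name,
    )
    return json.loads(rendered)  # round-trip proves the result is valid JSON

# Hypothetical values for illustration only.
policy = render_policy("us-east-1", "123456789012", "tableflow_db", "databricks-glue-role")
print(json.dumps(policy, indent=2))
```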

{% hint style="warning" %}
If you have other databases and tables in your AWS Glue catalog, make sure to scope the IAM policy to only the Tableflow database. Otherwise, Databricks will attempt to federate **all** databases and tables in your Glue catalog, which can cause errors or timeouts.
{% endhint %}

**Register the Service Credential in Databricks**

Once the IAM role is created, register it as a service credential in Databricks.

* **Instructions:** Follow the [Databricks Guide: Create service credentials](https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-services/service-credentials).

### 2. Create the Glue Connection

Create a connection object within Databricks to link to your AWS Glue environment. When creating the connection, use the following values:

* **Connection type:** Hive Metastore
* **Metastore type:** AWS Glue
* **Credential:** Select the service credential created in Step 1
* **AWS Account ID:** The AWS account ID where your Glue catalog is (same account used in the Glue setup)
* **AWS Region:** The region where your Glue catalog is
* **Instructions:** Follow the [Databricks Guide: Create the connection](https://docs.databricks.com/aws/en/query-federation/hms-federation-glue#create-the-connection).
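If you prefer SQL over the UI, the connection can also be created with `CREATE CONNECTION`. The sketch below is an assumption-laden illustration: the connection name, credential name, region, and account ID are placeholders, and the exact option keys may differ by workspace version, so verify them against the linked guide:

```sql
CREATE CONNECTION IF NOT EXISTS tableflow_glue_connection TYPE hive_metastore
OPTIONS (
  aws_region '<region>',               -- region of your Glue catalog
  aws_account_id '<account-id>',       -- account that owns the Glue catalog
  credential 'glue_service_credential' -- service credential from Step 1
);
```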

{% hint style="info" %}
If you are using the Databricks Catalog Explorer UI, the connection wizard can also create the foreign catalog (Step 4) in the same flow, combining the two steps into one.
{% endhint %}

### 3. Create a Storage Credential and External Location for S3 Access

Databricks also needs access to **S3** to read the actual Iceberg data files. This requires a **separate** IAM role (distinct from the service credential in Step 1) with S3 read permissions, registered as a **Storage Credential** in Databricks. You then create an **External Location** that points to the S3 bucket where Tableflow stores its data.

The IAM role for the storage credential needs the following S3 permissions on your Tableflow bucket:

* `s3:GetObject`
* `s3:ListBucket`
* `s3:GetBucketLocation`
* **Instructions:** Follow the [Databricks Guide: Create storage credential and external location](https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/s3/s3-external-location-manual).
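A minimal IAM policy for the storage-credential role might look like the following (the bucket name is a placeholder; note that `s3:ListBucket` and `s3:GetBucketLocation` apply to the bucket ARN, while `s3:GetObject` applies to the objects under it):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::your-tableflow-bucket",
        "arn:aws:s3:::your-tableflow-bucket/*"
      ]
    }
  ]
}
```

The role also needs a trust policy allowing Databricks to assume it; the linked guide covers that part.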

### 4. Create the Foreign Catalog

{% hint style="danger" %}
**Critical:** When creating the catalog, you **must** set the **Storage location** field to your S3 bucket path (e.g., `s3://your-tableflow-bucket`). If this is omitted, Databricks will fail to read the Iceberg data.
{% endhint %}

Create the foreign catalog to mount the Glue database. When creating the catalog:

* Set the **Storage location** to the S3 path where Tableflow stores data (e.g. `s3://your-tableflow-bucket`).
* Set the **Authorized paths** to match your S3 bucket path. Tables outside these paths won't be queryable. This should be the same bucket path used in the external location from Step 3.
* **Instructions:** Follow the [Databricks Guide: Create a foreign catalog](https://docs.databricks.com/aws/en/query-federation/hms-federation-glue#step-3-create-a-foreign-catalog).
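In SQL, the catalog creation looks roughly like this (the catalog and connection names are placeholders, and `authorized_paths` is assumed here to be the option that restricts queryable paths; depending on your workspace version, the storage location may need to be set via the UI instead):

```sql
CREATE FOREIGN CATALOG IF NOT EXISTS tableflow_catalog
USING CONNECTION tableflow_glue_connection
OPTIONS (authorized_paths 's3://your-tableflow-bucket/');
```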

### 5. Query the Data

Once the catalog is created, your WarpStream tables will automatically appear in the Databricks UI (Catalog Explorer). You can now query them using standard SQL, Notebooks, or BI tools just like any other native table.

To reference a table in your queries, use the full three-level namespace:

```sql
SELECT * FROM <catalog_name>.<glue_database_name>.<table_name>;
```
