# Databricks Unity Catalog

## Integration Context & Limitations

To query Iceberg tables managed by Tableflow with Databricks, [catalog federation](https://docs.databricks.com/aws/en/query-federation/catalog-federation) is used. With this approach, Unity Catalog populates a foreign catalog by crawling the external catalog. This allows Unity Catalog to act as a governance layer while the actual metadata remains managed by an external provider. Currently, federation in Unity is limited to a small set of catalogs, such as AWS Glue, Hive Metastore, and Snowflake Horizon, and does not support generic Iceberg REST endpoints. As such, this guide walks through setting up WarpStream Tableflow for Databricks Unity Catalog with AWS Glue as the integration path.

To enable access to your data, we use AWS Glue as an intermediate catalog. Your table metadata is synced to AWS Glue, which is then used to populate a foreign catalog in Databricks.

{% hint style="info" %}
If your architecture requires connecting via a different supported catalog (e.g., syncing Tableflow to Snowflake and connecting that to Databricks), please reach out to us for assistance.
{% endhint %}

## Schema Limitations & Workarounds

Unity Catalog has a known limitation regarding Iceberg schemas: it does not support `NOT NULL` constraints nested within arrays or maps.

If your schema contains these fields, queries may fail with the error:

`[DELTA_NESTED_NOT_NULL_CONSTRAINT] Delta does not support NOT NULL constraints nested within arrays or maps.`

Workarounds:

* **Option A: Modify Schema**

  Update your Iceberg schema to make nested fields optional (nullable). This allows the table to be queried using standard Databricks "SQL Warehouse" compute.
* **Option B: Use Cluster Compute**

  If you cannot modify the schema, you must use "Cluster Compute" (SQL Warehouses are not supported for this config) and enable the following Spark configuration to suppress the error:

```ini
spark.databricks.delta.constraints.allowUnenforcedNotNull.enabled = true
```
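On cluster compute, the same configuration can also be applied per session rather than at the cluster level; a minimal sketch in Spark SQL, using the config key from above:

```sql
-- Suppress the nested NOT NULL constraint error for the current session only
SET spark.databricks.delta.constraints.allowUnenforcedNotNull.enabled = true;
```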

## Prerequisites

Before starting, make sure you have:

* Completed the [AWS Glue integration setup](https://docs.warpstream.com/warpstream/tableflow/catalogs-and-query-engines/aws-glue) so your WarpStream Tableflow tables are available in AWS Glue.
* A Databricks workspace with **Unity Catalog enabled**.
* **Databricks Runtime 16.2 or above** for Iceberg table support (currently in Public Preview).
* SQL Warehouses must be **Pro** or **Serverless**.
* The following privileges on the Unity Catalog metastore (metastore admins have these by default):
  * `CREATE SERVICE CREDENTIAL`
  * `CREATE CONNECTION`
  * `CREATE EXTERNAL LOCATION`
  * `CREATE CATALOG`
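If you are not a metastore admin, an admin can grant these privileges directly. A hedged sketch in Databricks SQL (the principal `user@example.com` is a placeholder):

```sql
GRANT CREATE SERVICE CREDENTIAL ON METASTORE TO `user@example.com`;
GRANT CREATE CONNECTION ON METASTORE TO `user@example.com`;
GRANT CREATE EXTERNAL LOCATION ON METASTORE TO `user@example.com`;
GRANT CREATE CATALOG ON METASTORE TO `user@example.com`;
```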

## Integrate via AWS Glue

### 0. Set Up AWS Glue Integration

Before setting up the Databricks integration, you must follow the steps at [aws-glue](https://docs.warpstream.com/warpstream/tableflow/catalogs-and-query-engines/aws-glue "mention") to have your WarpStream Tableflow tables available in AWS Glue.

### 1. Create a Service Credential for AWS Glue Access

Databricks needs access to the **AWS Glue API** to crawl catalog metadata. This requires creating an IAM role with Glue-specific permissions and registering it as a **Service Credential** in Databricks.

**Create the IAM Role**

Create an IAM role that Databricks can assume, with the following policy. Scope the permissions to your specific Tableflow Glue database to avoid federating your entire Glue catalog:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartitions"
      ],
      "Resource": [
        "arn:aws:glue:<region>:<account-id>:catalog",
        "arn:aws:glue:<region>:<account-id>:database/default",
        "arn:aws:glue:<region>:<account-id>:database/<your-tableflow-database>",
        "arn:aws:glue:<region>:<account-id>:table/<your-tableflow-database>/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "sts:AssumeRole"
      ],
      "Resource": [
        "arn:aws:iam::<account-id>:role/<this-role-name>"
      ]
    }
  ]
}
```

Replace the placeholders:

* `<region>` — the AWS region where your Glue catalog lives (e.g. `us-east-1`).
* `<account-id>` — your AWS account ID.
* `<your-tableflow-database>` — the `database_name` configured in your WarpStream Tableflow AWS Glue configuration.
* `<this-role-name>` — the name of this IAM role (required for the `sts:AssumeRole` self-reference).

Key details about resource scoping:

* The `default` database ARN **must** be included or Databricks will return an error.
* Use `/*` wildcard for tables if you have multiple tables in the database.
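The placeholder substitution above is mechanical but easy to get wrong by hand. A minimal Python sketch that renders and validates the policy (the region, account ID, database, and role name used in the example call are hypothetical):

```python
import json
from string import Template

# IAM policy template with $-placeholders mirroring the ones above.
POLICY_TEMPLATE = Template(json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase", "glue:GetDatabases",
                "glue:GetTable", "glue:GetTables", "glue:GetPartitions",
            ],
            "Resource": [
                "arn:aws:glue:${region}:${account_id}:catalog",
                "arn:aws:glue:${region}:${account_id}:database/default",
                "arn:aws:glue:${region}:${account_id}:database/${database}",
                "arn:aws:glue:${region}:${account_id}:table/${database}/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["sts:AssumeRole"],
            "Resource": ["arn:aws:iam::${account_id}:role/${role_name}"],
        },
    ],
}))

def render_policy(region: str, account_id: str, database: str, role_name: str) -> dict:
    """Fill in the placeholders and return the policy as a dict."""
    rendered = POLICY_TEMPLATE.substitute(
        region=region, account_id=account_id,
        database=database, role_name=role_name,
    )
    return json.loads(rendered)  # round-trip proves the result is valid JSON

# Hypothetical values for illustration only.
policy = render_policy("us-east-1", "123456789012", "tableflow_db", "databricks-glue-role")
print(json.dumps(policy, indent=2))
```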

{% hint style="warning" %}
If you have other databases and tables in your AWS Glue catalog, make sure to scope the IAM policy to only the Tableflow database. Otherwise, Databricks will attempt to federate **all** databases and tables in your Glue catalog, which can cause errors or timeouts.
{% endhint %}

**Register the Service Credential in Databricks**

Once the IAM role is created, register it as a service credential in Databricks.

* **Instructions:** Follow the [Databricks Guide: Create service credentials](https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-services/service-credentials).

### 2. Create the Glue Connection

Create a connection object within Databricks to link to your AWS Glue environment. When creating the connection, use the following values:

* **Connection type:** Hive Metastore
* **Metastore type:** AWS Glue
* **Credential:** Select the service credential created in Step 1
* **AWS Account ID:** The AWS account ID where your Glue catalog is (same account used in the Glue setup)
* **AWS Region:** The region where your Glue catalog is
* **Instructions:** Follow the [Databricks Guide: Create the connection](https://docs.databricks.com/aws/en/query-federation/hms-federation-glue#create-the-connection).
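If you prefer SQL over the UI, the connection can also be created with `CREATE CONNECTION`. The sketch below is an assumption-laden illustration: the connection name, credential name, region, and account ID are placeholders, and the exact option keys may differ by workspace version, so verify them against the linked guide:

```sql
CREATE CONNECTION IF NOT EXISTS tableflow_glue_connection TYPE hive_metastore
OPTIONS (
  aws_region '<region>',               -- region of your Glue catalog
  aws_account_id '<account-id>',       -- account that owns the Glue catalog
  credential 'glue_service_credential' -- service credential from Step 1
);
```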

{% hint style="info" %}
If you are using the Databricks Catalog Explorer UI, the connection wizard can also create the foreign catalog (Step 4) in the same flow, combining the two steps into one.
{% endhint %}

### 3. Create a Storage Credential and External Location for S3 Access

Databricks also needs access to **S3** to read the actual Iceberg data files. This requires a **separate** IAM role (distinct from the service credential in Step 1) with S3 read permissions, registered as a **Storage Credential** in Databricks. You then create an **External Location** that points to the S3 bucket where Tableflow stores its data.

The IAM role for the storage credential needs the following S3 permissions on your Tableflow bucket:

* `s3:GetObject`
* `s3:ListBucket`
* `s3:GetBucketLocation`
* **Instructions:** Follow the [Databricks Guide: Create storage credential and external location](https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/s3/s3-external-location-manual).
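A minimal IAM policy for the storage-credential role might look like the following (the bucket name is a placeholder; note that `s3:ListBucket` and `s3:GetBucketLocation` apply to the bucket ARN, while `s3:GetObject` applies to the objects under it):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::your-tableflow-bucket",
        "arn:aws:s3:::your-tableflow-bucket/*"
      ]
    }
  ]
}
```

The role also needs a trust policy allowing Databricks to assume it; the linked guide covers that part.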

### 4. Create the Foreign Catalog

{% hint style="danger" %}
**Critical:** When creating the catalog, you **must** set the **Storage location** field to your S3 bucket path (e.g., `s3://your-tableflow-bucket`). If this is omitted, Databricks will fail to read the Iceberg data.
{% endhint %}

Create the foreign catalog to mount the Glue database. When creating the catalog:

* Set the **Storage location** to the S3 path where Tableflow stores data (e.g. `s3://your-tableflow-bucket`).
* Set the **Authorized paths** to match your S3 bucket path. Tables outside these paths won't be queryable. This should be the same bucket path used in the external location from Step 3.
* **Instructions:** Follow the [Databricks Guide: Create a foreign catalog](https://docs.databricks.com/aws/en/query-federation/hms-federation-glue#step-3-create-a-foreign-catalog).
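In SQL, the catalog creation looks roughly like this (the catalog and connection names are placeholders, and `authorized_paths` is assumed here to be the option that restricts queryable paths; depending on your workspace version, the storage location may need to be set via the UI instead):

```sql
CREATE FOREIGN CATALOG IF NOT EXISTS tableflow_catalog
USING CONNECTION tableflow_glue_connection
OPTIONS (authorized_paths 's3://your-tableflow-bucket/');
```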

### 5. Query the Data

Once the catalog is created, your WarpStream tables will automatically appear in the Databricks UI (Catalog Explorer). You can now query them using standard SQL, Notebooks, or BI tools just like any other native table.

To reference a table in your queries, use the full three-level namespace:

```sql
SELECT * FROM <catalog_name>.<glue_database_name>.<table_name>;
```
