# Monitoring Tableflow

Tableflow Agents expose a Prometheus endpoint on the internal port (default `8080`). Lag metrics and resource counts are also available on the [hosted Prometheus endpoint](https://docs.warpstream.com/warpstream/agent-setup/monitor-the-warpstream-agents/hosted-prometheus-endpoint) without needing to scrape agents directly. All metrics use the `warpstream_` prefix. If you are using Datadog with Agent `v679` or later, the prefix is `warpstream.` instead (dot, not underscore). See [Set up Monitoring](https://docs.warpstream.com/warpstream/agent-setup/monitor-the-warpstream-agents) for general setup.

## Ingestion Lag

### Metrics

| Metric                                       | Unit    | Description                                                                                                                          | Min version |
| -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------ | ----------- |
| `warpstream_tableflow_query_lag_seconds`     | seconds | End-to-end delay until data is queryable: ingestion lag + Iceberg catalog sync staleness. **This is the primary metric to monitor.** | v778+       |
| `warpstream_tableflow_ingestion_lag_seconds` | seconds | Time since the last ingested record, per table. **0** when caught up.                                                                | v778+       |
| `warpstream_tableflow_partition_offset_lag`  | records | Number of records not yet ingested per table (high watermark minus last ingested offset).                                            | v776+       |

All lag metrics are tagged by `virtual_cluster_id`, `topic`, `table.name`, and `table.uuid`. On the hosted Prometheus endpoint, tags use underscores (`table_name`, `table_uuid`) and an additional `is_dlq` tag distinguishes between the main ingestion pipeline (`is_dlq="false"`) and the [DLQ replay](#dead-letter-queue-dlq) pipeline (`is_dlq="true"`). Ingestion lag is also visible in the WarpStream Console.
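For example, on the hosted Prometheus endpoint, a query like the following surfaces the worst-case freshness per table on the main ingestion pipeline (a sketch using the hosted-endpoint tag names; self-hosted scrapes use `table.name`/`table.uuid` instead):

```promql
# Worst-case end-to-end freshness per table, main pipeline only
# (is_dlq="false" excludes the DLQ replay pipeline).
max by (table_name) (
  warpstream_tableflow_query_lag_seconds{is_dlq="false"}
)
```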

{% hint style="info" %}
`warpstream_tableflow_ingestion_lag_seconds` was previously named `warpstream_tableflow_partition_time_lag_seconds` (available since v710). The old name is deprecated starting in v778.
{% endhint %}

### How to interpret these metrics

**Query lag** (`warpstream_tableflow_query_lag_seconds`) is the metric you should use to understand how fresh the data is when you query your Iceberg tables. It captures the full delay from when data is produced to Kafka until it becomes visible to query engines like Spark, Trino, or DuckDB. This delay has two components: the ingestion itself and the Iceberg catalog sync.

**Ingestion lag** (`warpstream_tableflow_ingestion_lag_seconds`) helps you break down where the delay is coming from:

* If **query lag is high but ingestion lag is low**, the bottleneck is the catalog sync — data has been ingested into Iceberg files but the catalog metadata has not been refreshed yet.
* If **ingestion lag is high**, it means the ingestion pipeline itself is falling behind. This is typically caused by one of two things:
  * **An error** — check your cluster's [diagnostics](https://docs.warpstream.com/warpstream/agent-setup/monitor-the-warpstream-agents/diagnostics) for any failing health checks, and look for warnings or errors in [events](https://docs.warpstream.com/warpstream/reference/events) (`tableflow_logs`) to identify the root cause.
  * **Insufficient capacity** — the agents cannot keep up with the produce rate on the source topics. Add more Tableflow Agents to increase ingestion throughput.
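Since query lag is ingestion lag plus catalog sync staleness, you can estimate the catalog-sync component by subtracting one series from the other (a sketch, assuming the hosted-endpoint tag names):

```promql
# Approximate catalog sync staleness per table:
# query lag minus ingestion lag for the main (non-DLQ) pipeline.
  max by (table_name) (warpstream_tableflow_query_lag_seconds{is_dlq="false"})
- on (table_name)
  max by (table_name) (warpstream_tableflow_ingestion_lag_seconds{is_dlq="false"})
```

A result near zero means the catalog is keeping up and any delay is in ingestion itself.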

{% hint style="warning" %}
When ingestion lag is high, make sure your **source Kafka topic retention** is large enough so that records are not deleted before they can be ingested. If retention expires while the pipeline is behind, data will be permanently lost.
{% endhint %}

### Recommended alerts

* Alert on `warpstream_tableflow_query_lag_seconds` sustained above your freshness SLA (e.g., 600s).
* Alert on `warpstream_tableflow_ingestion_lag_seconds` sustained above a threshold (e.g., 300s) to catch pipeline issues early, before they affect queryability.
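The two alerts above could be expressed as Prometheus alerting rules along these lines (an illustrative sketch; the rule names, `for` durations, and severity labels are examples, and the thresholds should match your own freshness SLA):

```yaml
groups:
  - name: tableflow-lag
    rules:
      - alert: TableflowQueryLagHigh
        expr: max by (table_name) (warpstream_tableflow_query_lag_seconds{is_dlq="false"}) > 600
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Tableflow table {{ $labels.table_name }} is behind its freshness SLA"
      - alert: TableflowIngestionLagHigh
        expr: max by (table_name) (warpstream_tableflow_ingestion_lag_seconds{is_dlq="false"}) > 300
        for: 10m
        labels:
          severity: warn
        annotations:
          summary: "Tableflow ingestion for {{ $labels.table_name }} is falling behind"
```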

## Dead Letter Queue (DLQ)

When a table is configured with a [DLQ mode](https://docs.warpstream.com/warpstream/tableflow#dead-letter-queue-dlq-mode) (`stop`, `skip`, or `keep`), invalid records are handled according to that mode: `stop` halts ingestion, `skip` drops the record, and `keep` routes it to the DLQ for later replay.

### Metrics

| Metric                                     | Unit    | Description                                                                                                 |
| ------------------------------------------ | ------- | ----------------------------------------------------------------------------------------------------------- |
| `warpstream_tableflow_dlq_records_counter` | records | Number of records handled by the DLQ during ingestion, tagged by `topic` and `strategy` (`skip` or `keep`). |

Use diagnostics and events for additional DLQ monitoring:

* Check your cluster's [diagnostics](https://docs.warpstream.com/warpstream/agent-setup/monitor-the-warpstream-agents/diagnostics) for any Tableflow-related failures. Failing diagnostics indicate issues such as ingestion being stopped, records being skipped or routed to DLQ, or DLQ replay backlog.
* Check [events](https://docs.warpstream.com/warpstream/reference/events) (`tableflow_logs`) for warnings and errors related to DLQ activity — these provide per-record detail including failure reasons and affected topics.
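To watch DLQ volume over time, a rate over the counter works well (a sketch; the tag names come from the metrics table above):

```promql
# Records skipped or kept per second over the last 5 minutes,
# broken out by source topic and DLQ strategy.
sum by (topic, strategy) (
  rate(warpstream_tableflow_dlq_records_counter[5m])
)
```

A sustained non-zero rate for `strategy="skip"` means data is being silently dropped from the table, which usually warrants its own alert.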

{% hint style="warning" %}
If DLQ mode is `stop` and ingestion encounters invalid records, the pipeline will halt. This will also surface as high ingestion lag.
{% endhint %}

## Resource Counts

The hosted Prometheus endpoint exposes gauges for resource usage on your Tableflow cluster, tagged by `virtual_cluster_id`:

| Metric                                  | Description                                     |
| --------------------------------------- | ----------------------------------------------- |
| `warpstream_tableflow_tables_count`     | Number of tables in the cluster.                |
| `warpstream_tableflow_files_count`      | Number of Iceberg data files across all tables. |
| `warpstream_tableflow_snapshots_count`  | Number of Iceberg snapshots across all tables.  |
| `warpstream_tableflow_partitions_count` | Number of partitions across all tables.         |

These are useful for capacity planning and tracking cluster growth over time.
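For growth tracking, a derivative over one of these gauges gives an approximate daily rate of change (a sketch; `deriv` returns a per-second slope, so multiply by 86400 for per-day):

```promql
# Approximate daily growth in Iceberg data files per cluster; a steadily
# climbing value can indicate compaction falling behind.
deriv(warpstream_tableflow_files_count[1d]) * 86400
```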

## Diagnostics and Events

[Diagnostics](https://docs.warpstream.com/warpstream/agent-setup/monitor-the-warpstream-agents/diagnostics) are proactive health checks that run continuously on your cluster. They surface problems in the Console UI and as `warpstream_diagnostic_failure` gauge metrics (1 = failing, 0 = healthy) that you can alert on. Diagnostics cover infrastructure issues (bucket access, source cluster authentication), resource limits (table count), and ingestion health (DLQ activity, record errors).

[Events](https://docs.warpstream.com/warpstream/reference/events) provide detailed, per-occurrence context for troubleshooting. Tableflow clusters emit `tableflow_logs` (ingestion failures, table lifecycle, compaction, catalog sync, DLQ replay) and `agent_logs` (general agent operations). Events must be [enabled](https://docs.warpstream.com/warpstream/reference/events#enabling-events) on your cluster.

To investigate issues:

* Look for diagnostics in a **failing** state in the Console Health tab or by alerting on `warpstream_diagnostic_failure == 1`.
* Look for events with `data.log_level == "error"` or `data.log_level == "warn"` in the Events Explorer, scoped to `tableflow_logs`.
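The diagnostic check above can be wired directly into an alerting rule (illustrative; the rule name and `for` duration are examples):

```yaml
- alert: TableflowDiagnosticFailing
  expr: warpstream_diagnostic_failure == 1
  for: 15m
  annotations:
    summary: "A WarpStream diagnostic has been failing for 15 minutes"
```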

For general resource alerts (CPU, memory), see [Recommended List of Alerts](https://docs.warpstream.com/warpstream/agent-setup/monitor-the-warpstream-agents/recommended-list-of-alerts). Tableflow Agents are stateless and can be auto-scaled based on CPU.
