Monitoring Tableflow

Key metrics, diagnostics, events, and recommended alerts for Tableflow clusters.

Tableflow Agents expose a Prometheus endpoint on the internal port (default 8080). Lag metrics and resource counts are also available on the hosted Prometheus endpoint without needing to scrape agents directly. All metrics use the warpstream_ prefix. If you are using Datadog with Agent v679 or later, the prefix is warpstream. instead (dot, not underscore). See Set up Monitoring for general setup.

Ingestion Lag

Metrics

Metric
Unit
Description
Min version

warpstream_tableflow_query_lag_seconds

seconds

End-to-end delay until data is queryable: ingestion lag + Iceberg catalog sync staleness. This is the primary metric to monitor.

v778+

warpstream_tableflow_ingestion_lag_seconds

seconds

Time since the last ingested record, per table. 0 when caught up.

v778+

warpstream_tableflow_partition_offset_lag

records

Number of records not yet ingested per table (high watermark minus last ingested offset).

v776+

All lag metrics are tagged by virtual_cluster_id, topic, table.name, and table.uuid. On the hosted Prometheus endpoint, tags use underscores (table_name, table_uuid) and an additional is_dlq tag distinguishes between the main ingestion pipeline (is_dlq="false") and the DLQ replay pipeline (is_dlq="true"). Ingestion lag is also visible in the WarpStream Console.

circle-info

warpstream_tableflow_ingestion_lag_seconds was previously named warpstream_tableflow_partition_time_lag_seconds (available since v710). The old name is deprecated starting in v778.

How to interpret these metrics

Query lag (warpstream_tableflow_query_lag_seconds) is the metric you should use to understand how fresh the data is when you query your Iceberg tables. It captures the full delay from when data is produced to Kafka until it becomes visible to query engines like Spark, Trino, or DuckDB. This delay has two components: the ingestion itself and the Iceberg catalog sync.

Ingestion lag (warpstream_tableflow_ingestion_lag_seconds) helps you break down where the delay is coming from:

  • If query lag is high but ingestion lag is low, the bottleneck is the catalog sync — data has been ingested into Iceberg files but the catalog metadata has not been refreshed yet.

  • If ingestion lag is high, it means the ingestion pipeline itself is falling behind. This is typically caused by one of two things:

    • An error — check your cluster's diagnostics for any failing health checks, and look for warnings or errors in events (tableflow_logs) to identify the root cause.

    • Insufficient capacity — the agents cannot keep up with the produce rate on the source topics. Add more Tableflow Agents to increase ingestion throughput.

circle-exclamation
  • Alert on warpstream_tableflow_query_lag_seconds sustained above your freshness SLA (e.g., 600s).

  • Alert on warpstream_tableflow_ingestion_lag_seconds sustained above a threshold (e.g., 300s) to catch pipeline issues early, before they affect queryability.

Dead Letter Queue (DLQ)

When a table is configured with a DLQ mode (stop, skip, or keep), invalid records are handled according to that mode.

Metrics

Metric
Unit
Description

warpstream_tableflow_dlq_records_counter

records

Number of records handled by the DLQ during ingestion, tagged by topic and strategy (skip or keep).

Use diagnostics and events for additional DLQ monitoring:

  • Check your cluster's diagnostics for any Tableflow-related failures. Failing diagnostics indicate issues such as ingestion being stopped, records being skipped or routed to DLQ, or DLQ replay backlog.

  • Check events (tableflow_logs) for warnings and errors related to DLQ activity — these provide per-record detail including failure reasons and affected topics.

circle-exclamation

Resource Counts

The hosted Prometheus endpoint exposes gauges for resource usage on your Tableflow cluster, tagged by virtual_cluster_id:

Metric
Description

warpstream_tableflow_tables_count

Number of tables in the cluster.

warpstream_tableflow_files_count

Number of Iceberg data files across all tables.

warpstream_tableflow_snapshots_count

Number of Iceberg snapshots across all tables.

warpstream_tableflow_partitions_count

Number of partitions across all tables.

These are useful for capacity planning and tracking cluster growth over time.

Diagnostics and Events

Diagnostics are proactive health checks that run continuously on your cluster. They surface problems in the Console UI and as warpstream_diagnostic_failure gauge metrics (1 = failing, 0 = healthy) that you can alert on. Diagnostics cover infrastructure issues (bucket access, source cluster authentication), resource limits (table count), and ingestion health (DLQ activity, record errors).

Events provide detailed, per-occurrence context for troubleshooting. Tableflow clusters emit tableflow_logs (ingestion failures, table lifecycle, compaction, catalog sync, DLQ replay) and agent_logs (general agent operations). Events must be enabled on your cluster.

To investigate issues:

  • Look for diagnostics in a failing state in the Console Health tab or by alerting on warpstream_diagnostic_failure == 1.

  • Look for events with data.log_level == "error" or data.log_level == "warn" in the Events Explorer, scoped to tableflow_logs.

For general resource alerts (CPU, memory), see Recommended List of Alerts. Tableflow Agents are stateless and can be auto-scaled based on CPU.

Last updated

Was this helpful?