Diagnostics
Diagnostics are WarpStream's feedback system, designed to provide actionable insights into the health, cost, and performance of your cluster. Each diagnostic measures a specific condition, evaluates its impact, and suggests how to address potential issues.
Dimensions of a Diagnostic
Each diagnostic has the following dimensions:
Type: The category of the diagnostic, such as Health or Cost.
Name: The component or system that the diagnostic evaluates.
Successful: Indicates whether the diagnostic has passed.
Severity: The impact level of the diagnostic, ranging from Low to Critical.
Muted: Specifies whether the diagnostic is muted and should not generate alerts.
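As a rough illustration of how these dimensions fit together, the sketch below models a diagnostic as a small Python record. The class names, field names, and the should_alert helper are hypothetical and are not part of any WarpStream API. The four severity levels are the ones exposed by the metric tags (low, medium, high, critical).

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Severity levels, ordered from least to most impactful."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Diagnostic:
    """Hypothetical in-memory view of one diagnostic and its dimensions."""
    diagnostic_type: str   # e.g. "Health" or "Cost"
    name: str              # component or system being evaluated
    successful: bool       # True when the diagnostic has passed
    severity: Severity     # impact level of the diagnostic
    muted: bool            # muted diagnostics should not generate alerts

    def should_alert(self, min_severity: Severity = Severity.LOW) -> bool:
        """Alert only on failing, unmuted diagnostics at or above a threshold.

        The threshold is an assumption made for this sketch, not a
        documented WarpStream behavior.
        """
        return (
            not self.successful
            and not self.muted
            and self.severity.value >= min_severity.value
        )
```

For example, a failing, unmuted High diagnostic would alert with the default threshold but not when min_severity is set to Severity.CRITICAL.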
Viewing Diagnostics
Diagnostics can be accessed in two ways:
Console UI: Diagnostics are displayed in the WarpStream console, providing a visual overview of the cluster's health and performance.
Metrics (New Version): Diagnostics are also exposed as agent metrics, facilitating integration with existing monitoring systems. Each diagnostic generates a warpstream_diagnostic_failure gauge metric, where a value of 1 signifies a failing diagnostic and 0 indicates a healthy status. This metric includes the following descriptive tags:
    diagnostic_name: The specific name of the diagnostic check. Unlike the old version, the name is now normalized to snake case for consistency.
    diagnostic_type: The functional category of the diagnostic (e.g., health, cost).
    severity_low: 1 if the diagnostic failure severity is 'low', 0 otherwise.
    severity_medium: 1 if the severity is 'medium', 0 otherwise.
    severity_high: 1 if the severity is 'high', 0 otherwise.
    severity_critical: 1 if the severity is 'critical', 0 otherwise.
    muted: Indicates whether the diagnostic check has been temporarily suppressed (muted).
Metrics (Legacy Version): Each diagnostic generates a legacy warpstream_diagnostic_status metric with the following tags:
    diagnostic_name: The name of the diagnostic.
    diagnostic_type: The functional category of the diagnostic.
    successful: A boolean indicating whether the diagnostic check was successful.
    muted: Whether the diagnostic has been muted.
These metrics support the creation of custom monitoring and alerting setups.
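When moving alert rules from the legacy metric to the new one, the diagnostic_name values may need translating, since the new version normalizes names to snake case. The helper below is a minimal sketch of one plausible normalization (lowercase, with runs of non-alphanumeric characters collapsed to underscores). The exact rule WarpStream applies is not documented here, so treat to_snake_case as an assumption and compare its output against the names your agents actually emit.

```python
import re


def to_snake_case(display_name: str) -> str:
    """Convert a display name such as "Cross AZ Kafka Clients" to snake case.

    Assumed rule: lowercase everything, then replace each run of characters
    that are not ASCII letters or digits with a single underscore.
    """
    lowered = display_name.lower()
    return re.sub(r"[^a-z0-9]+", "_", lowered).strip("_")
```

Under this assumed rule, "Cross AZ Kafka Clients" maps to cross_az_kafka_clients.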
Diagnostic metrics are available in the agents and are exposed via the Prometheus endpoint, so monitoring tools can scrape them directly.
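To make the shape of the new metric concrete, the sketch below parses Prometheus text-format output and lists the diagnostics that are failing and not muted, together with their severity. It uses only the Python standard library. The sample lines and the failing_diagnostics helper are illustrative, and the encoding of the muted tag ("true" or "1") is an assumption, since only its meaning is described above.

```python
import re
from typing import List, NamedTuple

METRIC = "warpstream_diagnostic_failure"
SEVERITIES = ("critical", "high", "medium", "low")

# Matches a labelled sample line: metric_name{labels} value [timestamp]
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)'
)
# Matches one label pair: key="value" (allows backslash escapes in the value)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


class Failure(NamedTuple):
    name: str
    diagnostic_type: str
    severity: str


def failing_diagnostics(exposition: str) -> List[Failure]:
    """Return failing, unmuted diagnostics found in Prometheus text output."""
    failures = []
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None or match.group("name") != METRIC:
            continue
        if float(match.group("value")) != 1.0:
            continue  # 0 means the diagnostic is healthy
        labels = dict(_LABEL_RE.findall(match.group("labels")))
        # Assumption: the muted tag is rendered as "true" or "1" when set.
        if labels.get("muted", "").lower() in ("true", "1"):
            continue
        # Exactly one severity_<level> tag is expected to be "1".
        severity = next(
            (s for s in SEVERITIES if labels.get(f"severity_{s}") == "1"),
            "unknown",
        )
        failures.append(
            Failure(
                labels.get("diagnostic_name", ""),
                labels.get("diagnostic_type", ""),
                severity,
            )
        )
    return failures


# Made-up sample output, for illustration only.
SAMPLE = """\
# TYPE warpstream_diagnostic_failure gauge
warpstream_diagnostic_failure{diagnostic_name="agent_version",diagnostic_type="health",severity_low="0",severity_medium="0",severity_high="1",severity_critical="0",muted="false"} 1
warpstream_diagnostic_failure{diagnostic_name="produce_batches",diagnostic_type="cost",severity_low="1",severity_medium="0",severity_high="0",severity_critical="0",muted="true"} 1
warpstream_diagnostic_failure{diagnostic_name="missing_roles",diagnostic_type="health",severity_low="0",severity_medium="0",severity_high="0",severity_critical="1",muted="false"} 0
"""

if __name__ == "__main__":
    for failure in failing_diagnostics(SAMPLE):
        print(f"{failure.severity}: {failure.name} ({failure.diagnostic_type})")
```

In a real setup the same filtering would more commonly be expressed as an alerting rule in the monitoring system itself. The script form is shown here only because it makes the tag semantics explicit.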
Examples from the Diagnostics Catalog
Below are some of our diagnostics, with a brief description of what each check covers. For the full list, go to the console and check the Health tab of any cluster.
ACL Denied
Detects access denials by ACLs (principal/resource/operation/API); verify privileges.
Agent Version
Detects agents running older versions; recommends upgrading to stay within supported and optimized releases.
Cluster Load per Group Role
Checks average and P90 CPU load per agent group/role combination to spot hotspots.
Consumer Groups Configs
Detects risky consumer group timeouts (rebalance/session/heartbeat) that can trigger rebalances.
Cross AZ Kafka Clients
Detects clients connecting across AZs (missing AZ hints); increases latency and cross‑AZ costs.
Embedded Pipeline
Detects pipelines running on agents that also run other roles; recommends dedicated pipeline agents.
Instance Networking
Detects non network‑optimized instance types; recommends network‑optimized for better throughput/latency.
Kafka Client Version
Detects clients using old API versions; upgrade to modern Kafka client versions.
Known Bad Agent Versions
Detects agents running versions with known issues; recommends upgrading beyond affected range(s).
Known Bad Kafka Clients
Detects client libraries/versions with known issues or poor idempotent performance.
Mixed Agent Versions
Detects multiple agent versions co‑existing for too long; standardize versions across agents.
Missing Roles
Ensures at least one agent runs critical roles: jobs, proxy-produce, and proxy-consume.
Partitions Limit
Warns when approaching the cluster partitions limit.
Pipeline Not Runnable
Pipelines are running but there are no agents with the pipelines role in the relevant agent group(s).
Produce Batches
Warns when batches per second are very high, indicating suboptimal client batching/idempotence.
Stuck Consumer
Detects consumers that have not progressed for a while (lagging at a fixed offset).
Tableflow tables Limit
Warns when approaching the Tableflow tables limit in a Datalake cluster.