Diagnostics

Diagnostics are WarpStream's feedback system, designed to provide actionable insights into the health, cost, and performance of your cluster. Each diagnostic measures specific conditions, evaluates their impact, and suggests how to address potential issues.

Dimensions of a Diagnostic

Each diagnostic has the following dimensions:

  • Type: The category of the diagnostic, such as Health or Cost.

  • Name: The component or system that the diagnostic evaluates.

  • Successful: Indicates whether the diagnostic has passed.

  • Severity: The impact level of the diagnostic, ranging from Low to Critical.

  • Muted: Specifies whether the diagnostic is muted and should not generate alerts.
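
As a rough illustration, the five dimensions above can be modeled as a small record. This is only a sketch: the class and field names, the `should_alert` rule, and the sample values (including which Type each sample diagnostic belongs to) are assumptions made for the example, not part of any WarpStream API.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Ordered from least to most impactful, matching "Low to Critical".
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Diagnostic:
    type: str            # category, e.g. "Health" or "Cost"
    name: str            # component or system being evaluated
    successful: bool     # True when the diagnostic has passed
    severity: Severity   # impact level
    muted: bool          # True when the diagnostic should not generate alerts

    def should_alert(self) -> bool:
        # Hypothetical rule: alert only on failing diagnostics that are not muted.
        return not self.successful and not self.muted


# Made-up sample values, for illustration only.
diagnostics = [
    Diagnostic("Health", "Agent Version", successful=False, severity=Severity.MEDIUM, muted=False),
    Diagnostic("Cost", "Cross AZ Kafka Clients", successful=False, severity=Severity.HIGH, muted=True),
    Diagnostic("Health", "Missing Roles", successful=True, severity=Severity.CRITICAL, muted=False),
]

alerting = [d.name for d in diagnostics if d.should_alert()]
print(alerting)
```

In this sketch, only the failing diagnostic that is not muted ends up in `alerting`.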

Viewing Diagnostics

Diagnostics can be accessed in three ways:

  1. Console UI: The diagnostics are displayed in the WarpStream console interface, providing a visual overview of the cluster's health and performance.

  2. Metrics (New Version): Diagnostics are also exposed as agent metrics, facilitating integration with existing monitoring systems. Each diagnostic generates a warpstream_diagnostic_failure gauge metric, where a value of 1 signifies a failing diagnostic and 0 indicates a healthy status.

    This metric includes the following descriptive tags:

    • diagnostic_name: The specific name of the diagnostic check. Unlike the old version, the name is now normalized to snake case for consistency.

    • diagnostic_type: The functional category of the diagnostic (e.g., health, cost).

    • severity_low: 1 if the diagnostic failure severity is 'low', 0 otherwise.

    • severity_medium: 1 if the diagnostic failure severity is 'medium', 0 otherwise.

    • severity_high: 1 if the diagnostic failure severity is 'high', 0 otherwise.

    • severity_critical: 1 if the diagnostic failure severity is 'critical', 0 otherwise.

    • muted: Indicates whether the diagnostic check has been temporarily suppressed (muted).

  3. Metrics (Legacy Version): Each diagnostic generates a legacy warpstream_diagnostic_status metric with the following tags:

    • diagnostic_name: The name of the diagnostic.

    • diagnostic_type: The functional category of the diagnostic.

    • successful: A boolean indicating whether the diagnostic check was successful.

    • muted: Whether the diagnostic has been muted.

    These metrics support the creation of custom monitoring and alerting setups.
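
As one possible starting point for a custom alerting setup, the sketch below picks out failing, unmuted diagnostics from the new warpstream_diagnostic_failure metric. It assumes the agent metrics have been scraped in Prometheus text exposition format, that the severity tags carry the strings "1" and "0", and that the muted tag carries "true" or "false". None of this is confirmed here, so check the actual output of your agents. The sample text and the `failing_diagnostics` helper are made up for the example.

```python
import re

# Made-up sample in Prometheus text exposition format; real names and tag values may differ.
SAMPLE = """\
warpstream_diagnostic_failure{diagnostic_name="agent_version",diagnostic_type="health",severity_low="0",severity_medium="1",severity_high="0",severity_critical="0",muted="false"} 1
warpstream_diagnostic_failure{diagnostic_name="stuck_consumer",diagnostic_type="health",severity_low="0",severity_medium="0",severity_high="1",severity_critical="0",muted="true"} 1
warpstream_diagnostic_failure{diagnostic_name="partitions_limit",diagnostic_type="health",severity_low="0",severity_medium="0",severity_high="0",severity_critical="0",muted="false"} 0
"""

LINE_RE = re.compile(r'^warpstream_diagnostic_failure\{(?P<tags>[^}]*)\}\s+(?P<value>\S+)$')
TAG_RE = re.compile(r'(\w+)="([^"]*)"')
SEVERITIES = ("low", "medium", "high", "critical")


def failing_diagnostics(text):
    """Return (name, type, severity) for each failing diagnostic that is not muted."""
    results = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines, and other metrics
        tags = dict(TAG_RE.findall(match.group("tags")))
        failing = float(match.group("value")) == 1.0  # 1 means failing, 0 means healthy
        muted = tags.get("muted", "").lower() in ("true", "1")  # assumed encodings
        if not failing or muted:
            continue
        # The severity is whichever severity_* tag is set to "1".
        severity = next(
            (s for s in SEVERITIES if tags.get(f"severity_{s}") == "1"), "unknown"
        )
        results.append((tags.get("diagnostic_name"), tags.get("diagnostic_type"), severity))
    return results


for name, kind, severity in failing_diagnostics(SAMPLE):
    print(f"{name} ({kind}) is failing with severity {severity}")
```

In most monitoring systems the same filter would be written as an alert rule on the gauge rather than in application code. The Python version is only meant to make the tag semantics concrete.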

Diagnostics Catalog Examples

Below are some of our diagnostics, each with a brief description of what the check covers. For the full list, open the console and check the Health tab of any cluster.

  • ACL Denied: Detects access denials by ACLs (principal/resource/operation/API); verify privileges.

  • Agent Version: Detects agents running older versions; recommends upgrading to stay within supported and optimized releases.

  • Cluster Load per Group Role: Checks average and P90 CPU load per agent group/role combination to spot hotspots.

  • Consumer Groups Configs: Detects risky consumer group timeouts (rebalance/session/heartbeat) that can trigger rebalances.

  • Cross AZ Kafka Clients: Detects clients connecting across AZs (missing AZ hints); increases latency and cross‑AZ costs.

  • Embedded Pipeline: Detects pipelines running on agents that also run other roles; recommends dedicated pipeline agents.

  • Instance Networking: Detects non network‑optimized instance types; recommends network‑optimized for better throughput/latency.

  • Kafka Client Version: Detects clients using old API versions; upgrade to modern Kafka client versions.

  • Known Bad Agent Versions: Detects agents running versions with known issues; recommends upgrading beyond affected range(s).

  • Known Bad Kafka Clients: Detects client libraries/versions with known issues or poor idempotent performance.

  • Mixed Agent Versions: Detects multiple agent versions co‑existing for too long; standardize versions across agents.

  • Missing Roles: Ensures at least one agent runs critical roles: jobs, proxy-produce, and proxy-consume.

  • Partitions Limit: Warns when approaching the cluster partitions limit.

  • Pipeline Not Runnable: Pipelines are running but there are no agents with the pipelines role in the relevant agent group(s).

  • Produce Batches: Warns when batches per second are very high, indicating suboptimal client batching/idempotence.

  • Stuck Consumer: Detects consumers that have not progressed for a while (lagging at a fixed offset).

  • Tableflow tables Limit: Warns when approaching the Tableflow tables limit in a Datalake cluster.
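
When matching catalog entries against the diagnostic_name tag of the new metric, which is normalized to snake case, a helper like the one below may be useful. The exact normalization rule is not documented on this page, so `to_snake_case` is only a guess at it (lowercase, with runs of other characters collapsed to underscores). Compare its output with the tag values your agents actually emit before relying on it.

```python
import re


def to_snake_case(display_name: str) -> str:
    # Assumed rule: lowercase the name, then collapse every run of
    # non-alphanumeric characters into a single underscore.
    return re.sub(r"[^a-z0-9]+", "_", display_name.lower()).strip("_")


for name in ("ACL Denied", "Cross AZ Kafka Clients", "Tableflow tables Limit"):
    print(name, "->", to_snake_case(name))
```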
