# Diagnostics

Diagnostics are WarpStream's feedback system, designed to provide actionable insights into the health, cost, and performance of your cluster. Each diagnostic measures specific conditions, evaluates their impact, and suggests how to address potential issues.

Check out this video to learn more about Diagnostics:

{% embed url="<https://www.youtube.com/watch?v=ZrdqgQS8yq8>" %}

## Dimensions of a Diagnostic

Each diagnostic has the following dimensions:

* **Type**: The category of the diagnostic, such as `Health` or `Cost`.
* **Name**: The component or system that the diagnostic evaluates.
* **Successful**: Indicates whether the diagnostic has passed.
* **Severity**: The impact level of the diagnostic, ranging from `Low` to `Critical`.
* **Muted**: Specifies whether the diagnostic is muted and should not generate alerts.
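As a rough illustration, the five dimensions above can be modeled as a small record type. This is a minimal sketch, not WarpStream's actual data model: the `Diagnostic` and `Severity` names and the `needs_attention` helper are hypothetical, and the `Medium` and `High` levels are taken from the severity tags on the diagnostic metrics rather than from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Ordered from least to most severe. Medium and High are assumed from
    # the severity tags that the diagnostic metrics expose.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Diagnostic:
    """Hypothetical record mirroring the dimensions of a diagnostic."""

    type: str           # category, such as "Health" or "Cost"
    name: str           # component or system being evaluated
    successful: bool    # True when the diagnostic has passed
    severity: Severity  # impact level, from Low to Critical
    muted: bool         # muted diagnostics should not generate alerts

    def needs_attention(self) -> bool:
        # A diagnostic is worth alerting on only if it failed and is not muted.
        return not self.successful and not self.muted


diagnostics = [
    Diagnostic("Health", "Missing Roles", False, Severity.CRITICAL, False),
    Diagnostic("Cost", "Cross AZ Kafka Clients", False, Severity.MEDIUM, True),
    Diagnostic("Health", "Agent Version", True, Severity.LOW, False),
]

# Keep only failing, unmuted diagnostics, most severe first.
to_review = sorted(
    (d for d in diagnostics if d.needs_attention()),
    key=lambda d: d.severity.value,
    reverse=True,
)
```

Separating `successful` from `muted` keeps the two concerns independent: a muted diagnostic can still be failing, it just should not page anyone.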

## Viewing Diagnostics

Diagnostics can be accessed through the console UI or through agent metrics, which come in a new and a legacy version:

1. **Console UI**: The diagnostics are displayed in the WarpStream console interface, providing a visual overview of the cluster's health and performance.
2. **Metrics** (New Version): Diagnostics are also exposed as agent metrics, facilitating integration with existing monitoring systems. Each diagnostic generates a `warpstream_diagnostic_failure` gauge metric, where a value of 1 signifies a failing diagnostic and 0 indicates a healthy status.

   This metric includes the following descriptive tags:

   * `diagnostic_name`: The specific name of the diagnostic check. Unlike in the legacy metric, the name is normalized to snake case for consistency.
   * `diagnostic_type`: The functional category of the diagnostic (e.g., health, cost).
   * `severity_low`: 1 if the diagnostic failure severity is 'low', 0 otherwise.
   * `severity_medium`: 1 if the severity is 'medium', 0 otherwise.
   * `severity_high`: 1 if the severity is 'high', 0 otherwise.
   * `severity_critical`: 1 if the severity is 'critical', 0 otherwise.
   * `muted`: Indicates whether the diagnostic check has been temporarily suppressed (muted).
3. **Metrics** (Legacy Version): Each diagnostic generates a legacy `warpstream_diagnostic_status` metric with the following tags:

   * `diagnostic_name`: The name of the diagnostic.
   * `diagnostic_type`: The functional category of the diagnostic.
   * `successful`: A boolean indicating whether the diagnostic check was successful.
   * `muted`: Whether the diagnostic has been muted.

   These metrics support the creation of custom monitoring and alerting setups.

{% hint style="success" %}
Diagnostics metrics are emitted by the agents and are also exposed via the agents' **Prometheus endpoint** for easy scraping by monitoring tools.
{% endhint %}
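The sketch below shows one way to pick out failing, unmuted diagnostics from the Prometheus text exposition of the `warpstream_diagnostic_failure` gauge. It makes several assumptions that are not confirmed by this page: that the severity tags are rendered as the strings `"1"` and `"0"`, that `muted` is rendered as `"true"`/`"false"` (or `"1"`/`"0"`), and that the sample lines resemble real agent output. The `failing_diagnostics` function and the sample payload are hypothetical.

```python
import re
from typing import Dict, List

METRIC = "warpstream_diagnostic_failure"

# One labeled sample line: metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

SEVERITIES = ("critical", "high", "medium", "low")


def failing_diagnostics(exposition: str) -> List[Dict[str, str]]:
    """Return failing, unmuted diagnostics found in Prometheus text output."""
    results = []
    for line in exposition.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None or match.group("name") != METRIC:
            continue  # comments, blank lines, and unrelated metrics
        if float(match.group("value")) != 1:
            continue  # 0 indicates a healthy diagnostic
        labels = dict(LABEL_RE.findall(match.group("labels")))
        # Assumption: muted is rendered as "true"/"false" or "1"/"0".
        if labels.get("muted", "").lower() in ("true", "1"):
            continue
        # Assumption: exactly one severity_* tag carries the value "1".
        severity = next(
            (s for s in SEVERITIES if labels.get(f"severity_{s}") == "1"),
            "unknown",
        )
        results.append({
            "name": labels.get("diagnostic_name", ""),
            "type": labels.get("diagnostic_type", ""),
            "severity": severity,
        })
    return results


# Hypothetical sample; real agent output may carry additional labels.
SAMPLE = '''\
# TYPE warpstream_diagnostic_failure gauge
warpstream_diagnostic_failure{diagnostic_name="stuck_consumer",diagnostic_type="health",severity_low="0",severity_medium="0",severity_high="1",severity_critical="0",muted="false"} 1
warpstream_diagnostic_failure{diagnostic_name="agent_version",diagnostic_type="health",severity_low="1",severity_medium="0",severity_high="0",severity_critical="0",muted="false"} 0
warpstream_diagnostic_failure{diagnostic_name="produce_batches",diagnostic_type="cost",severity_low="0",severity_medium="1",severity_high="0",severity_critical="0",muted="true"} 1
'''

print(failing_diagnostics(SAMPLE))
```

In practice the same filter is usually expressed in the monitoring system's own query language; the Python version is only meant to make the meaning of the gauge value and its tags concrete.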

## Examples of Diagnostics Catalog

Below are some of our diagnostics, with a brief description of what each check covers. For the full list, open the console and check the Health tab of any cluster.

| Diagnostic                  | Description                                                                                                  |
| --------------------------- | ------------------------------------------------------------------------------------------------------------ |
| ACL Denied                  | Detects access denials by ACLs (principal/resource/operation/API); verify privileges.                        |
| Agent Version               | Detects agents running older versions; recommends upgrading to stay within supported and optimized releases. |
| Cluster Load per Group Role | Checks average and P90 CPU load per agent group/role combination to spot hotspots.                           |
| Consumer Groups Configs     | Detects risky consumer group timeouts (rebalance/session/heartbeat) that can trigger rebalances.             |
| Cross AZ Kafka Clients      | Detects clients connecting across AZs (missing AZ hints); increases latency and cross‑AZ costs.              |
| Embedded Pipeline           | Detects pipelines running on agents that also run other roles; recommends dedicated pipeline agents.         |
| Instance Networking         | Detects non network‑optimized instance types; recommends network‑optimized for better throughput/latency.    |
| Kafka Client Version        | Detects clients using old API versions; upgrade to modern Kafka client versions.                             |
| Known Bad Agent Versions    | Detects agents running versions with known issues; recommends upgrading beyond affected range(s).            |
| Known Bad Kafka Clients     | Detects client libraries/versions with known issues or poor idempotent performance.                          |
| Mixed Agent Versions        | Detects multiple agent versions co‑existing for too long; standardize versions across agents.                |
| Missing Roles               | Ensures at least one agent runs critical roles: `jobs`, `proxy-produce`, and `proxy-consume`.                |
| Partitions Limit            | Warns when approaching the cluster partitions limit.                                                         |
| Pipeline Not Runnable       | Pipelines are running but there are no agents with the `pipelines` role in the relevant agent group(s).      |
| Produce Batches             | Warns when batches per second are very high, indicating suboptimal client batching/idempotence.              |
| Stuck Consumer              | Detects consumers that have not progressed for a while (lagging at a fixed offset).                          |
| Tableflow Tables Limit      | Warns when approaching the Tableflow tables limit in a Datalake cluster.                                     |

To investigate a failing diagnostic, it's often helpful to introspect your cluster using the [Events Explorer](https://docs.warpstream.com/warpstream/reference/events). You can also access diagnostics and query events from your IDE using an AI assistant through the [MCP Server](https://docs.warpstream.com/warpstream/reference/mcp-server).

