# Monitoring Consumer Groups

Most open source Kafka deployments use external tooling to monitor consumer group lag. Some of this tooling is compatible with WarpStream because it only uses the public Kafka API. Other tools, like Burrow, are incompatible because they rely on internal implementation details of Kafka, such as reading the internal consumer group offset topics.

Luckily, WarpStream has support for monitoring consumer groups built in, so no external tooling is required. In addition, WarpStream reports consumer group lag measured **in time** as well as measured **in offsets**. See [our blog post about measuring consumer lag in time](https://www.warpstream.com/blog/the-kafka-metric-youre-not-using-stop-counting-messages-start-measuring-time) for more details about why this is valuable.

Consumer group metadata and lag are available in a variety of locations with WarpStream.

## UI

The WarpStream UI exposes consumer group metadata and lag. This is not useful for alerting purposes, but can be helpful when debugging consumers.

<figure><img src="https://77315434-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FjB7FxO8ty4EXO4HsQP4E%2Fuploads%2Fgit-blob-c5813257747a94f1c5780117c1b71f05e67cd087%2FScreenshot%202025-02-14%20at%2012.17.08%E2%80%AFPM.png?alt=media" alt=""><figcaption><p>Click on an individual consumer group to see more details.</p></figcaption></figure>

## API

Consumer group lag is available through [our dedicated HTTP/JSON API](https://docs.warpstream.com/warpstream/reference/api-reference/monitoring/describe-all-consumer-groups).

## Metrics

{% hint style="warning" %}
**We recommend using the** [**hosted prometheus endpoint**](https://docs.warpstream.com/warpstream/agent-setup/monitor-the-warpstream-agents/hosted-prometheus-endpoint) **for consumer group metrics rather than directly going through the Agents.**\
This can be helpful for workloads with a high number of topics / partitions, where the time series cardinality is already high and multiplying it by the number of unique Agent pod names would make it even higher.
{% endhint %}

The Agents expose [built-in metrics](https://docs.warpstream.com/warpstream/agent-setup/monitor-the-warpstream-agents) that you can scrape within your own environment. These include everything you need to monitor your applications for consumer group lag.

Some of the metrics, particularly the consumer group metrics, can become very high cardinality if the cluster contains a lot of topics or partitions. To eliminate the cardinality added by the consumer group metrics, you can disable them entirely, either by using the `disableConsumerGroupMetrics` flag or by setting `WARPSTREAM_DISABLE_CONSUMER_GROUP_METRICS=true` as an environment variable.

The most important metrics are `warpstream_consumer_group_lag` (lag in offsets per tuple of `<topic, consumer_group>`) and `warpstream_consumer_group_estimated_lag_very_coarse_do_not_use_to_measure_e2e_seconds`, which is a mouthful but can be used to configure alerts based on time instead of offset count.
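As a rough illustration of a time-based check, the sketch below scans Prometheus text exposition output for the time-lag metric and returns the `<topic, consumer_group>` pairs above a threshold. The metric name comes from this page; the sample payload, the label values, the 30 second threshold, and the helper name `groups_over_threshold` are all assumptions made up for the example, and the parser is deliberately simplified (it does not handle escaped characters in label values). In practice an alerting rule in your monitoring system would usually do this job.

```python
import re

# Metric name as documented on this page.
TIME_LAG_METRIC = (
    "warpstream_consumer_group_estimated_lag_very_coarse"
    "_do_not_use_to_measure_e2e_seconds"
)

# Simplified patterns for lines of the form: name{label="value",...} 12.5
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{([^}]*)\}\s+(\S+)')
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def groups_over_threshold(exposition_text, metric_name, threshold):
    """Return (topic, consumer_group, value) for samples above the threshold."""
    offenders = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None or match.group(1) != metric_name:
            continue
        labels = dict(_LABEL_RE.findall(match.group(2)))
        value = float(match.group(3))
        if value > threshold:
            offenders.append(
                (labels.get("topic"), labels.get("consumer_group"), value)
            )
    return offenders


# Hypothetical scrape output; real label sets and values will differ.
SAMPLE = "\n".join([
    'warpstream_consumer_group_lag{consumer_group="billing",topic="orders"} 1200',
    TIME_LAG_METRIC + '{consumer_group="billing",topic="orders"} 45',
    TIME_LAG_METRIC + '{consumer_group="search",topic="clicks"} 3',
])

late_groups = groups_over_threshold(SAMPLE, TIME_LAG_METRIC, threshold=30)
```

Alerting on seconds rather than offsets means the same threshold stays meaningful across topics with very different throughput.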

{% hint style="info" %}
The `partition` tag is disabled by default to reduce cardinality. If you want to enable it, set the `disableConsumerGroupsMetricsTags` flag or `WARPSTREAM_DISABLE_CONSUMER_GROUP_METRICS_TAGS` environment variable to an empty string (the default value is "partition").\
\
When the `partition` tag is disabled, the `warpstream_consumer_group_lag` metric will be the sum of the consumer group lag across the topic's partitions. The `warpstream_consumer_group_estimated_lag_very_coarse_do_not_use_to_measure_e2e_seconds` metric will be the max of the estimated lag across the topic's partitions.
{% endhint %}
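To make the aggregation concrete, here is a minimal sketch of how per-partition values would collapse once the `partition` tag is dropped: offsets lag is summed and estimated time lag takes the maximum. The function name, the tuple layout, and the sample numbers are hypothetical; this illustrates the arithmetic only and is not how the Agents implement it.

```python
from collections import defaultdict


def aggregate_without_partition(samples):
    """Collapse per-partition samples to one value per (topic, consumer_group).

    samples: iterable of
        (topic, consumer_group, partition, lag_offsets, lag_seconds)
    Returns two dicts keyed by (topic, consumer_group):
        summed offsets lag, and max estimated time lag.
    """
    lag_sum = defaultdict(int)
    seconds_max = defaultdict(float)
    for topic, group, _partition, lag_offsets, lag_seconds in samples:
        key = (topic, group)
        lag_sum[key] += lag_offsets                             # sum across partitions
        seconds_max[key] = max(seconds_max[key], lag_seconds)   # worst partition wins
    return dict(lag_sum), dict(seconds_max)


# Made-up values for a three-partition topic.
samples = [
    ("orders", "billing", 0, 10, 1.5),
    ("orders", "billing", 1, 0, 0.0),
    ("orders", "billing", 2, 25, 4.0),
]
offsets_lag, time_lag = aggregate_without_partition(samples)
```

Taking the max for the time-based metric means a single slow partition is still visible after aggregation, whereas a sum of seconds would not correspond to any real delay.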

<table><thead><tr><th width="249.33333333333331">Name</th><th>Description</th><th>Tags</th></tr></thead><tbody><tr><td><code>warpstream_consumer_group_lag</code></td><td>Difference (in offsets) between the max offset and the committed offset for every active consumer group.</td><td><code>virtual_cluster_id</code>, <code>topic</code>, <code>consumer_group</code> and <code>partition</code></td></tr><tr><td><code>warpstream_consumer_group_estimated_lag_very_coarse_do_not_use_to_measure_e2e_seconds</code></td><td><p>Gives a <em>rough estimate</em> of how far behind (in seconds) a consumer group is from the latest messages.</p><p><strong>Note:</strong> This is NOT for precise measurement; it's a coarse estimate.</p></td><td><code>virtual_cluster_id</code>, <code>topic</code>, <code>consumer_group</code> and <code>partition</code></td></tr><tr><td><code>warpstream_consumer_group_generation_id</code></td><td>A unique identifier that increases with every consumer group rebalance. This allows you to easily track the number and frequency of rebalances.</td><td><code>virtual_cluster_id</code> and <code>consumer_group</code></td></tr><tr><td><code>warpstream_consumer_group_max_offset</code></td><td>Max offset of a given topic-partition for every topic-partition in every consumer group.</td><td><code>virtual_cluster_id</code>, <code>topic</code>, <code>consumer_group</code> and <code>partition</code></td></tr><tr><td><code>warpstream_consumer_group_state</code></td><td>State of each consumer group (stable, rebalancing, empty, etc.)</td><td><code>consumer_group</code>, <code>group_state</code></td></tr><tr><td><code>warpstream_consumer_group_num_members</code></td><td>Number of members in each consumer group.</td><td><code>consumer_group</code></td></tr><tr><td><code>warpstream_consumer_group_num_topics</code></td><td>Number of topics in each consumer group.</td><td><code>consumer_group</code></td></tr><tr><td><code>warpstream_consumer_group_num_partitions</code></td><td>Number of partitions in each consumer group.</td><td><code>consumer_group</code></td></tr></tbody></table>

## Measuring E2E Latency More Accurately

As suggested by its name, the `warpstream_consumer_group_estimated_lag_very_coarse_do_not_use_to_measure_e2e_seconds` metric is coarse. For example, if the actual end-to-end (E2E) latency of an application is 800ms, this metric may report the E2E latency as high as 5-8s.

For most applications, this is sufficiently accurate for monitoring and alerting purposes, but some applications may require more fine-grained observability. In that case, the best approach is to monitor the E2E latency manually in your application.

This can be accomplished by emitting a metric for the delta between the current timestamp and the timestamp of each individual record in your application.
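A minimal sketch of that delta computation, assuming the record timestamp is in milliseconds since the Unix epoch (the unit Kafka record timestamps use). The function name `e2e_latency_seconds` is hypothetical, and clamping negative values to zero is a design choice for this example to absorb clock skew between producer and consumer hosts. How you emit the resulting value depends on your metrics library.

```python
import time


def e2e_latency_seconds(record_timestamp_ms, now_ms=None):
    """Seconds elapsed between a record's timestamp and 'now'.

    record_timestamp_ms: record timestamp, milliseconds since the Unix epoch.
    now_ms: optional override for the current time, useful in tests.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp at zero so clock skew never yields a negative latency.
    return max(0.0, (now_ms - record_timestamp_ms) / 1000.0)


# Example: a record stamped 800 ms before it was processed.
latency = e2e_latency_seconds(1_700_000_000_000, now_ms=1_700_000_000_800)
```

Recording this value in a histogram (rather than a gauge) in your consumer makes it possible to alert on percentiles.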

There are three different ways that you can assign a timestamp to individual records when they're produced so that they're available to your consumer application:

1. Every Kafka record has a built-in timestamp. If your application doesn't specifically override this value, then it will automatically be set either to the current time of the producer client when the record is produced, or to the current time of the broker when the record is written to disk. Which value is used depends on the configured value of [`message.timestamp.type`](https://docs.warpstream.com/warpstream/kafka/reference/protocol-and-feature-support/topic-configuration-reference#message.timestamp.type) on your cluster / topic.
2. You can add a custom header to your Kafka records with the current timestamp when producing records.
3. You can add a custom field in the payload of your Kafka records with the current timestamp when producing records.
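The header-based option (2) could be sketched as follows, with a fallback to the built-in record timestamp (1). The header name `x-produced-at-ms`, the 8-byte big-endian encoding, and both helper names are assumptions chosen for this example; the code also assumes headers are represented as a list of `(key, bytes)` tuples, which is how several Python Kafka clients expose them, so adapt it to your client.

```python
import struct
import time

# Hypothetical header name; pick any name that fits your conventions.
TIMESTAMP_HEADER = "x-produced-at-ms"


def make_timestamp_header(now_ms=None):
    """Producer side: build a (key, value) header carrying the produce time."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Encode as a signed 64-bit big-endian integer (8 bytes).
    return (TIMESTAMP_HEADER, struct.pack(">q", now_ms))


def produced_at_ms(headers, record_timestamp_ms):
    """Consumer side: prefer the custom header, else the built-in timestamp."""
    for key, value in headers or []:
        if key == TIMESTAMP_HEADER and value is not None and len(value) == 8:
            return struct.unpack(">q", value)[0]
    return record_timestamp_ms


header = make_timestamp_header(now_ms=1_700_000_000_000)
from_header = produced_at_ms([header], record_timestamp_ms=5)
from_record = produced_at_ms([], record_timestamp_ms=5)
```

A header keeps the timestamp out of the payload schema, while a payload field (option 3) survives tooling that drops headers; which trade-off is better depends on your pipeline.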

{% hint style="info" %}
Note that it's impossible for WarpStream to automate this measurement because the WarpStream Agents have no way to accurately measure at what time the consumer application actually received and processed the records. As a result, the `estimated_lag_very_coarse` metric has to wait for the records to be **committed** (which may happen many seconds after the records are processed) before it can consider them "processed" from an E2E latency perspective. That's why the built-in metric tends to over-estimate E2E latency by a non-trivial amount.

The `estimated_lag_very_coarse` metric also has to rely on some amount of linear interpolation for efficiency reasons, which makes it less accurate than the approach described in this section.
{% endhint %}

