Monitor the Agents

Logging

By default, the WarpStream Agent runs with log level info. You can change this with the WARPSTREAM_LOG_LEVEL environment variable. For example, if the info-level logs are too noisy, set WARPSTREAM_LOG_LEVEL=warn.

The WarpStream Agents have an additional special log level called analytics that can be enabled by setting WARPSTREAM_LOG_LEVEL=analytics. This enables extremely detailed JSON logging that can be loaded into a logging system that supports analytics, so you can slice and dice Agent log events and obtain a deep understanding of the workload. This level emits a very large volume of logs, so plan for the extra log volume before enabling it.
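As a rough illustration of the setting described above, the fragment below sets the log level on a Kubernetes Deployment. It is a sketch under assumptions: the container name warpstream-agent is borrowed from the Datadog annotation used later on this page, and the surrounding Deployment structure will differ in your environment.

```yaml
# Sketch only: assumes the Agent runs in a container named warpstream-agent.
spec:
  template:
    spec:
      containers:
        - name: warpstream-agent
          env:
            # Reduce log noise; use "analytics" instead for detailed JSON logs.
            - name: WARPSTREAM_LOG_LEVEL
              value: "warn"
```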

Health Check

The Agent exposes an HTTP health check endpoint at $IP:8080/v1/status. A successful response is the string OK with a 200 status code.
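If you run the Agents on Kubernetes, the health check endpoint can back a standard HTTP probe. The fragment below is a minimal sketch: the container name and the timing values (periodSeconds, failureThreshold) are assumptions, not recommended settings. Note that Kubernetes HTTP probes only check the status code, not the OK response body.

```yaml
# Sketch only: container name and timing values are assumptions.
spec:
  template:
    spec:
      containers:
        - name: warpstream-agent
          readinessProbe:
            httpGet:
              path: /v1/status   # health check endpoint described above
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
```

A livenessProbe takes the same httpGet form if you also want Kubernetes to restart an unresponsive Agent.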

Metrics

All WarpStream Agent metrics begin with the warpstream_agent prefix.

WarpStream Agent metrics exposed via Prometheus also carry the Prometheus namespace warpstream, so their full prefix becomes warpstream_warpstream_agent.

Full List of Important Metrics/Logs

In Important Metrics and Logs you can find a full list of the most important metrics and logs for monitoring the agent.

Alerting on Metrics

In Recommended List of Alerts you will find a list of key metrics for which you should configure alerts to detect issues in your agent effectively.

Datadog

Prometheus Exporter (Pull)

We recommend following the Datadog instructions for scraping Prometheus/OpenTelemetry metrics using the Datadog Agent. Configuration will vary from environment to environment, but you should end up with something like the following configuration (Kubernetes example):

spec:
  template:
    metadata:
      annotations:
        ad.datadoghq.com/warpstream-agent.checks: |
          {
            "openmetrics": {
              "init_config": {},
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:8080/metrics",
                  "metrics": [".*"],
                  "send_distribution_buckets": true,
                  "collect_counters_with_distributions": true,
                  "max_returned_metrics": 2000
                }
              ]
            }
          }

This configuration tells the Datadog Agent to scrape the WarpStream Agent on port 8080 and to collect all of the custom metrics that the WarpStream Agent exposes.

The Datadog Agent will scrape only 2,000 metrics by default. This limit may be too low for WarpStream if you have many topics and consumer groups and/or have high cardinality metrics enabled. If you observe dropped or missing metrics, consider increasing this value.

We also have a pre-made Datadog Dashboard that you can just import directly using the import JSON feature.

Statsd Client (Push)

Alternatively, the WarpStream Agents embed the Datadog statsd metrics client. If you prefer a push-based approach over scraping the Prometheus endpoint, you can configure the Agents to push statsd metrics to the Datadog Agent by setting the -enableDatadogMetrics flag or the WARPSTREAM_ENABLE_DATADOG_METRICS=true environment variable.

Prometheus

The WarpStream agents expose a traditional Prometheus metrics endpoint that is enabled by default on port 8080.

Prometheus metrics are exposed automatically on the Agent "internal port", which by default is the same as the Kinesis port (8080).

If you set an explicit port override for the Kinesis port or the Agent "internal" port, then you'll need to update your Prometheus scrape configuration port as well.

All Prometheus metrics are exposed under the warpstream namespace (see the Metrics section above for more details).
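For reference, a minimal Prometheus scrape configuration for the Agents might look like the sketch below. The job name, the target hostname, and the 60s interval are assumptions for illustration; the /metrics path and port 8080 match the Datadog example earlier on this page. If you override the Agent port, change the target accordingly.

```yaml
# Sketch only: job name, hostname, and interval are placeholders.
scrape_configs:
  - job_name: warpstream-agent
    metrics_path: /metrics
    scrape_interval: 60s
    static_configs:
      - targets:
          - "warpstream-agent.example.internal:8080"
```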

Observability

The WarpStream Agent publishes the following metrics every minute. You can use them to gain insight into what is happening under the hood.

Some of the metrics, particularly the consumer group metrics, can become very high cardinality if the cluster contains a lot of topics or partitions. To reduce the cardinality of the consumer group metrics, you can disable them entirely using the disableConsumerGroupMetrics flag or by setting WARPSTREAM_DISABLE_CONSUMER_GROUP_METRICS=true as an environment variable.

Alternatively, you can disable specific high cardinality dimensions, such as partition, by providing a comma-separated list of tags to drop. Use either the disableConsumerGroupsMetricsTags=partition flag or the WARPSTREAM_DISABLE_CONSUMER_GROUP_METRICS_TAGS=partition environment variable.
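The two cardinality options above could be applied on Kubernetes roughly as follows. This is a sketch that assumes the Agent container is named warpstream-agent; only the environment variable names and values come from this page.

```yaml
# Sketch only: assumes the Agent runs in a container named warpstream-agent.
spec:
  template:
    spec:
      containers:
        - name: warpstream-agent
          env:
            # Keep consumer group metrics but drop the per-partition dimension.
            - name: WARPSTREAM_DISABLE_CONSUMER_GROUP_METRICS_TAGS
              value: "partition"
            # Alternative: disable consumer group metrics entirely.
            # - name: WARPSTREAM_DISABLE_CONSUMER_GROUP_METRICS
            #   value: "true"
```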

Kafka

| Name | Description | Tags |
| --- | --- | --- |
| warpstream_consumer_group_lag | Difference (in offsets) between the max offset and the committed offset for every active consumer group. | virtual_cluster_id, topic, consumer_group, partition |
| warpstream_consumer_group_estimated_lag_very_coarse_do_not_use_to_measure_e2e_seconds | Rough estimate of how far behind (in seconds) a consumer group is from the latest messages. Note: this is NOT a precise measurement; it is a coarse estimate. | virtual_cluster_id, topic, consumer_group, partition |
| warpstream_consumer_group_generation_id | A unique identifier that increases with every consumer group rebalance. This allows you to easily track the number and frequency of rebalances. | virtual_cluster_id, consumer_group |
| warpstream_consumer_group_max_offset | Max offset of a given topic-partition for every topic-partition in every consumer group. | virtual_cluster_id, topic, consumer_group, partition |
| warpstream_consumer_group_state | State of each consumer group (stable, rebalancing, empty, etc.). | consumer_group, group_state |
| warpstream_consumer_group_num_members | Number of members in each consumer group. | consumer_group |
| warpstream_consumer_group_num_topics | Number of topics in each consumer group. | consumer_group |
| warpstream_consumer_group_num_partitions | Number of partitions in each consumer group. | consumer_group |
| warpstream_files_count | Number of files at each compaction level, so that users can monitor whether they are experiencing compaction lag. | compaction_level (0, 1 or 2 for now) |
| warpstream_topics_count | Total number of topics in the cluster. | N/A |
| warpstream_topics_count_limit | Maximum number of topics allowed in the cluster. Request a limit increase from the WarpStream team if necessary. | N/A |
| warpstream_partitions_count | Total number of partitions in the cluster. | N/A |
| warpstream_partitions_count_limit | Maximum number of partitions allowed in the cluster. Request a limit increase from the WarpStream team if necessary. | N/A |
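
To illustrate how the Kafka metrics above could drive alerts, here is a hypothetical Prometheus rule file. It is a sketch, not the recommended alert list: the alert names, thresholds, and durations are placeholders to tune for your workload, and the metric names are written as they appear in the table, so they may need the namespace prefix described in the Metrics section depending on how they are exposed in your setup.

```yaml
# Sketch only: alert names, thresholds, and durations are placeholders.
groups:
  - name: warpstream-agent
    rules:
      - alert: WarpStreamConsumerGroupLagHigh
        # Worst partition lag (in offsets) per consumer group and topic.
        expr: max by (consumer_group, topic) (warpstream_consumer_group_lag) > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.consumer_group }} is lagging on topic {{ $labels.topic }}"
      - alert: WarpStreamTopicsNearLimit
        # Fires when topic usage exceeds 80% of the cluster limit.
        expr: warpstream_topics_count / warpstream_topics_count_limit > 0.8
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Topic count is above 80% of the cluster limit"
```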

Schema Registry

| Name | Description | Tags |
| --- | --- | --- |
| warpstream_schema_versions_count | Total number of schema versions in the schema registry cluster. | N/A |
| warpstream_schema_versions_limit | Maximum number of schema versions allowed in the cluster. Request a limit increase from the WarpStream team if necessary. | N/A |
