Monitoring

Logging

By default, the WarpStream Agent runs with log level info. You can change this with the WARPSTREAM_LOG_LEVEL environment variable. For example, if the info-level logs are too noisy, set WARPSTREAM_LOG_LEVEL=warn.

The WarpStream Agents have an additional special log level called analytics that can be enabled by setting WARPSTREAM_LOG_LEVEL=analytics. This enables extremely detailed JSON logging that can be loaded into a logging system that supports analytics to slice and dice Agent log events and obtain a deep understanding of the workload. However, this feature emits a lot of logs, so keep that in mind before enabling it.
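As an illustration, in a Kubernetes deployment the log level could be set through the container's env list. This is a minimal sketch, not an official manifest: the container name warpstream-agent is inferred from the Datadog annotation example further down this page and may differ in your environment.

```yaml
spec:
  template:
    spec:
      containers:
        - name: warpstream-agent  # assumed container name; match your own manifest
          env:
            - name: WARPSTREAM_LOG_LEVEL
              value: "warn"  # use "analytics" for the detailed JSON logs
```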

Health Check

The Agent exposes an HTTP health check endpoint at $IP:8080/v1/status. A successful response is the string OK with a 200 status code.
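If the Agent runs on Kubernetes, this endpoint can be wired into a container probe. The fragment below is a sketch under assumptions: the container name and the timing values are illustrative and are not WarpStream recommendations. Kubernetes treats any HTTP status from 200 to 399 as a successful probe, so the 200 response described above passes.

```yaml
spec:
  template:
    spec:
      containers:
        - name: warpstream-agent  # assumed container name
          livenessProbe:
            httpGet:
              path: /v1/status
              port: 8080
            initialDelaySeconds: 10  # illustrative value
            periodSeconds: 10        # illustrative value
```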

Metrics

All WarpStream Agent metrics begin with the warpstream_agent prefix.

WarpStream Agent metrics exposed via Prometheus also include the Prometheus namespace warpstream, so their full prefix is warpstream_warpstream_agent.

Full List of Important Metrics/Logs

In Important Metrics and Logs you can find a full list of the most important metrics and logs for monitoring the agent.

Alerting on Metrics

In Recommended List of Alerts you will find a list of key metrics for which you should configure alerts to detect issues in your agent effectively.

Datadog

Prometheus Exporter (pull)

We recommend following the Datadog instructions for scraping Prometheus/OpenTelemetry metrics using the Datadog Agent. Configuration will vary from environment to environment, but you should end up with something like the following configuration (Kubernetes example):

spec:
  template:
    metadata:
      annotations:
        ad.datadoghq.com/warpstream-agent.checks: |
          {
            "openmetrics": {
              "init_config": {},
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:8080/metrics",
                  "metrics": [".*"],
                  "send_distribution_buckets": true,
                  "collect_counters_with_distributions": true,
                  "max_returned_metrics": 2000
                }
              ]
            }
          }

This configuration tells the Datadog Agent to scrape the WarpStream Agent on port 8080 and to collect all the custom metrics that the WarpStream Agent exposes.

The Datadog Agent will scrape only 2,000 metrics by default. This limit may be too low for WarpStream if you have many topics and consumer groups and/or have high cardinality metrics enabled. If you observe dropped or missing metrics, consider increasing the max_returned_metrics value.

We also have a pre-made Datadog Dashboard that you can just import directly using the import JSON feature.

Statsd Client (Push)

Alternatively, the WarpStream Agents embed the Datadog statsd metrics client. So if you prefer to avoid scraping the Prometheus endpoint and use a push-based approach instead, you can configure the Agents to push statsd metrics to the Datadog Agent by setting the -enableDatadogMetrics flag or adding WARPSTREAM_ENABLE_DATADOG_METRICS=true as an environment variable. Additionally, to configure the Datadog client properly, the DD_AGENT_HOST environment variable needs to be set to the host IP. For Agents running in AWS with IMDS enabled this step can be skipped.
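On Kubernetes, one common way to supply the host IP is the downward API. The fragment below is a sketch that assumes the Datadog Agent runs as a DaemonSet and accepts statsd traffic on each node's host IP. It belongs in the env list of the WarpStream Agent container.

```yaml
env:
  - name: WARPSTREAM_ENABLE_DATADOG_METRICS
    value: "true"
  - name: DD_AGENT_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP  # IP of the node the pod is scheduled on
```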

Prometheus

The WarpStream Agents expose a traditional Prometheus metrics endpoint that is enabled by default on port 8080.

Prometheus metrics are automatically exposed on the Agent "internal port", which by default is the same as the Kinesis port (8080 unless overridden).

If you set an explicit port override for the Kinesis port or the Agent "internal" port, then you'll need to update your Prometheus scrape configuration port as well.

All Prometheus metrics are exposed under the warpstream namespace (see the Metrics section above for more details).
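A minimal Prometheus scrape configuration for the Agent might look like the following sketch. The job name, target hostname, and scrape interval are hypothetical placeholders. The /metrics path matches the endpoint used in the Datadog example above. Change the port if you have overridden the Agent's internal or Kinesis port.

```yaml
scrape_configs:
  - job_name: warpstream-agent  # hypothetical job name
    metrics_path: /metrics
    scrape_interval: 30s  # illustrative value
    static_configs:
      - targets:
          - "warpstream-agent.example.internal:8080"  # hypothetical host; default port
```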

Observability

The WarpStream Agent publishes the following metrics every minute, which you can use to gain insight into what is happening under the hood.

Some of the metrics, particularly the consumer group metrics, can become very high cardinality if the cluster contains a lot of topics or partitions. To reduce the cardinality of the consumer group lag metrics, you can disable them entirely by using the disableConsumerGroupMetrics flag or by setting WARPSTREAM_DISABLE_CONSUMER_GROUP_METRICS=true as an environment variable.

In addition, the partition tag is disabled by default to reduce cardinality. If you want to enable it, set the disableConsumerGroupMetrics flag or WARPSTREAM_DISABLE_CONSUMER_GROUP_METRICS environment variable to an empty string (the default value is "partition").

Kafka

Name

Description

Schema Registry

Name

Description