Monitor the Agents
Logging
By default, the WarpStream Agent is configured to run with log level info. However, this can be changed with the WARPSTREAM_LOG_LEVEL environment variable. For example, if the info level logs are too noisy for you, you can set WARPSTREAM_LOG_LEVEL=warn.
The WarpStream Agents have an additional special log level called analytics that can be enabled by setting WARPSTREAM_LOG_LEVEL=analytics. This enables extremely detailed JSON logging that can be loaded into a logging system that supports analytics to slice and dice Agent log events and obtain a deep understanding of the workload. However, this feature emits a lot of logs, so keep that in mind before enabling it.
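As a rough illustration of slicing JSON log events, the sketch below counts log lines by the value of one field. The field names (level, msg) and the sample lines are hypothetical; the actual schema of the analytics logs is not described on this page, so adapt the field name to what your Agents emit.

```python
import json
from collections import Counter


def count_by_field(lines, field):
    """Count JSON log lines by the value of `field`, skipping non-JSON lines."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text lines are ignored
        if isinstance(event, dict) and field in event:
            counts[str(event[field])] += 1
    return counts


# Hypothetical sample lines; real analytics events will have different fields.
sample = [
    '{"level": "analytics", "msg": "fetch"}',
    '{"level": "analytics", "msg": "produce"}',
    '{"level": "analytics", "msg": "fetch"}',
    "not json",
]
print(count_by_field(sample, "msg"))
```

In practice a logging system with analytics support does this grouping for you; the sketch only shows the kind of question the JSON format makes easy to ask.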
Health Check
The Agent exposes an HTTP health check endpoint at $IP:8080/v1/status. A successful response is the string OK with a 200 status code.
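A minimal sketch of a health probe against this endpoint, using only the Python standard library. The function name, the timeout value, and the example address are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def agent_is_healthy(base_url, timeout=2.0):
    """Return True if GET {base_url}/v1/status answers 200 with body "OK"."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/status", timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace").strip()
            return resp.status == 200 and body == "OK"
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or an HTTP error status


if __name__ == "__main__":
    # Hypothetical address; replace with the IP of one of your Agents.
    print(agent_is_healthy("http://127.0.0.1:8080"))
```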
Metrics
All WarpStream Agent metrics begin with the warpstream_agent prefix.
WarpStream Agent metrics exposed via Prometheus also include the Prometheus namespace warpstream, so those metrics carry the combined prefix warpstream_warpstream_agent.
Full List of Important Metrics/Logs
In Important Metrics and Logs you can find a full list of the most important metrics and logs for monitoring the agent.
Alerting on Metrics
In Recommended List of Alerts you will find a list of key metrics for which you should configure alerts to detect issues in your agent effectively.
Datadog
Prometheus Exporter (pull)
We recommend following the Datadog instructions for scraping Prometheus/OpenTelemetry metrics using the Datadog Agent. Configuration will vary from environment to environment, but you should end up with a configuration (for example, annotations on the Agent pods in Kubernetes) which specifies that the Datadog Agent should scrape the WarpStream Agent at port 8080 for metrics, and that it should scrape all the custom metrics that the WarpStream Agent exposes.
The Datadog Agent will scrape only 2,000 metrics by default. This limit may be too low for WarpStream if you have many topics and consumer groups and/or have high cardinality metrics enabled. If you observe dropped or missing metrics, consider increasing this value.
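The sketch below builds a Datadog Autodiscovery annotation for the OpenMetrics check as a Python dictionary, which can then be pasted into a pod template. It is not the exact configuration from the WarpStream documentation: the container name warpstream-agent, the /metrics path, and the namespace value are assumptions. The max_returned_metrics option is the Datadog setting that raises the 2,000 metric limit.

```python
import json


def datadog_openmetrics_annotation(container, port=8080, max_returned_metrics=2000):
    """Build a Datadog Autodiscovery annotation for an OpenMetrics scrape."""
    check = {
        "openmetrics": {
            "init_config": {},
            "instances": [
                {
                    # %%host%% is resolved by the Datadog Agent at runtime.
                    # The /metrics path is an assumption.
                    "openmetrics_endpoint": f"http://%%host%%:{port}/metrics",
                    "namespace": "warpstream",  # assumed Datadog-side prefix
                    "metrics": [".*"],  # collect every exposed metric
                    "max_returned_metrics": max_returned_metrics,
                }
            ],
        }
    }
    return {f"ad.datadoghq.com/{container}.checks": json.dumps(check, indent=2)}


# Hypothetical container name and a raised metric limit.
annotation = datadog_openmetrics_annotation("warpstream-agent", max_returned_metrics=10000)
for key, value in annotation.items():
    print(key)
    print(value)
```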
We also have a pre-made Datadog Dashboard that you can just import directly using the import JSON feature.
Statsd Client (Push)
Alternatively, the WarpStream Agents embed the Datadog statsd metrics client. So if you prefer to avoid scraping the Prometheus endpoint and use a push-based approach instead, you can configure the Agents to push statsd metrics to the Datadog Agent by setting the -enableDatadogMetrics flag or adding WARPSTREAM_ENABLE_DATADOG_METRICS=true as an environment variable.
Prometheus
The WarpStream Agents expose a traditional Prometheus metrics endpoint that is enabled by default on port 8080.
Prometheus metrics will automatically be exposed on the Agent "internal port", which by default is the same as the Kinesis port, which defaults to 8080.
If you set an explicit port override for the Kinesis port or the Agent "internal" port, then you'll need to update your Prometheus scrape configuration port as well.
All Prometheus metrics are exposed under the warpstream namespace (see the Metrics section above for more details).
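To check what a scrape returns, the following sketch filters Prometheus text exposition output down to the samples under the warpstream namespace. The sample text and metric names in it are hypothetical, and fetching the text (for example from http://$IP:8080/metrics, where the /metrics path is an assumption) is left out so the example runs offline.

```python
def filter_metric_lines(exposition_text, prefix="warpstream_"):
    """Return sample lines whose metric name starts with `prefix`."""
    selected = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if line.startswith(prefix):
            selected.append(line)
    return selected


# Hypothetical scrape output; real metric names and values will differ.
scrape = """\
# HELP go_goroutines Number of goroutines.
# TYPE go_goroutines gauge
go_goroutines 42
# TYPE warpstream_example_metric gauge
warpstream_example_metric{topic="orders"} 17
warpstream_warpstream_agent_example_total 3
"""
for sample_line in filter_metric_lines(scrape):
    print(sample_line)
```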
Observability
The WarpStream Agent publishes the following metrics every minute. You can use them to gain insight into what is happening under the hood.
Some of the metrics, particularly the consumer group metrics, can become very high cardinality if the cluster contains a lot of topics or partitions. To reduce the cardinality of the consumer group lag metrics, you can disable them entirely, either by using the disableConsumerGroupMetrics flag or by setting WARPSTREAM_DISABLE_CONSUMER_GROUP_METRICS=true as an environment variable.
Alternatively, you can disable specific high cardinality dimensions like partition by providing a comma-separated list of tags to drop, using either the disableConsumerGroupsMetricsTags=partition flag or the WARPSTREAM_DISABLE_CONSUMER_GROUP_METRICS_TAGS=partition environment variable.
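To get a feel for how much dropping a tag reduces cardinality, the sketch below counts distinct time series before and after removing the partition tag. The workload (2 consumer groups, 3 topics, 64 partitions each) is hypothetical, and the code only counts label combinations; it does not model how the Agent combines values once a tag is dropped.

```python
def count_series(label_sets, drop_tags=()):
    """Count distinct time series after removing the tags in `drop_tags`."""
    reduced = {
        tuple(sorted((k, v) for k, v in labels.items() if k not in drop_tags))
        for labels in label_sets
    }
    return len(reduced)


# Hypothetical workload: 2 consumer groups x 3 topics x 64 partitions.
label_sets = [
    {"consumer_group": f"g{g}", "topic": f"t{t}", "partition": str(p)}
    for g in range(2)
    for t in range(3)
    for p in range(64)
]
print(count_series(label_sets))                  # 384 series per metric
print(count_series(label_sets, ("partition",)))  # 6 series per metric
```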
Kafka
warpstream_consumer_group_lag
Difference (in offsets) between the max offset and the committed offset for every active consumer group.
Tagged by virtual_cluster_id, topic, consumer_group and partition.
warpstream_consumer_group_estimated_lag_very_coarse_do_not_use_to_measure_e2e_seconds
Gives a rough estimate of how far behind (in seconds) a consumer group is from the latest messages. Note: This is NOT for precise measurement; it's a coarse estimate.
Tagged by virtual_cluster_id, topic, consumer_group and partition.
warpstream_consumer_group_generation_id
A unique identifier that increases with every consumer group rebalance. This allows you to easily track the number and frequency of rebalances.
Tagged by virtual_cluster_id and consumer_group.
warpstream_consumer_group_max_offset
Max offset of a given topic-partition for every topic-partition in every consumer group.
Tagged by virtual_cluster_id, topic, consumer_group and partition.
warpstream_consumer_group_state
State of each consumer group (stable, rebalancing, empty, etc.).
Tagged by consumer_group and group_state.
warpstream_consumer_group_num_members
Number of members in each consumer group.
Tagged by consumer_group.
warpstream_consumer_group_num_topics
Number of topics in each consumer group.
Tagged by consumer_group.
warpstream_consumer_group_num_partitions
Number of partitions in each consumer group.
Tagged by consumer_group.
warpstream_files_count
Number of files at each compaction level, so that users can monitor whether they are experiencing compaction lag.
Tagged by compaction_level (0, 1 or 2 for now).
warpstream_topics_count
Total number of topics in the cluster.
Not tagged.
warpstream_topics_count_limit
Maximum number of topics allowed in the cluster. Request a limit increase from the WarpStream team if necessary.
Not tagged.
warpstream_partitions_count
Total number of partitions in the cluster.
Not tagged.
warpstream_partitions_count_limit
Maximum number of partitions allowed in the cluster. Request a limit increase from the WarpStream team if necessary.
Not tagged.
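The definitions of warpstream_consumer_group_lag and warpstream_consumer_group_generation_id above can be sketched in a few lines. The helper names and sample values are hypothetical. The rebalance counter assumes the generation ID grows by one per rebalance, and clamping lag at zero is a defensive choice of this sketch, not something the metric is documented to do.

```python
def consumer_group_lag(max_offset, committed_offset):
    """Lag in offsets between the max offset and the committed offset."""
    # Clamp at zero in case the two samples were taken at slightly different times.
    return max(0, max_offset - committed_offset)


def count_rebalances(generation_ids):
    """Sum the increases of the generation ID across time-ordered samples."""
    # Assumes one rebalance per increment; decreases (e.g. a reset) are ignored.
    return sum(
        cur - prev
        for prev, cur in zip(generation_ids, generation_ids[1:])
        if cur > prev
    )


# Hypothetical samples for one partition and one consumer group.
print(consumer_group_lag(max_offset=1500, committed_offset=1420))  # 80
print(count_rebalances([5, 5, 6, 6, 8]))                           # 3
```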
Schema Registry
warpstream_schema_versions_count
Total number of schema versions in the schema registry cluster.
Not tagged.
warpstream_schema_versions_limit
Maximum number of schema versions allowed in the cluster. Request a limit increase from the WarpStream team if necessary.
Not tagged.
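Each count metric above has a matching limit metric (topics, partitions, schema versions), so a simple way to use the pairs is to compare them as a utilization ratio. The sketch below is one possible approach; the 80% threshold and the sample values are hypothetical, not WarpStream recommendations.

```python
def limit_utilization(count, limit):
    """Fraction of a limit in use, or None when the limit is missing or zero."""
    if not limit:
        return None
    return count / limit


def needs_limit_increase(count, limit, threshold=0.8):
    """True when usage has reached `threshold` (a hypothetical 80% default)."""
    utilization = limit_utilization(count, limit)
    return utilization is not None and utilization >= threshold


# Hypothetical readings of a count metric and its limit metric.
print(limit_utilization(450, 500))     # 0.9
print(needs_limit_increase(450, 500))  # True
print(needs_limit_increase(100, 500))  # False
```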