Important Metrics and Logs

This page lists a sample of the most important logs and metrics emitted by the Agents.

Before reading this documentation page, please familiarize yourself with how logs and metrics are emitted from the Agents.

Enable High-Cardinality Tags

Some metrics support the use of tags with potentially high cardinality, such as tags based on topics. This feature is disabled by default.

To enable high-cardinality tags:

Use the command-line flag -kafkaHighCardinalityMetrics.
Alternatively, set the environment variable WARPSTREAM_KAFKA_HIGH_CARDINALITY_METRICS=true.

Tags that require enabling are clearly marked with "(requires enabling high-cardinality tags)" next to their name.

Furthermore, per-topic distribution / histogram metrics have 10x to 20x higher cardinality than even the regular high-cardinality metrics. For that reason, high-cardinality (per-topic) metrics of type histogram are not emitted with per-topic tags unless the -kafkaHighCardinalityDistributionMetrics flag or the WARPSTREAM_KAFKA_HIGH_CARDINALITY_DISTRIBUTION_METRICS environment variable is set to true.
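
The gating described above can be summarized in a short sketch. The function and parameter names are illustrative and not part of the Agent. The sketch also assumes that histogram metrics are gated only by the distribution setting; the text does not say whether the general high-cardinality setting is required as well.

```python
def emits_topic_tag(metric_type: str,
                    high_cardinality: bool,
                    high_cardinality_distribution: bool) -> bool:
    """Return True if a per-topic metric of this type would carry the topic tag.

    high_cardinality mirrors -kafkaHighCardinalityMetrics.
    high_cardinality_distribution mirrors -kafkaHighCardinalityDistributionMetrics.
    Assumption: histograms depend only on the distribution setting.
    """
    if metric_type == "histogram":
        return high_cardinality_distribution
    # Counters and gauges only need the regular high-cardinality setting.
    return high_cardinality
```

For example, with only the regular setting enabled, a counter such as warpstream_agent_kafka_produce_uncompressed_bytes_counter would carry the topic tag, while a histogram would not.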

Datadog metrics

Starting with WarpStream Agent v679, all metrics in Datadog start with warpstream. instead of warpstream_. This documentation continues to refer to metrics by their warpstream_ names, so you have to convert the prefix yourself when using Datadog with Agent v679 or later.

You can fall back to the previous behavior by setting the WARPSTREAM_DATADOG_NORMALIZER_PREFIX_WITH_DOT environment variable to false.

This change accompanies the official release of our Datadog integration, which makes all WarpStream Agent metrics free if you install the integration (and use the new naming convention).
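
As a rough illustration of the conversion, the sketch below maps a metric name as written in this documentation to the name you would look for in Datadog. The helper name and its prefix_with_dot parameter are hypothetical; the parameter mirrors the WARPSTREAM_DATADOG_NORMALIZER_PREFIX_WITH_DOT setting. It assumes that only the prefix changes and the rest of the name is left as-is.

```python
LEGACY_PREFIX = "warpstream_"
DATADOG_PREFIX = "warpstream."


def to_datadog_name(metric: str, prefix_with_dot: bool = True) -> str:
    """Convert a documented metric name to the name expected in Datadog.

    prefix_with_dot=False models the fallback to the previous naming.
    Assumption: only the leading "warpstream_" is rewritten.
    """
    if prefix_with_dot and metric.startswith(LEGACY_PREFIX):
        return DATADOG_PREFIX + metric[len(LEGACY_PREFIX):]
    return metric
```

With this sketch, to_datadog_name("warpstream_consumer_group_lag") yields "warpstream.consumer_group_lag", and passing prefix_with_dot=False leaves the name unchanged.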

Overview

System performance metrics and logs, focusing on error detection and resource consumption.

[logs] Error Logs

query: status:error
metric: *
note that some amount of error logs can be normal depending on the situation. Please contact the WarpStream team if you think a particular error log is too noisy!

[metrics] Memory usage by host

metric: container.memory.usage
group_by: host

[metrics] Used Cores

metric: container.cpu.usage
group_by:

[metrics] Used Cores By Host

metric: container.cpu.usage
group_by: host

Kafka

Metrics and logs associated with the Kafka protocol provide insights into message handling, latency, and throughput.

[metrics] Produce Throughput (uncompressed bytes)

metric: warpstream_agent_kafka_produce_uncompressed_bytes_counter
group_by: topic (requires enabling high-cardinality tags)
type: counter
unit: bytes
description: number of uncompressed bytes that were produced.

[metrics] Produce Throughput (compressed bytes)

metric: warpstream_agent_kafka_produce_compressed_bytes_counter

[metrics] Produce Throughput (records)

metric: warpstream_agent_segment_batcher_flush_num_records_counter
group_by:
type: counter
unit: records
description: number of records that were produced.

[metrics] Fetch Throughput (uncompressed bytes)

metric: warpstream_agent_kafka_fetch_uncompressed_bytes_counter
group_by: topic (requires enabling high-cardinality tags)
type: counter
unit: bytes
description: number of uncompressed bytes that were fetched.

[metrics] Fetch Throughput (compressed bytes)

metric: warpstream_agent_kafka_fetch_compressed_bytes_counter
group_by: topic (requires enabling high-cardinality tags)
type: counter
unit: bytes
description: number of compressed bytes that were fetched.

[metrics] Consumer Groups Lag

metric: warpstream_consumer_group_lag
group_by: virtual_cluster_id, topic, consumer_group, partition
tags: virtual_cluster_id, topic, consumer_group, partition
type: gauge
unit: Kafka offsets
description: consumer group lag measured in offsets.
note that this is not a high cardinality metric, but you can reduce its cardinality by choosing which of the tags above are not published, using the disableConsumerGroupsMetricsTags flag (see agent configuration documentation)

[metrics] Kafka Inflight Connections

metric: warpstream_agent_kafka_inflight_connections
group_by:
type: gauge
unit: n/a
description: number of currently inflight / active connections.

[metrics] Kafka Inflight Requests (per Connection)

metric: warpstream_agent_kafka_inflight_request_per_connection
group_by:
type: histogram
unit: n/a
description: number of currently in-flight Kafka protocol requests for individual connections.

[metrics] Kafka Request Outcome

metric: warpstream_agent_kafka_request_outcome
group_by: kafka_key,outcome
type: counter
unit: n/a
description: outcome (success, error, etc) for each Kafka protocol request.
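
One common way to use an outcome counter is to turn it into an error ratio. The sketch below is a minimal example and not an official dashboard query: it assumes you have already summed the counter per outcome over some time window, and that successful requests are tagged with the outcome value "success". The function name is hypothetical.

```python
from typing import Mapping


def error_ratio(counts_by_outcome: Mapping[str, float]) -> float:
    """Fraction of requests whose outcome was anything other than "success".

    counts_by_outcome maps an outcome tag value to the number of requests
    observed with that outcome during the window being evaluated.
    """
    total = sum(counts_by_outcome.values())
    if total == 0:
        return 0.0  # no traffic in the window, so nothing failed
    successes = counts_by_outcome.get("success", 0.0)
    return (total - successes) / total
```

Grouping the input by kafka_key first gives a per-request-type ratio, which is usually more actionable than a single cluster-wide number.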

[metrics] Kafka Request Latency

metric: warpstream_agent_kafka_request_latency
group_by: kafka_key
type: histogram
unit: seconds
description: latency for processing each Kafka protocol request.

[metrics] Max Offset

metric: warpstream_max_offset
group_by: virtual_cluster_id, topic, partition
tags: virtual_cluster_id, topic, partition
type: gauge
unit: Kafka offset
description: max offset for every topic-partition. Can be useful for monitoring lag for applications that don't use consumer groups and manage offsets externally.
note that this is not a high cardinality metric, but you can reduce its cardinality by choosing which of the tags above are not published, using the disableConsumerGroupsMetricsTags flag (see agent configuration documentation)
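
For applications that manage offsets externally, lag can be approximated by comparing this gauge with the offset the application has stored. The sketch below is illustrative only: the function name is hypothetical, and it assumes both offsets follow the same convention (for example, both are "next offset to read"), so verify that against your own offset store before alerting on it.

```python
def external_lag(max_offset: int, committed_offset: int) -> int:
    """Approximate lag, in offsets, for one topic-partition.

    max_offset: value of warpstream_max_offset for the partition.
    committed_offset: offset stored by the application outside Kafka.
    The result is clamped at zero because the two values are sampled at
    different times and the stored offset may briefly appear ahead.
    """
    return max(max_offset - committed_offset, 0)
```

For instance, external_lag(1500, 1200) returns 300.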

[metrics] Kafka Topic Count

metric: warpstream_topics_count
group_by: virtual_cluster_id
type: gauge
unit: n/a
description: how many topics are currently in your cluster

[metrics] Kafka Topic Limit

metric: warpstream_topics_count_limit
group_by: virtual_cluster_id
type: gauge
unit: n/a
description: how many topics are allowed in your cluster. Upgrade your cluster tier or contact the WarpStream team if more are needed

[metrics] Kafka Partition Count

metric: warpstream_partitions_count
group_by: virtual_cluster_id
type: gauge
unit: n/a
description: how many partitions are currently in your cluster

[metrics] Kafka Partition Limit

metric: warpstream_partitions_count_limit
group_by: virtual_cluster_id
type: gauge
unit: n/a
description: how many partitions are allowed in your cluster. Upgrade your cluster tier or contact the WarpStream team if more are needed
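
The count and limit gauges are most useful when read together. The following sketch, with hypothetical function names and an arbitrary 80% threshold chosen purely for illustration, shows one way to flag a cluster that is approaching its topic or partition limit.

```python
def limit_usage(count: float, limit: float) -> float:
    """Fraction of the limit currently in use (for example 0.85 for 85%)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return count / limit


def near_limit(count: float, limit: float, threshold: float = 0.8) -> bool:
    """True when usage has reached the given fraction of the limit.

    The default threshold is illustrative, not a WarpStream recommendation.
    """
    return limit_usage(count, limit) >= threshold
```

The same helper applies to either pair: warpstream_topics_count with warpstream_topics_count_limit, or warpstream_partitions_count with warpstream_partitions_count_limit.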

[metrics] Kafka Records Count

metric: warpstream_num_records
group_by: topic, virtual_cluster_id
type: gauge
unit: n/a
description: how many records are currently in a given topic or cluster
note this number might not match the number of active keys if you are using compacted topics

Control Plane

Visualizing WarpStream control plane latency and error rates can be useful for debugging.

[metrics] Operations Outcome

metric: warpstream_agent_control_plane_operation_counter
group_by: virtual_cluster_id, outcome, operation
tags: virtual_cluster_id, outcome, operation
type: counter
unit: request
description: count of RPCs between Agents and control plane, groupable by outcome (success/error) and operation (RPC type).

[metrics] Operations Latency

metric: warpstream_agent_control_plane_operation_latency
group_by: virtual_cluster_id, outcome, operation
tags: virtual_cluster_id, outcome, operation
type: histogram
unit: seconds
description: latency of RPCs between Agents and control plane, groupable by outcome (success/error) and operation (RPC type).

Schema Registry

Metrics and logs associated with the hosted schema registry provide insights into request handling, latency, and throughput.

Enable Schema Registry Request Logs

Logging schema registry requests is disabled by default.

To enable request logs:

Use the command-line flag -schemaRegistryEnableLogRequest.
Alternatively, set the environment variable WARPSTREAM_SCHEMA_REGISTRY_ENABLE_LOG_REQUEST=true.

[metrics] Schema Registry Requests

metric: warpstream_agent_schema_registry_outcome
group_by: schema_registry_operation,outcome
type: counter
unit: n/a
description: outcome (success, error, etc) for each Schema Registry request.

[metrics] Schema Registry Latency

metric: warpstream_agent_schema_registry_request_latency
group_by: schema_registry_operation,outcome
type: histogram
unit: seconds
description: latency for processing each Schema Registry request.

[metrics] Schema Registry Inflight Connections

metric: warpstream_agent_schema_registry_inflight_connections
group_by:
type: gauge
unit: n/a
description: number of currently inflight / active connections.

[logs] Schema Registry Request Logs

query: service:sr-agent schema_registry_request
group_by: outcome, request_type, schema_id, subject, version
description: every schema registry request emits a log with the attributes request_type and outcome. Some requests also emit additional attributes, such as schema_id and subject, where applicable. This is disabled by default and requires setting the command-line flag -schemaRegistryEnableLogRequest or the environment variable WARPSTREAM_SCHEMA_REGISTRY_ENABLE_LOG_REQUEST=true.

[metrics] Num Invalid Records

metric: warpstream_schema_validation_invalid_record_count
group_by: topic, reason
type: counter
unit: n/a
description: counter of the number of invalid records that the agent detects when schema validation is enabled.
note that topic is only available as a tag if high-cardinality tags are enabled.

[metrics] Schema Linking Number of Source subject versions

metric: warpstream_schema_linking_source_subject_versions_count
group_by: sync_id, config_id
type: gauge
unit: n/a
description: number of source subject versions that Schema Linking is currently managing

[metrics] Schema Linking Number of Newly Migrated Subject Versions

metric: warpstream_schema_linking_newly_migrated_subject_versions
group_by: sync_id, config_id
type: gauge
unit: n/a
description: number of subject versions newly migrated by the latest sync. This should usually be zero unless new schemas are found. Schemas only count as new for one sync, and the sync frequency is configurable with a default of 5m

[metrics] Schema Versions Count

metric: warpstream_schema_versions_count
group_by:
type: gauge
unit: n/a
description: number of schemas currently in your registry.

[metrics] Schema Versions Limit

metric: warpstream_schema_versions_limit
group_by:
type: gauge
unit: n/a
description: number of schemas allowed in your registry.

Background Jobs

The control plane assigns agents background jobs for things like compaction or retention. The metrics and logs below cover the efficiency and status of these background operations, with a focus on compaction and the scanning of obsolete files.

[logs] Compactions by Status and Level

query: service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info
metric: *
group_by: status,@stream_job_input.compaction.compaction_level
description: number of compactions by compaction level and status (success, error). Occasional compaction errors are normal and expected, but most compactions should succeed.

[metrics] Executed Jobs

metric: warpstream_agent_run_and_ack_job_outcome
group_by: job_type
tags: job_type, outcome
type: counter
unit: n/a
description: outcome (and successful acknowledgement back to the control plane) of all jobs, groupable by job type and outcome. Jobs may fail intermittently, but most jobs should complete successfully.

[metrics] Compaction Files per Level (Indicator of Compaction Lag)

metric: warpstream_files_count
group_by: compaction_level
type: gauge
unit: n/a
description: number of files in the LSM for each compaction level. The number of L2 files may be high for high volume / long retention workloads, but the number of files at L0 and L1 should always be low (< 1000 each).
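
The guidance above (L0 and L1 should each stay below 1000 files) can be turned into a simple check. This is a sketch under assumptions: the function name is hypothetical, and levels are represented as integers 0, 1 and 2, so you may need to map the actual compaction_level tag values to that form.

```python
from typing import List, Mapping

MAX_LOW_LEVEL_FILES = 1000  # threshold for L0 and L1 taken from the description


def lagging_levels(files_by_level: Mapping[int, int]) -> List[int]:
    """Return the low levels (0 and 1) whose file count reached the threshold.

    L2 is deliberately ignored: its file count may be high for high volume
    or long retention workloads.
    """
    return [
        level
        for level in (0, 1)
        if files_by_level.get(level, 0) >= MAX_LOW_LEVEL_FILES
    ]
```

An empty result means compaction is keeping up at the levels that matter; a non-empty one is a hint of compaction lag.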

[metrics] Dead Files Scanner: Checked vs Deleted Files

metric: warpstream_agent_deadscanner_outcomes
group_by: outcome
tags: outcome
type: counter
unit: n/a
description: the "deadscanner" is a job type in WarpStream that instructs the Agents to scan the object storage bucket for files that are "dead". A file is considered dead if it exists in the object store, but the WarpStream control plane / metadata store has no record of it, indicating it failed to be committed or was deleted by data expiration / compaction.

[logs] P99 Compaction Duration by Level

query: service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info
metric: @duration_ms
group_by: status,@stream_job_input.compaction.compaction_level
description: duration of compactions by compaction level. L1 and L2 compaction duration will vary based on workload, but L0 compactions should always be fast (<20 seconds).
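
If you export the @duration_ms values from these logs, a P99 can be computed offline as a sanity check. The sketch below uses the nearest-rank percentile definition, which may differ slightly from how your log platform computes percentiles. The function names are hypothetical; the 20-second limit comes from the description above.

```python
from typing import Sequence


def p99(values: Sequence[float]) -> float:
    """99th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # rank = ceil(0.99 * n), computed with integers to avoid float rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def l0_is_slow(durations_ms: Sequence[float], limit_seconds: float = 20.0) -> bool:
    """True when the P99 duration of L0 compactions reaches the limit."""
    return p99(durations_ms) / 1000.0 >= limit_seconds
```

Pass only the durations of L0 compactions to l0_is_slow, since L1 and L2 durations vary with the workload.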

[logs] Compaction File Output Size

query: status:info
metric: @stream_job_output.compaction.file_metadatas.index_offset
group_by: source,@stream_job_input.compaction.compaction_level
description: compressed size of files generated by compaction. Useful for understanding the size of different files at different levels, but not something that needs to be closely monitored or paid attention to.

Object Storage

Metrics and logs on object storage operations' performance and usage patterns, offering insights into data retrieval, storage efficiency, and caching mechanisms.

[metrics] S3 Operations (PUT)

metric: warpstream_blob_store_operation_latency
filter_tag: operation:put_bytes, operation:put_stream
group_by: operation
type: histogram
unit: seconds
description: latency to perform PUT requests. Spikes of this value can indicate issues with the underlying object store.
note that this metric is a histogram, so although it reports latency, counting the samples it emits gives the number of operations.

[metrics] S3 Operations (GET)

metric: warpstream_blob_store_operation_latency
filter_tag: operation:get_stream, operation:get_stream_range
group_by: operation
type: histogram
unit: seconds
description: time to first byte for GET requests. Spikes of this value can indicate issues with the underlying object store.
note that this metric is a histogram, so although it reports latency, counting the samples it emits gives the number of operations.
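
To illustrate how a latency histogram doubles as an operation counter, the sketch below derives an operation rate from two readings of the histogram's sample count. It assumes your metrics backend exposes a cumulative sample count for the histogram (as Prometheus-style histograms do) and that the count restarts from zero when an Agent restarts. The function name is hypothetical.

```python
def operations_per_second(count_start: float,
                          count_end: float,
                          interval_seconds: float) -> float:
    """Average operation rate between two readings of a cumulative count."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = count_end - count_start
    if delta < 0:
        # The count went backwards, so treat it as a reset: everything
        # observed since the restart counts as new operations.
        delta = count_end
    return delta / interval_seconds
```

For example, readings of 1000 and 1600 taken 60 seconds apart correspond to 10 operations per second. Filtering by the operation tag first separates PUT from GET traffic.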


