Important Metrics and Logs

On this page, we include a sample list of the most important logs and metrics emitted by the Agents.

Before reading this documentation page, please familiarize yourself with how logs and metrics are emitted from the Agents.

Datadog metrics

Starting with WarpStream Agent v679, all metrics in Datadog start with warpstream. instead of warpstream_ . The rest of our documentation continues to refer to metrics with the warpstream_ prefix, so you have to convert the names yourself when you use Datadog with WarpStream Agent v679 or later.

You can fall back to the previous behavior by setting the WARPSTREAM_DATADOG_NORMALIZER_PREFIX_WITH_DOT environment variable to false.

This change comes along with the official release of our Datadog integration, which makes all WarpStream Agent metrics free if you install the integration (and use the new naming convention).
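The renaming described above can be sketched as a small helper. This is a minimal illustration, assuming only the leading warpstream_ prefix changes and the rest of the name is left untouched; the function name to_datadog_name is hypothetical and not part of any WarpStream or Datadog API.

```python
OLD_PREFIX = "warpstream_"
NEW_PREFIX = "warpstream."


def to_datadog_name(documented_name: str) -> str:
    """Map a metric name as written in the docs to its Datadog name (Agent v679+).

    Assumption: only the leading prefix is rewritten; other underscores stay as-is.
    """
    if documented_name.startswith(OLD_PREFIX):
        return NEW_PREFIX + documented_name[len(OLD_PREFIX):]
    # Names without the documented prefix (e.g. container.* metrics) are returned unchanged.
    return documented_name


print(to_datadog_name("warpstream_consumer_group_lag"))  # warpstream.consumer_group_lag
```

If you set WARPSTREAM_DATADOG_NORMALIZER_PREFIX_WITH_DOT to false, no conversion is needed and the documented names apply as written.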

Overview

System performance metrics and logs, focusing on error detection and resource consumption.

[logs] Error Logs

  • query: status:error

  • metric: *

  • note that some amount of error logs can be normal depending on the situation. Please contact the WarpStream team if you think a particular error log is too noisy!

[metrics] Memory usage by host

  • metric: container.memory.usage

  • group_by: host

[metrics] Used Cores

  • metric: container.cpu.usage

  • group_by:

[metrics] Used Cores By Host

  • metric: container.cpu.usage

  • group_by: host

Kafka

Metrics and logs associated with the Kafka protocol provide insights into message handling, latency, and throughput.

[metrics] Produce Throughput (uncompressed bytes)

  • metric: warpstream_agent_kafka_produce_uncompressed_bytes_counter

  • group_by: topic (requires enabling high-cardinality tags)

  • type: counter

  • unit: bytes

  • description: number of uncompressed bytes that were produced.
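Because this metric is a counter, throughput is normally derived as a rate over a time window. Most monitoring backends do this for you; the sketch below only illustrates the arithmetic, assuming cumulative counter samples (as in a Prometheus-style scrape). The function name and the reset handling are illustrative assumptions, not WarpStream behavior.

```python
def bytes_per_second(prev_value: float, prev_ts: float,
                     curr_value: float, curr_ts: float) -> float:
    """Compute throughput from two cumulative counter samples.

    Timestamps are in seconds. Assumption: a decrease in the counter means
    the process restarted and the counter began again from zero.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # Counter reset: everything counted since the restart is the new delta.
        delta = curr_value
    return delta / elapsed


print(bytes_per_second(1000, 0.0, 4000, 10.0))  # 300.0
```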

[metrics] Produce Throughput (compressed bytes)

  • metric: warpstream_agent_kafka_produce_compressed_bytes_counter

[metrics] Produce Throughput (records)

  • metric: warpstream_agent_segment_batcher_flush_num_records_counter

  • group_by:

  • type: counter

  • unit: records

  • description: number of records that were produced.

[metrics] Fetch Throughput (uncompressed bytes)

  • metric: warpstream_agent_kafka_fetch_uncompressed_bytes_counter

  • group_by: topic (requires enabling high-cardinality tags)

  • type: counter

  • unit: bytes

  • description: number of uncompressed bytes that were fetched.

[metrics] Fetch Throughput (compressed bytes)

  • metric: warpstream_agent_kafka_fetch_compressed_bytes_counter

  • group_by: topic (requires enabling high-cardinality tags)

  • type: counter

  • unit: bytes

  • description: number of compressed bytes that were fetched.

[metrics] Consumer Groups Lag

  • metric: warpstream_consumer_group_lag

  • group_by: virtual_cluster_id, topic, consumer_group, partition

  • tags: virtual_cluster_id, topic, consumer_group, partition

  • type: gauge

  • unit: Kafka offsets

  • description: consumer group lag measured in offsets.

  • note that this is not a high-cardinality metric, but you can reduce its cardinality by listing any of the tags above that should not be published, using the disableConsumerGroupsMetricsTags flag (see the agent configuration documentation)

[metrics] Kafka Inflight Connections

  • metric: warpstream_agent_kafka_inflight_connections

  • group_by:

  • type: gauge

  • unit: n/a

  • description: number of currently inflight / active connections.

[metrics] Kafka Inflight Requests (per Connection)

  • metric: warpstream_agent_kafka_inflight_request_per_connection

  • group_by:

  • type: histogram

  • unit: n/a

  • description: number of currently in-flight Kafka protocol requests for individual connections.

[metrics] Kafka Request Outcome

  • metric: warpstream_agent_kafka_request_outcome

  • group_by: kafka_key,outcome

  • type: counter

  • unit: n/a

  • description: outcome (success, error, etc) for each Kafka protocol request.
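Grouping this counter by kafka_key and outcome makes it possible to derive an error ratio per request type. The sketch below is one way to do that, assuming you already have per-window counts keyed by (kafka_key, outcome). The outcome value "error" comes from the description above; the kafka_key values used in the example are hypothetical.

```python
from collections import defaultdict


def error_ratio_by_key(counts: dict) -> dict:
    """Return the fraction of requests with outcome "error" for each kafka_key.

    counts maps (kafka_key, outcome) -> number of requests in a time window.
    Outcomes other than "error" count toward the total only.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for (kafka_key, outcome), n in counts.items():
        totals[kafka_key] += n
        if outcome == "error":
            errors[kafka_key] += n
    return {key: errors[key] / total for key, total in totals.items() if total > 0}


# Hypothetical sample: 10 of 100 Produce requests failed, no Fetch request failed.
sample = {("Produce", "success"): 90, ("Produce", "error"): 10, ("Fetch", "success"): 5}
print(error_ratio_by_key(sample))  # {'Produce': 0.1, 'Fetch': 0.0}
```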

[metrics] Kafka Request Latency

  • metric: warpstream_agent_kafka_request_latency

  • group_by: kafka_key

  • type: histogram

  • unit: seconds

  • description: latency for processing each Kafka protocol request.

[metrics] Max Offset

  • metric: warpstream_max_offset

  • group_by: virtual_cluster_id, topic, partition

  • tags: virtual_cluster_id, topic, partition

  • type: gauge

  • unit: Kafka offset

  • description: max offset for every topic-partition. Can be useful for monitoring lag for applications that don't use consumer groups and manage offsets externally.

  • note that this is not a high-cardinality metric, but you can reduce its cardinality by listing any of the tags above that should not be published, using the disableConsumerGroupsMetricsTags flag (see the agent configuration documentation)
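For applications that manage offsets externally, lag can be estimated by subtracting the application's stored offset from the max offset of each topic-partition. The following is a minimal sketch under stated assumptions: the function name and the (topic, partition) key shape are hypothetical, and whether the max offset is the last written offset or the next offset to be written is not specified here, so the result may be off by one.

```python
def external_lag(max_offsets: dict, stored_offsets: dict) -> dict:
    """Estimate per-partition lag as max offset minus the externally stored offset.

    Both dicts are keyed by (topic, partition). Partitions with no stored
    offset are treated as if nothing has been consumed yet (offset 0).
    """
    lag = {}
    for key, max_offset in max_offsets.items():
        stored = stored_offsets.get(key, 0)
        # Clamp at zero in case the stored offset is ahead of a stale gauge sample.
        lag[key] = max(0, max_offset - stored)
    return lag


print(external_lag({("orders", 0): 120, ("orders", 1): 50}, {("orders", 0): 100}))
# {('orders', 0): 20, ('orders', 1): 50}
```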

[metrics] Kafka Topic Count

  • metric: warpstream_topics_count

  • group_by: virtual_cluster_id

  • type: gauge

  • unit: n/a

  • description: how many topics are currently in your cluster

[metrics] Kafka Topic Limit

  • metric: warpstream_topics_count_limit

  • group_by: virtual_cluster_id

  • type: gauge

  • unit: n/a

  • description: how many topics are allowed in your cluster. Upgrade your cluster tier or contact the WarpStream team if more are needed

[metrics] Kafka Partition Count

  • metric: warpstream_partitions_count

  • group_by: virtual_cluster_id

  • type: gauge

  • unit: n/a

  • description: how many partitions are currently in your cluster

[metrics] Kafka Partition Limit

  • metric: warpstream_partitions_count_limit

  • group_by: virtual_cluster_id

  • type: gauge

  • unit: n/a

  • description: how many partitions are allowed in your cluster. Upgrade your cluster tier or contact the WarpStream team if more are needed
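The count and limit gauges above can be combined to alert before a cluster reaches its limit. A minimal sketch, assuming you read the two gauge values yourself; the function names and the 80% threshold are illustrative choices, not WarpStream recommendations.

```python
def limit_utilization(count: float, limit: float) -> float:
    """Return the fraction of the limit currently in use (e.g. topics or partitions)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return count / limit


def near_limit(count: float, limit: float, threshold: float = 0.8) -> bool:
    """True when utilization is at or above the (hypothetical) alert threshold."""
    return limit_utilization(count, limit) >= threshold


# Example: 450 topics against a limit of 500.
print(limit_utilization(450, 500))  # 0.9
print(near_limit(450, 500))         # True
```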

[metrics] Kafka Records Count

  • metric: warpstream_num_records

  • group_by: topic, virtual_cluster_id

  • type: gauge

  • unit: n/a

  • description: how many records are currently in a given topic or cluster

  • note this number might not match the number of active keys if you are using compacted topics

Control Plane

Visualizing WarpStream control plane latency and error rates can be useful for debugging.

[metrics] Operations Outcome

  • metric: warpstream_agent_control_plane_operation_counter

  • group_by: virtual_cluster_id, outcome, operation

  • tags: virtual_cluster_id, outcome, operation

  • type: counter

  • unit: request

  • description: count of RPCs between Agents and control plane, groupable by outcome (success/error) and operation (RPC type).

[metrics] Operations Latency

  • metric: warpstream_agent_control_plane_operation_latency

  • group_by: virtual_cluster_id, outcome, operation

  • tags: virtual_cluster_id, outcome, operation

  • type: histogram

  • unit: seconds

  • description: latency of RPCs between Agents and control plane, groupable by outcome (success/error) and operation (RPC type).

Schema Registry

Metrics and logs associated with the hosted schema registry provide insights into request handling, latency, and throughput.

[metrics] Schema Registry Requests

  • metric: warpstream_agent_schema_registry_outcome

  • group_by: schema_registry_operation,outcome

  • type: counter

  • unit: n/a

  • description: outcome (success, error, etc) for each Schema Registry request.

[metrics] Schema Registry Latency

  • metric: warpstream_agent_schema_registry_request_latency

  • group_by: schema_registry_operation,outcome

  • type: histogram

  • unit: seconds

  • description: latency for processing each Schema Registry request.

[metrics] Schema Registry Inflight Connections

  • metric: warpstream_agent_schema_registry_inflight_connections

  • group_by:

  • type: gauge

  • unit: n/a

  • description: number of currently inflight / active connections.

[logs] Schema Registry Request Logs

  • query: service:sr-agent schema_registry_request

  • group_by: outcome, request_type, schema_id, subject, version

  • description: every schema registry request emits a log with the attributes request_type and outcome. Some requests also emit additional attributes such as schema_id, subject, etc., if applicable. This logging is disabled by default. To enable it, set the command line flag -schemaRegistryEnableLogRequest or the environment variable WARPSTREAM_SCHEMA_REGISTRY_ENABLE_LOG_REQUEST=true

[metrics] Num Invalid Records

  • metric: warpstream_schema_validation_invalid_record_count

  • group_by: topic, reason

  • type: counter

  • unit: n/a

  • description: number of invalid records that the agent detects when schema validation is enabled.

  • note that topic is only available as a tag if high-cardinality tags are enabled.

[metrics] Schema Linking Number of Source subject versions

  • metric: warpstream_schema_linking_source_subject_versions_count

  • group_by: sync_id, config_id

  • type: gauge

  • unit: n/a

  • description: number of source subject versions that Schema Linking is currently managing

[metrics] Schema Linking Number of Newly Migrated Subject Versions

  • metric: warpstream_schema_linking_newly_migrated_subject_versions

  • group_by: sync_id, config_id

  • type: gauge

  • unit: n/a

  • description: number of subject versions newly migrated by the latest sync. This should usually be zero unless new schemas are found. Schemas only count as new for one sync, and the sync frequency is configurable with a default of 5m

[metrics] Schema Versions Count

  • metric: warpstream_schema_versions_count

  • group_by:

  • type: gauge

  • unit: n/a

  • description: number of schemas currently in your registry.

[metrics] Schema Versions Limit

  • metric: warpstream_schema_versions_limit

  • group_by:

  • type: gauge

  • unit: n/a

  • description: number of schemas allowed in your registry.

Background Jobs

The control plane assigns agents background jobs such as compaction or retention. The metrics and logs below cover the efficiency and status of these background operations, with a focus on compaction and the scanning of obsolete files.

[logs] Compactions by Status and Level

  • query: service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info

  • metric: *

  • group_by: status,@stream_job_input.compaction.compaction_level

  • description: number of compactions by compaction level and status (success, error). Occasional compaction errors are normal and expected, but most compactions should succeed.

[metrics] Executed Jobs

  • metric: warpstream_agent_run_and_ack_job_outcome

  • group_by: job_type

  • tags: job_type, outcome

  • type: counter

  • unit: n/a

  • description: outcome (and successful acknowledgement back to the control plane) of all jobs, groupable by job type and outcome. Jobs may fail intermittently, but most jobs should complete successfully.

[metrics] Compaction Files per Level (Indicator of Compaction Lag)

  • metric: warpstream_files_count

  • group_by: compaction_level

  • type: gauge

  • unit: n/a

  • description: number of files in the LSM for each compaction level. The number of L2 files may be high for high volume / long retention workloads, but the number of files at L0 and L1 should always be low (< 1000 each).
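The guidance above (L0 and L1 should each stay below 1000 files) can be turned into a simple check. A minimal sketch: the function name is hypothetical, and the level labels "L0" and "L1" are assumptions about how the compaction_level tag is rendered, so adjust them to the values you actually see.

```python
LOW_LEVEL_FILE_LIMIT = 1000  # from the guidance above: L0 and L1 should each stay below this


def lagging_levels(files_by_level: dict,
                   watched_levels=("L0", "L1"),
                   limit: int = LOW_LEVEL_FILE_LIMIT) -> list:
    """Return the watched compaction levels whose file count is at or above the limit.

    files_by_level maps a compaction level label to its current file count.
    L2 is not watched because a high L2 count can be normal.
    """
    return [level for level in watched_levels if files_by_level.get(level, 0) >= limit]


print(lagging_levels({"L0": 12, "L1": 1500, "L2": 90000}))  # ['L1']
```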

[metrics] Dead Files Scanner: Checked vs Deleted Files

  • metric: warpstream_agent_deadscanner_outcomes

  • group_by: outcome

  • tags: outcome

  • type: counter

  • unit: n/a

  • description: the "deadscanner" is a job type in WarpStream that instructs the Agents to scan the object storage bucket for files that are "dead". A file is considered dead if it exists in the object store, but the WarpStream control plane / metadata store has no record of it, indicating it failed to be committed or was deleted by data expiration / compaction.

[logs] P99 Compaction Duration by Level

  • query: service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info

  • metric: @duration_ms

  • group_by: status,@stream_job_input.compaction.compaction_level

  • description: duration of compactions by compaction level. L1 and L2 compaction duration will vary based on workload, but L0 compactions should always be fast (<20 seconds).

[logs] Compaction File Output Size

  • query: status:info

  • metric: @stream_job_output.compaction.file_metadatas.index_offset

  • group_by: source,@stream_job_input.compaction.compaction_level

  • description: compressed size of files generated by compaction. Useful for understanding the size of files at different levels, but not something that needs to be closely monitored.

Object Storage

Metrics and logs on the performance and usage patterns of object storage operations, offering insights into data retrieval, storage efficiency, and caching mechanisms.

[metrics] S3 Operations (PUT)

  • metric: warpstream_blob_store_operation_latency

  • filter_tag: operation:put_bytes, operation:put_stream

  • group_by: operation

  • type: histogram

  • unit: seconds

  • description: latency to perform PUT requests. Spikes of this value can indicate issues with the underlying object store.

  • note that this metric is a histogram, so although it records latency, you can also count the number of observations emitted to get the number of operations.

[metrics] S3 Operations (GET)

  • metric: warpstream_blob_store_operation_latency

  • filter_tag: operation:get_stream, operation:get_stream_range

  • group_by: operation

  • type: histogram

  • unit: seconds

  • description: time to first byte for GET requests. Spikes of this value can indicate issues with the underlying object store.

  • note that this metric is a histogram, so although it records latency, you can also count the number of observations emitted to get the number of operations.
