Important Metrics and Logs

On this page, we include a sample list of the most important logs and metrics emitted by the Agents.

Before reading this documentation page, please familiarize yourself with how logs and metrics are emitted from the Agents.

Enable High-Cardinality Tags

Some metrics support the use of tags with potentially high cardinality, such as tags based on topics. This feature is disabled by default.

To enable high-cardinality tags:

  • Use the command-line flag -kafkaHighCardinalityMetrics.

  • Alternatively, set the environment variable WARPSTREAM_KAFKA_HIGH_CARDINALITY_METRICS=true.

Tags that require enabling are clearly marked with "(requires enabling high-cardinality tags)" next to their name.
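
If you launch the Agents from a wrapper script, a minimal Python sketch of the environment-variable approach might look like the following. The binary name and arguments are placeholders for your own deployment.

```python
import os
import subprocess

# Enable high-cardinality (e.g., per-topic) metric tags via the documented
# environment variable before starting the Agent. "warpstream" and "agent"
# are placeholders; substitute your actual binary path and arguments.
env = dict(os.environ, WARPSTREAM_KAFKA_HIGH_CARDINALITY_METRICS="true")
subprocess.run(["warpstream", "agent"], env=env, check=True)
```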

Overview

System performance metrics and logs, focusing on error detection and resource consumption.

  • [logs] Error Logs

    • query: status:error

    • metric: *

  • [metrics] Memory usage by host

    • metric: container.memory.usage

    • group_by: host

  • [metrics] Used Cores

    • metric: container.cpu.usage

    • group_by:

  • [metrics] Used Cores By Host

    • metric: container.cpu.usage

    • group_by: host

  • [metrics] Circuit Breaker (multiple metrics)

    • metric 1

      • warpstream_circuit_breaker_count

      • tags: name, state

      • type: counter

      • unit: n/a

    • metric 2

      • warpstream_circuit_breaker_permit

      • tags: name, outcome

      • type: counter

      • unit: n/a

    • metric 3

      • warpstream_circuit_breaker_hits

      • tags: name, outcome

      • type: counter

      • unit: n/a

    • metric 4

      • warpstream_circuit_breaker_state_set

      • tags: name, state

      • type: gauge

      • unit: n/a

      • Note that this metric is similar to warpstream_circuit_breaker_count, but it is a gauge with values of either 0 or 1 rather than a counter. It can be used to determine the current state of each circuit breaker, as in the sketch below.
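
For example, if these metrics are scraped into a Prometheus-compatible store, a sketch like the following can list breakers that are currently open. The Prometheus URL and the "open" state label value are assumptions; check the label values your Agents actually emit.

```python
import requests

PROM_URL = "http://localhost:9090"  # assumption: your Prometheus endpoint

# warpstream_circuit_breaker_state_set is 0 or 1 per (name, state), so
# selecting series equal to 1 for an "open" state (an assumed label value)
# reveals which circuit breakers are currently tripped.
query = 'warpstream_circuit_breaker_state_set{state="open"} == 1'
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(f"circuit breaker {series['metric'].get('name')} is open")
```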

Kafka

Metrics and logs associated with the Kafka protocol provide insights into message handling, latency, and throughput.

  • [metrics] Kafka Inflight Connections

    • metric: warpstream_agent_kafka_inflight_connections

    • group_by:

    • type: gauge

    • unit: n/a

    • description: number of currently inflight / active connections.

  • [metrics] Kafka Inflight Requests (per Connection)

    • metric: warpstream_agent_kafka_inflight_request_per_connection

    • group_by:

    • type: histogram

    • unit: n/a

    • description: number of currently in-flight Kafka protocol requests for individual connections.

  • [metrics] Kafka Latency

    • metric: warpstream_agent_kafka_request_latency

    • group_by: kafka_key

    • type: histogram

    • unit: seconds

    • description: latency for processing each Kafka protocol request.

  • [metrics] Kafka Requests

    • metric: warpstream_agent_kafka_request_outcome

    • group_by: kafka_key, outcome

    • type: counter

    • unit: n/a

    • description: outcome (success, error, etc.) for each Kafka protocol request. Together with the latency histogram above, this supports the error-rate and p99 queries sketched below.
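
Assuming these series land in a Prometheus-compatible store, the latency histogram and outcome counter together support the usual p99 and error-rate queries. The _bucket suffix and the "error" outcome label value below are assumptions based on standard Prometheus conventions:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumption: your Prometheus endpoint

# p99 request latency per Kafka API key ("_bucket" is the standard
# Prometheus histogram suffix; adjust if your pipeline names it differently).
p99 = (
    "histogram_quantile(0.99, sum by (le, kafka_key) "
    "(rate(warpstream_agent_kafka_request_latency_bucket[5m])))"
)

# Error rate per Kafka API key ("error" is an assumed outcome label value).
errors = (
    "sum by (kafka_key) "
    '(rate(warpstream_agent_kafka_request_outcome{outcome="error"}[5m]))'
)

for name, query in (("p99 latency", p99), ("error rate", errors)):
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    print(name, resp.json()["data"]["result"])
```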

  • [metrics] Fetch Max Pointers in a Single Request

    • metric: warpstream_agent_kafka_fetch_num_pointers_distribution

    • group_by:

    • type: histogram

    • unit: n/a

    • description: distribution/histogram of the number of distinct pointers processed in a single fetch request.

  • [metrics] Fetch Partial Bytes Due to Errors

    • metric: warpstream_agent_kafka_fetch_partial_response_error_scenario_num_bytes_distribution

    • group_by: source

    • type: histogram

    • unit: bytes

    • description: number of bytes that were returned from a fetch request as a result of an error that was encountered while trying to process the fetch request to completion. Can be useful for debugging performance issues, but not usually indicative of any issue.

  • [metrics] Fetch Throughput (uncompressed bytes)

    • metric: warpstream_agent_kafka_fetch_uncompressed_bytes_counter

    • group_by: topic (requires enabling high-cardinality tags)

    • type: counter

    • unit: bytes

    • description: number of uncompressed bytes that were fetched.

  • [metrics] Fetch Throughput (compressed bytes)

    • metric: warpstream_agent_kafka_fetch_compressed_bytes_counter

    • group_by: topic (requires enabling high-cardinality tags)

    • type: counter

    • unit: bytes

    • description: number of compressed bytes that were fetched.

  • [metrics] Produce Throughput (uncompressed bytes)

    • metric: warpstream_agent_kafka_produce_uncompressed_bytes_counter

    • group_by: topic (requires enabling high-cardinality tags)

    • type: counter

    • unit: bytes

    • description: number of uncompressed bytes that were produced.

  • [metrics] Produce Throughput (compressed bytes)

    • metric: warpstream_agent_kafka_produce_compressed_bytes_counter

    • group_by: topic (requires enabling high-cardinality tags)

    • type: counter

    • unit: bytes

    • description: number of compressed bytes that were produced.
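
Taken together, the uncompressed and compressed produce counters give the effective compression ratio of incoming data. A minimal sketch with hypothetical counter deltas over the same window:

```python
# Hypothetical deltas over the same time window for the two counters:
# warpstream_agent_kafka_produce_uncompressed_bytes_counter and
# warpstream_agent_kafka_produce_compressed_bytes_counter.
uncompressed_bytes = 12_000_000_000
compressed_bytes = 3_000_000_000

# A ratio of 4.0 means produced data shrinks 4x between clients and storage.
ratio = uncompressed_bytes / compressed_bytes
print(f"produce compression ratio: {ratio:.2f}x")
```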

  • [metrics] Produce Throughput (records)

    • metric: warpstream_agent_segment_batcher_flush_num_records_counter

    • group_by:

    • type: counter

    • unit: records

    • description: number of records that were produced.

  • [metrics] Consumer Groups Lag

    • metric: warpstream_consumer_group_lag

    • group_by: virtual_cluster_id, topic, consumer_group, partition

    • tags: virtual_cluster_id, topic, consumer_group, partition

    • type: gauge

    • unit: Kafka offsets

    • description: consumer group lag measured in offsets.

    • Note that this is not a high-cardinality metric, but you can reduce its cardinality by configuring a list of the tags above not to publish via the disableConsumerGroupsMetricsTags flag (see the Agent configuration documentation).
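
A common use of this gauge is a lag alert per consumer group and topic. A sketch against a Prometheus-compatible store (the URL and threshold are assumptions for illustration):

```python
import requests

PROM_URL = "http://localhost:9090"  # assumption: your Prometheus endpoint
MAX_LAG = 100_000  # assumption: acceptable lag in offsets for this workload

# Worst lag per (consumer_group, topic); anything returned exceeds the bound.
query = f"max by (consumer_group, topic) (warpstream_consumer_group_lag) > {MAX_LAG}"
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    labels, (_, lag) = series["metric"], series["value"]
    print(f"{labels.get('consumer_group')} on {labels.get('topic')}: lag={lag}")
```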

  • [metrics] Max Offset

    • metric: warpstream_max_offset

    • group_by: virtual_cluster_id, topic, partition

    • tags: virtual_cluster_id, topic, partition

    • type: gauge

    • unit: Kafka offset

    • description: max offset for every topic-partition. Can be useful for monitoring lag for applications that don't use consumer groups and manage offsets externally.

    • Note that this is not a high-cardinality metric, but you can reduce its cardinality by configuring a list of the tags above not to publish via the disableConsumerGroupsMetricsTags flag (see the Agent configuration documentation).
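
For applications that manage offsets externally, lag can be derived by subtracting the externally committed offset from this gauge. A minimal sketch; the Prometheus URL, topic name, partition, and committed offsets are all hypothetical:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumption: your Prometheus endpoint

# Offsets your application tracks outside of consumer groups (hypothetical).
committed = {("payments", 0): 4_200_000}

query = 'warpstream_max_offset{topic="payments",partition="0"}'
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    topic = series["metric"]["topic"]
    partition = int(series["metric"]["partition"])
    max_offset = float(series["value"][1])
    lag = max_offset - committed[(topic, partition)]
    print(f"{topic}[{partition}]: lag of {lag:.0f} offsets")
```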

Schema Registry

Metrics and logs associated with the hosted schema registry provide insights into request handling, latency, and throughput.

  • [metrics] Schema Registry Inflight Connections

    • metric: warpstream_agent_schema_registry_inflight_connections

    • group_by:

    • type: gauge

    • unit: n/a

    • description: number of currently inflight / active connections.

  • [metrics] Schema Registry Latency

    • metric: warpstream_agent_schema_registry_request_latency

    • group_by: schema_registry_operation, outcome

    • type: histogram

    • unit: seconds

    • description: latency for processing each Schema Registry request.

  • [metrics] Schema Registry Requests

    • metric: warpstream_agent_schema_registry_outcome

    • group_by: schema_registry_operation, outcome

    • type: counter

    • unit: n/a

    • description: outcome (success, error, etc.) for each Schema Registry request.

  • [metrics] Request Throughput

    • metric: warpstream_agent_schema_registry_request_bytes_counter

    • group_by: schema_registry_operation

    • type: counter

    • unit: bytes

    • description: number of bytes of incoming requests.

  • [metrics] Response Throughput

    • metric: warpstream_agent_schema_registry_response_bytes_counter

    • group_by: schema_registry_operation, outcome

    • type: counter

    • unit: bytes

    • description: number of bytes of outgoing responses.

Background Jobs

Metrics and logs on efficiency and status of background operations, with a focus on compaction processes and the scanning of obsolete files.

  • [logs] Compaction File Output Size

    • query: status:info

    • metric: @stream_job_output.compaction.file_metadatas.index_offset

    • group_by: source, @stream_job_input.compaction.compaction_level

    • description: compressed size of files generated by compaction. Useful for understanding the size of files at different levels, but not something that needs to be closely monitored.

  • [logs] Compactions by Status and Level

    • query: service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info

    • metric: *

    • group_by: status, @stream_job_input.compaction.compaction_level

    • description: number of compactions by compaction level and status (success, error). Occasional compaction errors are normal and expected, but most compactions should succeed.

  • [metrics] Compaction Files per Level (Indicator of Compaction Lag)

    • metric: warpstream_files_count

    • group_by: compaction_level

    • type: gauge

    • unit: n/a

    • description: number of files in the LSM for each compaction level. The number of L2 files may be high for high volume / long retention workloads, but the number of files at L0 and L1 should always be low (< 1000 each).
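
The documented bounds translate directly into a simple alert condition. A sketch with hypothetical per-level values of warpstream_files_count (the compaction_level label values are assumptions; check what your Agents emit):

```python
# Hypothetical current values of warpstream_files_count per compaction level.
files_by_level = {"0": 120, "1": 850, "2": 48_000}

# L2 can legitimately be large for high-volume / long-retention workloads,
# but L0 and L1 should stay small; alert when they approach the bound.
for level in ("0", "1"):
    if files_by_level[level] >= 1000:
        print(f"compaction level {level} lagging: {files_by_level[level]} files")
```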

  • [logs] P99 Compaction Duration by Level

    • query: service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info

    • metric: @duration_ms

    • group_by: status, @stream_job_input.compaction.compaction_level

    • description: duration of compactions by compaction level. L1 and L2 compaction duration will vary based on workload, but L0 compactions should always be fast (<20 seconds).

  • [metrics] Dead Files Scanner: Checked vs Deleted Files

    • metric: warpstream_agent_deadscanner_outcomes

    • group_by: outcome

    • tags: outcome

    • type: counter

    • unit: n/a

    • description: the "deadscanner" is a job type in WarpStream that instructs the Agents to scan the object storage bucket for files that are "dead". A file is considered dead if it exists in the object store, but the WarpStream control plane / metadata store has no record of it, indicating it failed to be committed or was deleted by data expiration / compaction.

  • [metrics] Executed Jobs

    • metric: warpstream_agent_run_and_ack_job_outcome

    • group_by: job_type

      • tags: job_type, outcome

    • type: counter

    • unit: n/a

    • description: outcome (and successful acknowledgement back to the control plane) of all jobs, groupable by job type and outcome. Jobs may fail intermittently, but most jobs should complete successfully.

Object Storage

Metrics and logs on object storage operations' performance and usage patterns, offering insights into data retrieval, storage efficiency, and caching mechanisms.

  • [metrics] S3 Operations (GET)

    • metric: warpstream_blob_store_operation_latency

    • filter_tag: operation:get_stream, operation:get_stream_range

    • group_by: operation

    • type: histogram

    • unit: seconds

    • description: latency to first byte for GET requests. Spikes in this value can indicate issues with the underlying object store.

    • Note: this metric is a histogram, so although it measures latency, you can count the number of observations it emits to derive the number of operations.

  • [metrics] S3 Operations (PUT)

    • metric: warpstream_blob_store_operation_latency

    • filter_tag: operation:put_bytes, operation:put_stream

    • group_by: operation

    • type: histogram

    • unit: seconds

    • description: latency to perform PUT requests. Spikes in this value can indicate issues with the underlying object store.

    • Note: this metric is a histogram, so although it measures latency, you can count the number of observations it emits to derive the number of operations, as in the sketch below.
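
Concretely, with standard Prometheus histogram naming, the histogram's observation count gives you the request rate per operation. A sketch (the _count suffix and URL are assumptions based on Prometheus conventions):

```python
import requests

PROM_URL = "http://localhost:9090"  # assumption: your Prometheus endpoint

# GET/PUT requests per second to the object store, per operation, derived
# from the latency histogram's observation count ("_count" series).
query = (
    "sum by (operation) "
    "(rate(warpstream_blob_store_operation_latency_count[5m]))"
)
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"]["operation"], "ops/sec:", series["value"][1])
```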

  • [metrics] Bytes copied into cache

    • metric: warpstream_agent_file_cache_server_get_range_copy_chunk_num_bytes_copied

    • group_by: host

    • tags: outcome

    • type: counter

    • unit: bytes

    • description: number of bytes that have been read out of the object store and copied into the Agent's in-memory, distributed file cache.

  • [metrics] Cache Size (Bytes)

    • metric: warpstream_agent_file_cache_server_chunk_cache_curr_size_bytes

    • group_by: host

    • type: gauge

    • unit: bytes

    • description: size of the file cache in bytes.

  • [metrics] Cache Size (Entries)

    • metric: warpstream_agent_file_cache_server_chunk_cache_num_entries

    • group_by: host

    • type: gauge

    • unit: n/a

    • description: number of entries in the file cache.

  • [metrics] Direct or Remote Loads

    • metric: warpstream_agent_file_cache_client_fetch_local_or_remote_counter

    • group_by: source

    • tags: outcome, source

    • type: counter

    • unit: n/a

    • description: counter indicating how the file cache decided to load data. The remote and remote_batched sources indicate that the data to be fetched was small enough that the file cache requested it from the Agent responsible for the requested byte range; direct means the data was large enough that the file cache loaded it directly from object storage instead (an optimization that saves networking and CPU for large reads that are already cost-effective). The sketch below shows how to read this mix.
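
To read the mix of load strategies, compare counter deltas per source over the same window. A minimal sketch with hypothetical values:

```python
# Hypothetical deltas of
# warpstream_agent_file_cache_client_fetch_local_or_remote_counter by source.
loads = {"direct": 1_200, "remote": 45_000, "remote_batched": 9_000}

# A growing "direct" share usually just reflects larger reads, since large
# fetches bypass the responsible Agent and go straight to object storage.
total = sum(loads.values())
for source, count in loads.items():
    print(f"{source}: {count / total:.1%} of file cache loads")
```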

  • [metrics] Fetch Pointers

    • metric: warpstream_agent_kafka_fetch_num_pointers_counter

    • group_by:

    • type: counter

    • unit: n/a

    • description: number of pointers required to serve fetch requests. There is no "right" value for this metric, but large increases in this value indicate that serving consumers is more difficult for the Agents and is usually caused by compaction lag (particularly L0/L1 lag), or an increase in the number of partitions that are being actively written to / queried.

  • [metrics] File Cache Bytes Transferred (Server)

    • metric: warpstream_agent_file_cache_server_get_stream_range_num_bytes_count

    • group_by: outcome, host

    • type: counter

    • unit: bytes

    • description: number of bytes served by the Agent's distributed file cache server. This effectively indicates the amount of inter-Agent communication happening (within an availability zone) to serve the distributed file cache that powers the fetch path.

  • [metrics] File Cache Latency (Client)

    • metric: warpstream_agent_file_cache_client_get_stream_range_latency

    • group_by: outcome

    • type: histogram

    • unit: seconds

    • description: observed latency when an Agent acting as a client requests data in the distributed file cache from another Agent acting as a server.

  • [metrics] File Cache Latency (Server)

    • metric: warpstream_agent_file_cache_server_get_stream_range_latency

    • group_by: outcome, host

    • type: histogram

    • unit: seconds

    • description: observed latency for serving data in the distributed file cache as measured by the Agent acting as the server.

  • [metrics] File Cache Outcomes (Server)

    • metric: warpstream_agent_file_cache_server_get_stream_range_outcome

    • group_by: outcome, host

    • type: counter

    • unit: n/a

    • description: outcome of distributed file cache operation (success, error, etc) as measured by the Agent acting as the server.

  • [metrics] File Cache Per Request Bytes Read Average (Server)

    • metric: warpstream_agent_file_cache_server_get_stream_range_num_bytes_distribution

    • group_by: outcome

    • type: histogram

    • unit: bytes

    • description: distribution / histogram of the number of bytes requested in individual RPCs to the distributed file cache as measured by the Agent acting as the server.

  • [metrics] Num Bytes Fetched by Size

    • metric: warpstream_agent_file_cache_server_fetch_size_num_bytes_counter

    • group_by: fetch_size

    • type: counter

    • unit: bytes

    • description: number of bytes requested from the underlying object store (via GET requests) by the distributed file cache, groupable by fetch_size, which should usually be 1 MiB or 4 MiB.

  • [metrics] Num Fetches by Size

    • metric: warpstream_agent_file_cache_server_fetch_size_counter

    • group_by: fetch_size

    • type: counter

    • unit: n/a

    • description: counter of the number of GET requests issued to the underlying object store by the distributed file cache, groupable by fetch_size, which should usually be 1 MiB or 4 MiB.

    • Note that this is a sister metric to warpstream_agent_file_cache_server_fetch_size_num_bytes_counter: this metric counts the number of times that bytes counter is incremented. A sketch combining the two follows below.
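
Dividing the bytes counter by its sister counter yields the average GET request size per fetch_size class. A minimal sketch with hypothetical deltas (the fetch_size label values are assumptions):

```python
# Hypothetical deltas over the same window for the two sister counters,
# keyed by the fetch_size tag.
bytes_fetched = {"1MiB": 9_000_000_000, "4MiB": 64_000_000_000}
num_fetches = {"1MiB": 9_500, "4MiB": 17_000}

# Average bytes per GET request issued by the distributed file cache; this
# can sit below the nominal fetch size when ranges end mid-chunk.
for size, total_bytes in bytes_fetched.items():
    avg_mib = total_bytes / num_fetches[size] / (1 << 20)
    print(f"fetch_size={size}: average GET of {avg_mib:.1f} MiB")
```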

Schema Validation

  • [metrics] Num Invalid Records

    • metric: schema_registry_validation_invalid_record

    • group_by: subject, topic, reason

    • type: counter

    • unit: n/a

    • description: counter of the number of invalid records that the Agent detects when schema validation is enabled.
