Important Metrics and Logs
On this page, we include a list of the most important logs and metrics emitted by the Agents.
Before reading this documentation page, please familiarize yourself with how logs and metrics are emitted from the Agents.
Enable High-Cardinality Tags
Some metrics support the use of tags with potentially high cardinality, such as tags based on topics. This feature is disabled by default.
To enable high-cardinality tags:
Use the command-line flag -kafkaHighCardinalityMetrics. Alternatively, set the environment variable WARPSTREAM_KAFKA_HIGH_CARDINALITY_METRICS=true.
Tags that require enabling are clearly marked with "(requires enabling high-cardinality tags)" next to their name.
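As a sketch (assuming the Agent is started via the warpstream agent command; the trailing "..." stands in for your existing configuration flags), either of the following enables the feature:

```shell
# Option 1: pass the command-line flag when starting the Agent
warpstream agent -kafkaHighCardinalityMetrics ...

# Option 2: set the environment variable instead
WARPSTREAM_KAFKA_HIGH_CARDINALITY_METRICS=true warpstream agent ...
```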
Overview
System performance metrics and logs, focusing on error detection and resource consumption.
[logs] Error Logs
query:
status:error
metric: *
[metrics] Memory usage by host
metric:
container.memory.usage
group_by:
host
[metrics] Used Cores
metric:
container.cpu.usage
group_by:
[metrics] Used Cores By Host
metric:
container.cpu.usage
group_by:
host
[metrics] Circuit Breaker (multiple metrics)
metric 1
warpstream_circuit_breaker_count
tags:
name
state
type: counter
unit: n/a
metric 2
warpstream_circuit_breaker_permit
tags:
name
outcome
type: counter
unit: n/a
metric 3
warpstream_circuit_breaker_hits
tags:
name
outcome
type: counter
unit: n/a
metric 4
warpstream_circuit_breaker_state_set
tags:
name
state
type: gauge
unit: n/a
Note that this metric is similar to
warpstream_circuit_breaker_count
, but is a gauge with values of either 0 or 1 instead of a counter. It can be used to determine the current state of the circuit breaker.
Kafka
Metrics and logs associated with the Kafka protocol provide insights into message handling, latency, and throughput.
[metrics] Kafka Inflight Connections
metric:
warpstream_agent_kafka_inflight_connections
group_by:
type: gauge
unit: n/a
description: number of currently inflight / active connections.
[metrics] Kafka Inflight Requests (per Connection)
metric:
warpstream_agent_kafka_inflight_request_per_connection
group_by:
type: histogram
unit: n/a
description: number of currently in-flight Kafka protocol requests for individual connections.
[metrics] Kafka Latency
metric:
warpstream_agent_kafka_request_latency
group_by:
kafka_key
type: histogram
unit: seconds
description: latency for processing each Kafka protocol request.
[metrics] Kafka Requests
metric:
warpstream_agent_kafka_request_outcome
group_by:
kafka_key,outcome
type: counter
unit: n/a
description: outcome (success, error, etc) for each Kafka protocol request.
[metrics] Fetch Max Pointers in a Single Request
metric:
warpstream_agent_kafka_fetch_num_pointers_distribution
group_by:
type: histogram
unit: n/a
description: distribution/histogram of number of different pointers to be processed in a single fetch request.
[metrics] Fetch Partial Bytes Due to Errors
metric:
warpstream_agent_kafka_fetch_partial_response_error_scenario_num_bytes_distribution
group_by:
source
type: histogram
unit: bytes
description: number of bytes that were returned from a fetch request as a result of an error that was encountered while trying to process the fetch request to completion. Can be useful for debugging performance issues, but not usually indicative of any issue.
[metrics] Fetch Throughput (uncompressed bytes)
metric:
warpstream_agent_kafka_fetch_uncompressed_bytes_counter
group_by:
topic (requires enabling high-cardinality tags)
type: counter
unit: bytes
description: number of uncompressed bytes that were fetched.
[metrics] Fetch Throughput (compressed bytes)
metric:
warpstream_agent_kafka_fetch_compressed_bytes_counter
group_by:
topic (requires enabling high-cardinality tags)
type: counter
unit: bytes
description: number of compressed bytes that were fetched.
[metrics] Produce Throughput (uncompressed bytes)
metric:
warpstream_agent_kafka_produce_uncompressed_bytes_counter
group_by:
topic (requires enabling high-cardinality tags)
type: counter
unit: bytes
description: number of uncompressed bytes that were produced.
[metrics] Produce Throughput (compressed bytes)
metric:
warpstream_agent_kafka_produce_compressed_bytes_counter
group_by:
topic (requires enabling high-cardinality tags)
type: counter
unit: bytes
description: number of compressed bytes that were produced.
[metrics] Produce Throughput (records)
metric:
warpstream_agent_segment_batcher_flush_num_records_counter
group_by:
type: counter
unit: n/a
description: number of records that were produced.
[metrics] Consumer Groups Lag
metric:
warpstream_consumer_group_lag
group_by:
virtual_cluster_id
,topic
,consumer_group
,partition
tags:
virtual_cluster_id
,topic
,consumer_group
,partition
type: gauge
unit: Kafka offsets
description: consumer group lag measured in offsets.
Note that this is not a high-cardinality metric, but you can decrease its cardinality by configuring a list of the tags above not to publish via the
disableConsumerGroupsMetricsTags
flag (see the Agent configuration documentation).
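This metric corresponds to the standard Kafka notion of consumer lag. As a hypothetical illustration (the function and offsets below are made up for this sketch, not part of WarpStream), per-partition lag is the distance between the partition's max offset and the group's committed offset:

```python
def consumer_group_lag(max_offsets, committed_offsets):
    """Per-partition lag: max offset minus the group's committed offset.

    Both arguments map partition -> offset; partitions the group has never
    committed an offset for are treated as fully lagged (committed offset 0).
    """
    return {
        partition: max_offset - committed_offsets.get(partition, 0)
        for partition, max_offset in max_offsets.items()
    }

# Hypothetical example: partition 0 is caught up, partition 1 is 150 behind.
lags = consumer_group_lag({0: 1000, 1: 500}, {0: 1000, 1: 350})
print(lags)  # {0: 0, 1: 150}
```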
[metrics] Max Offset
metric:
warpstream_max_offset
group_by:
virtual_cluster_id
,topic
,partition
tags:
virtual_cluster_id
,topic
,partition
type: gauge
unit: Kafka offset
description: max offset for every topic-partition. Can be useful for monitoring lag for applications that don't use consumer groups and manage offsets externally.
Note that this is not a high-cardinality metric, but you can decrease its cardinality by configuring a list of the tags above not to publish via the
disableConsumerGroupsMetricsTags
flag (see the Agent configuration documentation).
Background Jobs
Metrics and logs on efficiency and status of background operations, with a focus on compaction processes and the scanning of obsolete files.
[logs] Compaction File Output Size
query:
status:info
metric:
@stream_job_output.compaction.file_metadatas.index_offset
group_by:
source
,@stream_job_input.compaction.compaction_level
description: compressed size of files generated by compaction. Useful for understanding the size of different files at different levels, but not something that needs to be closely monitored or paid attention to.
[logs] Compactions by Status and Level
query:
service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info
metric: *
group_by:
status
,@stream_job_input.compaction.compaction_level
description: number of compactions by compaction level and status (success, error). Occasional compaction errors are normal and expected, but most compactions should succeed.
[metrics] Compaction Files per Level (Indicator of Compaction Lag)
metric:
warpstream_files_count
group_by:
compaction_level
type: gauge
unit: n/a
description: number of files in the LSM for each compaction level. The number of L2 files may be high for high volume / long retention workloads, but the number of files at L0 and L1 should always be low (< 1000 each).
[logs] P99 Compaction Duration by Level
query:
service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info
metric:
@duration_ms
group_by:
status
,@stream_job_input.compaction.compaction_level
description: duration of compactions by compaction level. L1 and L2 compaction duration will vary based on workload, but L0 compactions should always be fast (<20 seconds).
[metrics] Dead Files Scanner: Checked vs Deleted Files
metric:
warpstream_agent_deadscanner_outcomes
group_by:
outcome
tags:
outcome
type: counter
unit: n/a
description: the "deadscanner" is a job type in WarpStream that instructs the Agents to scan the object storage bucket for files that are "dead". A file is considered dead if it exists in the object store, but the WarpStream control plane / metadata store has no record of it, indicating it failed to be committed or was deleted by data expiration / compaction.
[metrics] Executed Jobs
metric:
warpstream_agent_run_and_ack_job_outcome
group_by:
job_type
tags:
job_type
outcome
type: counter
unit: n/a
description: outcome (and successful acknowledgement back to the control plane) of all jobs, groupable by job type and outcome. Jobs may fail intermittently, but most jobs should complete successfully.
Object Storage
Metrics and logs on object storage operations' performance and usage patterns, offering insights into data retrieval, storage efficiency, and caching mechanisms.
[metrics] S3 Operations (GET)
metric:
warpstream_blob_store_operation_latency
filter_tag:
operation:get_stream
,operation:get_stream_range
group_by: operation
type: histogram
unit: seconds
description: latency for time to first byte for GET requests. Spikes of this value can indicate issues with the underlying object store.
Note: this metric is a histogram, so even though it measures latency, you can count the number of samples it records to derive the number of operations.
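For example, if these metrics are exported to a Prometheus-compatible backend (an assumption; series naming depends on your metrics pipeline), the histogram's sample count can be converted into an operation rate:

```
# Assumes Prometheus-style histogram naming (_count suffix); adjust for your backend.
sum by (operation) (rate(warpstream_blob_store_operation_latency_count[5m]))
```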
[metrics] S3 Operations (PUT)
metric:
warpstream_blob_store_operation_latency
filter_tag:
operation:put_bytes
,operation:put_stream
group_by: operation
type: histogram
unit: seconds
description: latency to perform PUT requests. Spikes of this value can indicate issues with the underlying object store.
Note: this metric is a histogram, so even though it measures latency, you can count the number of samples it records to derive the number of operations.
[metrics] Bytes copied into cache
metric:
warpstream_agent_file_cache_server_get_range_copy_chunk_num_bytes_copied
group_by:
host
tags:
outcome
type: counter
unit: bytes
description: number of bytes that have been read out of the object store and copied into the Agent's in-memory, distributed file cache.
[metrics] Cache Size (Bytes)
metric:
warpstream_agent_file_cache_server_chunk_cache_curr_size_bytes
group_by:
host
type: gauge
unit: bytes
description: size of the file cache in bytes.
[metrics] Cache Size (Entries)
metric:
warpstream_agent_file_cache_server_chunk_cache_num_entries
group_by:
host
type: gauge
unit: n/a
description: number of entries in the file cache.
[metrics] Direct or Remote Loads
metric:
warpstream_agent_file_cache_client_fetch_local_or_remote_counter
group_by:
source
tags:
outcome
source
type: counter
unit: n/a
description: counter indicating how the file cache decided to load data.
remote
and
remote_batched
indicate that the data to be fetched was small enough that the file cache decided to request the data from the Agent that is responsible for the requested byte range.
direct
means the data to be fetched was large enough that the file cache decided to load the data directly from object storage instead of requesting it from the Agent responsible for that byte range (this is an optimization that saves some networking and CPU for large reads that are already cost-effective).
[metrics] Fetch Pointers
metric:
warpstream_agent_kafka_fetch_num_pointers_counter
group_by:
type: counter
unit: n/a
description: number of pointers required to serve fetch requests. There is no "right" value for this metric, but large increases in this value indicate that serving consumers is more difficult for the Agents and is usually caused by compaction lag (particularly L0/L1 lag), or an increase in the number of partitions that are being actively written to / queried.
[metrics] File Cache Bytes Transferred (Server)
metric:
warpstream_agent_file_cache_server_get_stream_range_num_bytes_count
group_by:
outcome
,host
type: counter
unit: bytes
description: number of bytes served by the Agent's distributed file cache server. This effectively indicates the amount of inter-Agent communication that is happening (within an availability zone) to power the distributed file cache that powers the fetch path.
[metrics] File Cache Latency (Client)
metric:
warpstream_agent_file_cache_client_get_stream_range_latency
group_by:
outcome
type: histogram
unit: seconds
description: observed latency for requesting data in the distributed file cache, as measured by an Agent acting as a client requesting data from another Agent acting as the server.
[metrics] File Cache Latency (Server)
metric:
warpstream_agent_file_cache_server_get_stream_range_latency
group_by:
outcome
,host
type: histogram
unit: seconds
description: observed latency for serving data in the distributed file cache as measured by the Agent acting as the server.
[metrics] File Cache Outcomes (Server)
metric:
warpstream_agent_file_cache_server_get_stream_range_outcome
group_by:
outcome
,host
type: counter
unit: n/a
description: outcome of distributed file cache operation (success, error, etc) as measured by the Agent acting as the server.
[metrics] File Cache Per Request Bytes Read Average (Server)
metric:
warpstream_agent_file_cache_server_get_stream_range_num_bytes_distribution
group_by:
outcome
type: histogram
unit: bytes
description: distribution / histogram of the number of bytes requested in individual RPCs to the distributed file cache as measured by the Agent acting as the server.
[metrics] Num Bytes Fetched by Size
metric:
warpstream_agent_file_cache_server_fetch_size_num_bytes_counter
group_by:
fetch_size
type: counter
unit: bytes
description: number of bytes requested from the underlying object store (via GET requests) by the distributed file cache, groupable by
fetch_size
, which should usually be one of 1 MiB or 4 MiB.
[metrics] Num Fetches by Size
metric:
warpstream_agent_file_cache_server_fetch_size_counter
group_by:
fetch_size
type: counter
unit: n/a
description: counter of the number of GET requests issued to the underlying object store by the distributed file cache, groupable by
fetch_size
, which should usually be one of 1 MiB or 4 MiB. Note that this is a sister metric to
warpstream_agent_file_cache_server_fetch_size_num_bytes_counter
: this metric counts the number of times the bytes counter is incremented.
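Taken together, these two counters let you derive the average GET request size issued by the file cache. A hypothetical sketch (the numbers are made up for illustration):

```python
def average_fetch_size(num_bytes_fetched, num_fetches):
    """Average object store GET size: total bytes fetched / number of GETs."""
    if num_fetches == 0:
        return 0.0
    return num_bytes_fetched / num_fetches

# Hypothetical: 400 MiB fetched across 100 GET requests -> 4 MiB average.
avg = average_fetch_size(400 * 1024 * 1024, 100)
```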
Schema Validation
[metrics] Num Invalid Records
metric:
schema_registry_validation_invalid_record
group_by:
subject
,topic
,reason
type: counter
unit: n/a
description: counter of the number of invalid records that the Agent detects when schema validation is enabled.