Important Metrics and Logs
On this page, we include a sample list of the most important logs and metrics emitted by the Agents.
Before reading this documentation page, please familiarize yourself with how logs and metrics are emitted from the Agents.
Enable High-Cardinality Tags
Some metrics support the use of tags with potentially high cardinality, such as tags based on topics. This feature is disabled by default.
To enable high-cardinality tags:
Use the command-line flag -kafkaHighCardinalityMetrics.
Alternatively, set the environment variable WARPSTREAM_KAFKA_HIGH_CARDINALITY_METRICS=true.
Tags that require enabling are clearly marked with "(requires enabling high-cardinality tags)" next to their name.
Furthermore, per-topic distribution / histogram metrics have 10x to 20x higher cardinality than even the regular high-cardinality metrics. As a result, high-cardinality (per-topic) metrics of type histogram are not emitted with per-topic tags unless the -kafkaHighCardinalityDistributionMetrics flag or the WARPSTREAM_KAFKA_HIGH_CARDINALITY_DISTRIBUTION_METRICS environment variable is additionally set to true.
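As an illustrative sketch, assuming the Agent is started via the warpstream binary (the invocation details are placeholders; adapt them to your own deployment), either form below enables the per-topic tags:

```
# Enable per-topic (high-cardinality) metric tags via the command-line flag:
warpstream agent -kafkaHighCardinalityMetrics

# Or via the environment variable, e.g. in a container spec:
WARPSTREAM_KAFKA_HIGH_CARDINALITY_METRICS=true warpstream agent

# Per-topic histogram / distribution metrics additionally require:
warpstream agent -kafkaHighCardinalityMetrics -kafkaHighCardinalityDistributionMetrics
```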
Overview
System performance metrics and logs, focusing on error detection and resource consumption.
[logs] Error Logs
query: status:error
metric: *
note that some amount of error logs can be normal depending on the situation. Please contact the WarpStream team if you think a particular error log is too noisy!
[metrics] Memory usage by host
metric: container.memory.usage
group_by: host
[metrics] Used Cores
metric: container.cpu.usage
group_by:
[metrics] Used Cores By Host
metric: container.cpu.usage
group_by: host
Kafka
Metrics and logs associated with the Kafka protocol provide insights into message handling, latency, and throughput.
[metrics] Produce Throughput (uncompressed bytes)
metric: warpstream_agent_kafka_produce_uncompressed_bytes_counter
group_by: topic (requires enabling high-cardinality tags)
type: counter
unit: bytes
description: number of uncompressed bytes that were produced.
[metrics] Produce Throughput (compressed bytes)
metric: warpstream_agent_kafka_produce_compressed_bytes_counter
[metrics] Produce Throughput (records)
metric: warpstream_agent_segment_batcher_flush_num_records_counter
group_by:
type: counter
unit: records
description: number of records that were produced.
[metrics] Fetch Throughput (uncompressed bytes)
metric: warpstream_agent_kafka_fetch_uncompressed_bytes_counter
group_by: topic (requires enabling high-cardinality tags)
type: counter
unit: bytes
description: number of uncompressed bytes that were fetched.
[metrics] Fetch Throughput (compressed bytes)
metric: warpstream_agent_kafka_fetch_compressed_bytes_counter
group_by: topic (requires enabling high-cardinality tags)
type: counter
unit: bytes
description: number of compressed bytes that were fetched.
[metrics] Consumer Groups Lag
metric: warpstream_consumer_group_lag
group_by: virtual_cluster_id,topic,consumer_group,partition
tags: virtual_cluster_id,topic,consumer_group,partition
type: gauge
unit: Kafka offsets
description: consumer group lag measured in offsets.
note that this is not a high-cardinality metric, but you can decrease its cardinality by configuring a list of the tags above not to publish via the disableConsumerGroupsMetricsTags flag (see the agent configuration documentation)
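As a hedged sketch, the flag name comes from this page, but the value format shown below is an assumption; check the agent configuration documentation for the exact syntax:

```
# Hypothetical example: stop publishing the partition tag on
# warpstream_consumer_group_lag to reduce its cardinality.
# The comma-separated value format is an assumption, not a confirmed syntax.
warpstream agent -disableConsumerGroupsMetricsTags=partition
```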
[metrics] Kafka Inflight Connections
metric: warpstream_agent_kafka_inflight_connections
group_by:
type: gauge
unit: n/a
description: number of currently inflight / active connections.
[metrics] Kafka Inflight Requests (per Connection)
metric: warpstream_agent_kafka_inflight_request_per_connection
group_by:
type: histogram
unit: n/a
description: number of currently in-flight Kafka protocol requests for individual connections.
[metrics] Kafka Request Outcome
metric: warpstream_agent_kafka_request_outcome
group_by: kafka_key,outcome
type: counter
unit: n/a
description: outcome (success, error, etc) for each Kafka protocol request.
[metrics] Kafka Request Latency
metric: warpstream_agent_kafka_request_latency
group_by: kafka_key
type: histogram
unit: seconds
description: latency for processing each Kafka protocol request.
[metrics] Max Offset
metric: warpstream_max_offset
group_by: virtual_cluster_id,topic,partition
tags: virtual_cluster_id,topic,partition
type: gauge
unit: Kafka offset
description: max offset for every topic-partition. Can be useful for monitoring lag for applications that don't use consumer groups and manage offsets externally.
note that this is not a high-cardinality metric, but you can decrease its cardinality by configuring a list of the tags above not to publish via the disableConsumerGroupsMetricsTags flag (see the agent configuration documentation)
[metrics] Kafka Topic Count
metric: warpstream_topics_count
group_by: virtual_cluster_id
type: gauge
unit: n/a
description: how many topics are currently in your cluster
[metrics] Kafka Topic Limit
metric: warpstream_topics_count_limit
group_by: virtual_cluster_id
type: gauge
unit: n/a
description: how many topics are allowed in your cluster. Upgrade your cluster tier or contact the WarpStream team if more are needed
[metrics] Kafka Partition Count
metric: warpstream_partitions_count
group_by: virtual_cluster_id
type: gauge
unit: n/a
description: how many partitions are currently in your cluster
[metrics] Kafka Partition Limit
metric: warpstream_partitions_count_limit
group_by: virtual_cluster_id
type: gauge
unit: n/a
description: how many partitions are allowed in your cluster. Upgrade your cluster tier or contact the WarpStream team if more are needed
[metrics] Kafka Records Count
metric: warpstream_num_records
group_by: topic,virtual_cluster_id
type: gauge
unit: n/a
description: how many records are currently in a given topic or cluster
note this number might not match the number of active keys if you are using compacted topics
Control Plane
Visualizing WarpStream control plane latency and error rates can be useful for debugging.
[metrics] Operations Outcome
metric: warpstream_agent_control_plane_operation_counter
group_by: virtual_cluster_id,outcome,operation
tags: virtual_cluster_id,outcome,operation
type: counter
unit: request
description: count of RPCs between Agents and control plane, groupable by outcome (success/error) and operation (RPC type).
[metrics] Operations Latency
metric: warpstream_agent_control_plane_operation_latency
group_by: virtual_cluster_id,outcome,operation
tags: virtual_cluster_id,outcome,operation
type: histogram
unit: seconds
description: latency of RPCs between Agents and control plane, groupable by outcome (success/error) and operation (RPC type).
Schema Registry
Metrics and logs associated with the hosted schema registry provide insights into request handling, latency, and throughput.
Enable Schema Registry Request Logs
Logging schema registry requests is disabled by default.
To enable request logs:
Use the command-line flag -schemaRegistryEnableLogRequest.
Alternatively, set the environment variable WARPSTREAM_SCHEMA_REGISTRY_ENABLE_LOG_REQUEST=true.
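As a minimal sketch (again assuming the warpstream agent invocation shown earlier; adapt it to your deployment):

```
# Enable schema registry request logging via the command-line flag:
warpstream agent -schemaRegistryEnableLogRequest

# Or via the environment variable:
WARPSTREAM_SCHEMA_REGISTRY_ENABLE_LOG_REQUEST=true warpstream agent
```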
[metrics] Schema Registry Requests
metric: warpstream_agent_schema_registry_outcome
group_by: schema_registry_operation,outcome
type: counter
unit: n/a
description: outcome (success, error, etc) for each Schema Registry request.
[metrics] Schema Registry Latency
metric: warpstream_agent_schema_registry_request_latency
group_by: schema_registry_operation,outcome
type: histogram
unit: seconds
description: latency for processing each Schema Registry request.
[metrics] Schema Registry Inflight Connections
metric: warpstream_agent_schema_registry_inflight_connections
group_by:
type: gauge
unit: n/a
description: number of currently inflight / active connections.
[logs] Schema Registry Request Logs
query: service:sr-agent schema_registry_request
group_by: outcome,request_type,schema_id,subject,version
description: every schema registry request will emit a log with the following attributes: request_type and outcome. Some requests will also emit additional attributes such as schema_id, subject, etc., if applicable. This is disabled by default and requires setting the command-line flag -schemaRegistryEnableLogRequest or setting the environment variable WARPSTREAM_SCHEMA_REGISTRY_ENABLE_LOG_REQUEST=true.
[metrics] Num Invalid Records
metric: warpstream_schema_validation_invalid_record_count
group_by: topic,reason
type: counter
unit: n/a
description: counter of the number of invalid records that the agent detects when schema validation is enabled.
note that topic is only included as a tag if high-cardinality tags are enabled.
[metrics] Schema Linking Number of Source Subject Versions
metric: warpstream_schema_linking_source_subject_versions_count
group_by: sync_id,config_id
type: gauge
unit: n/a
description: number of source subject versions that Schema Linking is currently managing
[metrics] Schema Linking Number of Newly Migrated Subject Versions
metric: warpstream_schema_linking_newly_migrated_subject_versions
group_by: sync_id,config_id
type: gauge
unit: n/a
description: number of subject versions newly migrated by the latest sync. This should usually be zero unless new schemas are found; schemas only count as new for one sync, and the sync frequency is configurable with a default of 5m.
[metrics] Schema Versions Count
metric: warpstream_schema_versions_count
group_by:
type: gauge
unit: n/a
description: number of schemas currently in your registry.
[metrics] Schema Versions Limit
metric: warpstream_schema_versions_limit
group_by:
type: gauge
unit: n/a
description: number of schemas allowed in your registry.
Background Jobs
The control plane assigns agents background jobs for things like compaction or retention. These are the metrics and logs on the efficiency and status of background operations, with a focus on compaction processes and the scanning of obsolete files.
[logs] Compactions by Status and Level
query: service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info
metric: *
group_by: status,@stream_job_input.compaction.compaction_level
description: number of compactions by compaction level and status (success, error). Occasional compaction errors are normal and expected, but most compactions should succeed.
[metrics] Executed Jobs
metric: warpstream_agent_run_and_ack_job_outcome
group_by: job_type
tags: job_type,outcome
type: counter
unit: n/a
description: outcome (and successful acknowledgement back to the control plane) of all jobs, groupable by job type and outcome. Jobs may fail intermittently, but most jobs should complete successfully.
[metrics] Compaction Files per Level (Indicator of Compaction Lag)
metric: warpstream_files_count
group_by: compaction_level
type: gauge
unit: n/a
description: number of files in the LSM for each compaction level. The number of L2 files may be high for high volume / long retention workloads, but the number of files at L0 and L1 should always be low (< 1000 each).
[metrics] Dead Files Scanner: Checked vs Deleted Files
metric: warpstream_agent_deadscanner_outcomes
group_by: outcome
tags: outcome
type: counter
unit: n/a
description: the "deadscanner" is a job type in WarpStream that instructs the Agents to scan the object storage bucket for files that are "dead". A file is considered dead if it exists in the object store, but the WarpStream control plane / metadata store has no record of it, indicating it failed to be committed or was deleted by data expiration / compaction.
[logs] P99 Compaction Duration by Level
query: service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info
metric: @duration_ms
group_by: status,@stream_job_input.compaction.compaction_level
description: duration of compactions by compaction level. L1 and L2 compaction duration will vary based on workload, but L0 compactions should always be fast (<20 seconds).
[logs] Compaction File Output Size
query: status:info
metric: @stream_job_output.compaction.file_metadatas.index_offset
group_by: source,@stream_job_input.compaction.compaction_level
description: compressed size of files generated by compaction. Useful for understanding the size of different files at different levels, but not something that needs to be closely monitored or paid attention to.
Object Storage
Metrics and logs on object storage operations' performance and usage patterns, offering insights into data retrieval, storage efficiency, and caching mechanisms.
[metrics] S3 Operations (PUT)
metric: warpstream_blob_store_operation_latency
filter_tag: operation:put_bytes,operation:put_stream
group_by: operation
type: histogram
unit: seconds
description: latency to perform PUT requests. Spikes of this value can indicate issues with the underlying object store.
note this metric is a histogram, so even though it emits latency values, you can count the number of samples emitted to get the number of operations.
[metrics] S3 Operations (GET)
metric: warpstream_blob_store_operation_latency
filter_tag: operation:get_stream,operation:get_stream_range
group_by: operation
type: histogram
unit: seconds
description: latency for time to first byte for GET requests. Spikes of this value can indicate issues with the underlying object store.
note this metric is a histogram, so even though it emits latency values, you can count the number of samples emitted to get the number of operations.