Important Metrics and Logs

On this page, we include a sample list the most important logs and metrics emitted by the Agents.

Before reading this documentation page, please familiarize yourself with how logs and metrics are emitted from the Agents.

Enable High-Cardinality Tags

Some metrics support the use of tags with potentially high cardinality, such as tags based on topics. This feature is disabled by default.

To enable high-cardinality tags:

Use the command-line flag -kafkaHighCardinalityMetrics.
Alternatively, set the environment variable WARPSTREAM_KAFKA_HIGH_CARDINALITY_METRICS=true.

Tags that require enabling are clearly marked with "(requires enabling high-cardinality tags)" next to their name.

Furthermore, per-topic distribution / histogram metrics have 10x to 20x higher cardinality than even the regular high cardinality metrics, so any high cardinality (per-topic) metrics with type histogram will not be emitted with per-topic tags unless the -kafkaHighCardinalityDistributionMetrics flag or WARPSTREAM_KAFKA_HIGH_CARDINALITY_DISTRIBUTIONS_METRICS environment variable is set to true.

Datadog metrics

Starting from Warpstream Agent v679 all metrics on Datadog will start with warpstream. and no longer warpstream_ . All references to metrics in our doc will keep mentioning metrics starting with warpstream_ so you have to do the conversion when you are using Datadog and a Warpstream Agent recent enough.

You can fall back to the previous behavior by setting the WARPSTREAM_DATADOG_NORMALIZER_PREFIX_WITH_DOT environment variable to false.

This change comes along the official release of our Datadog integration, making all the Warpstream Agent metrics free if you install the integration (and use the new naming convention).

Overview

System performance metrics and logs, focusing on error detection and resource consumption.

[logs] Error Logs
- query: status:error
- metric: *
[metrics] Memory usage by host
- metric: container.memory.usage
- group_by: host
[metrics] Used Cores
- metric: container.cpu.usage
- group_by:
[metrics] Used Cores By Host
- metric: container.cpu.usage
- group_by: host
[metrics] Circuit Breaker(multiple metrics)
- metric 1
  - warpstream_circuit_breaker_count
  - tags: name state
  - type: counter
  - unit: n/a
- metric 2
  - warpstream_circuit_breaker_permit
  - tags: name outcome
  - type: counter
  - unit: n/a
- metric 3
  - warpstream_circuit_breaker_hits
  - tags: name outcome
  - type: counter
  - unit: n/a
- metric 4
  - warpstream_circuit_breaker_state_set
  - tags: name state
  - type: gauge
  - unit: n/a
  - Note that this metric is similar to the warpstream_circuit_breaker_count but is a gauge, with values either 0 or 1, instead of a counter. It can be used to determine the current state of the circuit breaker.

Kafka

Metrics and logs associated with the Kafka protocol provide insights into message handling, latency, and throughput.

[metrics] Kafka Inflight Connections
- metric: warpstream_agent_kafka_inflight_connections
- group_by:
- type: gauge
- unit: n/a
- description: number of currently inflight / active connections.
[metrics] Kafka Inflight Requests (per Connection)
- metric: warpstream_agent_kafka_inflight_request_per_connection
- group_by:
- type: histogram
- unit: n/a
- description: number of currently in-flight Kafka protocol requests for individual connections.
[metrics] Kafka Latency
- metric: warpstream_agent_kafka_request_latency
- group_by: kafka_key
- type: histogram
- unit: seconds
- description: latency for processing each Kafka protocol request.
[metrics] Kafka Requests
- metric: warpstream_agent_kafka_request_outcome
- group_by: kafka_key,outcome
- type: counter
- unit: n/a
- description: outcome (success, error, etc) for each Kafka protocol request.
[metrics] Fetch Max Pointers in a Single Request
- metric: warpstream_agent_kafka_fetch_num_pointers_distribution
- group_by:
- type: histogram
- unit: n/a
- description: distribution/histogram of number of different pointers to be processed in a single fetch request.
[metrics] Fetch Partial Bytes Due to Errors
- metric: warpstream_agent_kafka_fetch_partial_response_error_scenario_num_bytes_distribution
- group_by: source
- type: histogram
- unit: bytes
- description: number of bytes that were returned from a fetch request as a result of an error that was encountered while trying to process the fetch request to completion. Can be useful for debugging performance issues, but not usually indicative of any issue.
[metrics] Fetch Throughput (uncompressed bytes)
- metric: warpstream_agent_kafka_fetch_uncompressed_bytes_counter
- group_by: topic (requires enabling high-cardinality tags)
- type: counter
- unit: bytes
- description: number of uncompressed bytes that were fetched.
[metrics] Fetch Throughput (compressed bytes)
- metric: warpstream_agent_kafka_fetch_compressed_bytes_counter
- group_by: topic (requires enabling high-cardinality tags)
- type: counter
- unit: bytes
- description: number of compressed bytes that were fetched.
[metrics] Produce Throughput (uncompressed bytes)
- metric: warpstream_agent_kafka_produce_uncompressed_bytes_counter
- group_by: topic (requires enabling high-cardinality tags)
- type: counter
- unit: bytes
- description: number of uncompressed bytes that were produced.
[metrics] Produce Throughput (compressed bytes)
- metric: warpstream_agent_kafka_produce_compressed_bytes_counter
[metrics] Produce Throughput (records)
- metric: warpstream_agent_segment_batcher_flush_num_records_counter
- group_by:
- type: counter
- unit: bytes
- description: number of records that were produced.
[metrics] Consumer Groups Lag
- metric: warpstream_consumer_group_lag
- group_by: virtual_cluster_id, topic, consumer_group, partition
- tags: virtual_cluster_id, topic, consumer_group, partition
- type: gauge
- unit: Kafka offsets
- description: consumer group lag measured in offsets.
- note that this is not a high cardinality metric, but you can configure a list of the tags above to not publish to decrease this metrics cardinality through the disableConsumerGroupsMetricsTags flag (see agent configuration documentation)
[metrics] Max Offset
- metric: warpstream_max_offset
- group_by: virtual_cluster_id, topic, partition
- tags: virtual_cluster_id, topic, partition
- type: gauge
- unit: Kafka offset
- description: max offset for every topic-partition. Can be useful for monitoring lag for applications that don't use consumer groups and manage offsets externally.
- note that this is not a high cardinality metric, but you can configure a list of the tags above to not publish to decrease this metrics cardinality through the disableConsumerGroupsMetricsTags flag (see agent configuration documentation)

Control Plane

Visualizing WarpStream control plane latency and error rates can be useful for debugging.

[metrics] Operations Outcome

metric: warpstream_agent_control_plane_operation_counter
group_by: virtual_cluster_id, outcome, operation
tags: virtual_cluster_id, outcome, operation
type: counter
unit: request
description: count of RPCs between Agents and control plane, groupable by outcome (success/error) and operation (RPC type).

[metrics] Operations Latency

metric: warpstream_agent_control_plane_operation_latency
group_by: virtual_cluster_id, outcome, operation
tags: virtual_cluster_id, outcome, operation
type: histogram
unit: seconds
description: latency of RPCs between Agents and control plane, groupable by outcome (success/error) and operation (RPC type).

Schema Registry

Metrics and logs associated with the hosted schema registry provide insights into request handling, latency, and throughput.

Enable Schema Registry Request Logs

Logging schema registry requests is disabled by default.

To enable request logs:

Use the command-line flag -schemaRegistryEnableLogRequest.
Alternatively, set the environment variable WARPSTREAM_SCHEMA_REGISTRY_ENABLE_LOG_REQUEST=true.

[logs] Schema Registry Request Logs

query: service:sr-agent schema_registry_request
group_by: outcome, request_type, schema_id, subject, version
description: every schema registry request will emit a log with the following attributes: request_type and outcome. Some requests will also emit additional attributes such as schema_id, subject, etc if applicable. This is disabled by default and requires setting the command line flag -schemaRegistryEnableLogRequest or setting the environment variable WARPSTREAM_SCHEMA_REGISTRY_ENABLE_LOG_REQUEST=true

[metrics] Schema Registry Inflight Connections

metric: warpstream_agent_schema_registry_inflight_connections
group_by:
type: gauge
unit: n/a
description: number of currently inflight / active connections.

[metrics] Schema Registry Latency

metric: warpstream_agent_schema_registry_request_latency
group_by: schema_registry_operation,outcome
type: histogram
unit: seconds
description: latency for processing each Schema protocol request.

[metrics] Schema Registry Requests

metric: warpstream_agent_schema_registry_outcome
group_by: schema_registry_operation,outcome
type: counter
unit: n/a
description: outcome (success, error, etc) for each Schema Registry request.

[metrics] Request Throughput

metric: warpstream_agent_schema_registry_request_bytes_counter
group_by: schema_registry_operation
type: counter
unit: bytes
description: number of bytes of incoming requests.

[metrics] Response Throughput

metric: warpstream_agent_schema_registry_response_bytes_counter
group_by: schema_registry_operation,outcome
type: counter
unit: bytes
description: number of bytes of outgoing response.

Background Jobs

Metrics and logs on efficiency and status of background operations, with a focus on compaction processes and the scanning of obsolete files.

[logs] Compaction File Output Size
- query: status:info
- metric: @stream_job_output.compaction.file_metadatas.index_offset
- group_by: source,@stream_job_input.compaction.compaction_level
- description: compressed size of files generated by compaction. Useful for understanding the size of different files at different levels, but not something that needs to be closely monitored or paid attention to.
[logs] Compactions by Status and Level
- query: service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info
- metric: *
- group_by: status,@stream_job_input.compaction.compaction_level
- description: number of compactions by compaction level and status (success, error). Occasional compaction errors are normal and expected, but most compactions should succeed.
[metrics] Compaction Files per Level (Indicator of Compaction Lag)
- metric: warpstream_files_count
- group_by: compaction_level
- type: gauge
- unit: n/a
- description: number of files in the LSM for each compaction level. The number of L2 files may be high for high volume / long retention workloads, but the number of files at L0 and L1 should always be low (< 1000 each).
[logs] P99 Compaction Duration by Level
- query: service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info
- metric: @duration_ms
- group_by: status,@stream_job_input.compaction.compaction_level
- description: duration of compactions by compaction level. L1 and L2 compaction duration will vary based on workload, but L0 compactions should always be fast (<20 seconds).
[metrics] Dead Files Scanner: Checked vs Deleted Files
- metric: warpstream_agent_deadscanner_outcomes
- group_by: outcome
- tags: outcome
- type: counter
- unit: n/a
- description: the "deadscanner" is a job type in WarpStream that instructs the Agents to scan the object storage bucket for files that are "dead". A file is considered dead if it exists in the object store, but the WarpStream control plane / metadata store has no record of it, indicating it failed to be committed or was deleted by data expiration / compaction.
[metrics] Executed Jobs
- metric: warpstream_agent_run_and_ack_job_outcome
- group_by: job_type
- tags: job_type outcome
- type: counter
- unit: n/a
- description: outcome (and successful acknowledgement back to the control plane) of all jobs, groupable by job type and outcome. Jobs may fail intermittently, but most jobs should complete successfully.

Object Storage

Metrics and logs on object storage operations' performance and usage patterns, offering insights into data retrieval, storage efficiency, and caching mechanisms.

[metrics] S3 Operations (GET)
- warpstream_blob_store_operation_latency
  - filter_tag: operation:get_stream, operation:get_stream_range
  - group_by: operation
  - type: histogram
  - unit: seconds
  - description: latency for time to first byte for GET requests. Spikes of this value can indicate issues with the underlying object store.
  Note: this metric is a histogram, so even if it emits latency, you can count the number of items emitted and get the number of operations.
[metrics] S3 Operations (PUT)
- warpstream_blob_store_operation_latency
  - filter_tag: operation:put_bytes, operation:put_stream
  - group_by: operation
  - type: histogram
  - unit: seconds
  - description: latency to perform PUT requests. Spikes of this value can indicate issues with the underlying object store.
  Note: this metric is a histogram, so even if it emits latency, you can count the number of items emitted and get the number of operations.
[metrics] Bytes copied into cache
- metric: warpstream_agent_file_cache_server_get_range_copy_chunk_num_bytes_copied
- group_by: host
- tags: outcome
- type: counter
- unit: bytes
- description: number of bytes that have been read out of the object store and copied into the Agents in-memory, distributed file cache.
[metrics] Cache Size (Bytes)
- metric: warpstream_agent_file_cache_server_chunk_cache_curr_size_bytes
- group_by: host
- type: gauge
- unit: bytes
- description: size of the file cache in bytes.
[metrics] Cache Size (Entries)
- metric: warpstream_agent_file_cache_server_chunk_cache_num_entries
- group_by: host
- type: gauge
- unit: n/a
- description: number of entries in the file cache.
[metrics] Direct or Remote Loads
- metric: warpstream_agent_file_cache_client_fetch_local_or_remote_counter
- group_by: source
- tags: outcome source
- type: counter
- unit: n/a
- description: counter indicating how the file cache decided to load data. remote and remote_batched indicate that the data to be fetched was small enough that the file cache decided to request the data from the Agent that is responsible for the requested byte range. direct means the data to be fetched was large enough that the file cache decided to load the data directly from object storage instead of requesting it from the Agent responsible for that byte range (this is an optimization that saves some networking and CPU for large reads that are already cost-effective).
[metrics] Fetch Pointers
- metric: warpstream_agent_kafka_fetch_num_pointers_counter
- group_by:
- type: counter
- unit: n/a
- description: number of pointers required to serve fetch requests. There is no "right" value for this metric, but large increases in this value indicate that serving consumers is more difficult for the Agents and is usually caused by compaction lag (particularly L0/L1 lag), or an increase in the number of partitions that are being actively written to / queried.
[metrics] File Cache Bytes Transferred (Server)
- metric: warpstream_agent_file_cache_server_get_stream_range_num_bytes_count
- group_by: outcome,host
- type: counter
- unit: bytes
- description: number of bytes served by the Agent's distributed file cache server. This effectively indicates the amount of inter-Agent communication that is happening (within an availability zone) to power the distributed file cache that powers the fetch path.
File Cache Latency (Client)
- metric: warpstream_agent_file_cache_client_get_stream_range_latency
- group_by: outcome
- type: histogram
- unit: seconds
- description: observed latency for requesting data in the distributed file cache from an Agent acting as a client from another Agent acting as a server.
[metrics] File Cache Latency (Server)
- metric: warpstream_agent_file_cache_server_get_stream_range_latency
- group_by: outcome, host
- type: histogram
- unit: seconds
- description: observed latency for serving data in the distributed file cache as measured by the Agent acting as the server.
[metrics] File Cache Outcomes (Server)
- metric: warpstream_agent_file_cache_server_get_stream_range_outcome
- group_by: outcome, host
- type: counter
- unit: n/a
- description: outcome of distributed file cache operation (success, error, etc) as measured by the Agent acting as the server.
[metrics] File Cache Per Request Bytes Read Average (Server)
- metric: warpstream_agent_file_cache_server_get_stream_range_num_bytes_distribution
- group_by: outcome
- type: histogram
- unit: bytes
- description: distribution / histogram of the number of bytes requested in individual RPCs to the distributed file cache as measured by the Agent acting as the server.
[metrics] Num Bytes Fetched by Size
- metric: warpstream_agent_file_cache_server_fetch_size_num_bytes_counter
- group_by: fetch_size
- type: counter
- unit: bytes
- description: number of bytes requests from the underlying object store (via GET requests) by the distributed file cache, groupable by fetch_size which should usually be one of 1 MiB or 4 MiB.
[metrics] Num Fetches by Size
- metric: warpstream_agent_file_cache_server_fetch_size_counter
- group_by: fetch_size
- type: counter
- unit: n/a
- description: counter of the number of GET requests issued to the underlying object store by the distributed file cache, groupable by fetch_size which should usually be one of 1 MiB or 4 MiB.
- Note that this is a sister metric to warpstream_agent_file_cache_server_fetch_size_num_bytes_counter which counts the number of time the bytes counter is incremented.

Schema Validation

[metrics] Num Invalid Records
- metric: warpstream_chema_validation_invalid_record_count
- group_by: topic, reason
- type: counter
- unit: n/a
- description: counter of the number of invalid records that the agent detects when schema validation is enabled. Note that the topic would only be a tag if high cardinality is enabled.

Schema Linking

[metrics] Num source subject versions
- metric: warpstream_schema_linking_source_subject_versions_count
- group_by: sync_id, config_id
- type: gauge
- unit: n/a
- description: number of source subject versions that Schema Linking detected
[metrics] Num Newly Migrated Subject Versions
- metric: warpstream_schema_linking_newly_migrated_subject_versions
- group_by: sync_id, config_id
- type: gauge
- unit: n/a
- description: number of newly migrated subject versions performed by the sync

PreviousPre-made Grafana Dashboard NextRecommended List of Alerts

Last updated 27 days ago

Was this helpful?