# Important Metrics and Logs

Before reading this documentation page, please familiarize yourself with [how logs and metrics are emitted from the Agents](https://docs.warpstream.com/warpstream/agent-setup/monitor-the-warpstream-agents).

{% hint style="warning" %}
**Enable High-Cardinality Tags**

Some metrics support the use of tags with potentially high cardinality, such as tags based on topics. **This feature is disabled by default**.

**To enable high-cardinality tags:**

* Use the command-line flag `-kafkaHighCardinalityMetrics`.
* Alternatively, set the environment variable `WARPSTREAM_KAFKA_HIGH_CARDINALITY_METRICS=true`.

Tags that require enabling are clearly marked with "<mark style="color:red;">(requires enabling high-cardinality tags)</mark>" next to their name.

Furthermore, per-topic distribution / histogram metrics have 10x to 20x higher cardinality than even the regular high-cardinality metrics. For that reason, high-cardinality (per-topic) metrics of type `histogram` will not be emitted with per-topic tags unless the `-kafkaHighCardinalityDistributionMetrics` flag or the `WARPSTREAM_KAFKA_HIGH_CARDINALITY_DISTRIBUTION_METRICS` environment variable is set to true.
{% endhint %}
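The settings above can be applied when launching the Agents; a minimal sketch using the environment-variable form (the command-line flags described above are equivalent):

```shell
# Enable per-topic (high-cardinality) tags on regular metrics.
export WARPSTREAM_KAFKA_HIGH_CARDINALITY_METRICS=true

# Additionally required for per-topic tags on histogram / distribution
# metrics, which have 10x to 20x higher cardinality still.
export WARPSTREAM_KAFKA_HIGH_CARDINALITY_DISTRIBUTION_METRICS=true
```

Both variables must be set before the Agent process starts so that they are visible in its environment.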

{% hint style="info" %}
**Datadog metrics**

Starting with WarpStream Agent `v679`, all metrics on Datadog start with `warpstream.` instead of `warpstream_`. References to metrics in our docs will continue to use the `warpstream_` prefix, so you will need to convert the names when using Datadog with a sufficiently recent WarpStream Agent.

You can fall back to the previous behavior by setting the `WARPSTREAM_DATADOG_NORMALIZER_PREFIX_WITH_DOT` environment variable to `false`.

This change accompanies the official release of our Datadog integration, which makes all WarpStream Agent metrics free if you install the integration (and use the new naming convention).
{% endhint %}

## Overview

This section covers system performance metrics and logs, with a focus on error detection and resource consumption.

<mark style="color:green;">**\[logs]**</mark>**&#x20;Error Logs**

* query: `status:error`
* metric: \*
* note that some amount of error logs can be normal depending on the situation. Please contact the WarpStream team if you think a particular error log is too noisy!

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Memory usage by host**

* metric: `container.memory.usage`
* group\_by: `host`

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Used Cores**

* metric: `container.cpu.usage`
* group\_by:

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Used Cores By Host**

* metric: `container.cpu.usage`
* group\_by: `host`

## Kafka

Metrics and logs associated with the Kafka protocol provide insights into message handling, latency, and throughput.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Produce Throughput (uncompressed bytes)**

* metric: `warpstream_agent_kafka_produce_uncompressed_bytes_counter`
* group\_by: `topic` <mark style="color:red;">(requires enabling high-cardinality tags)</mark>
* type: counter
* unit: bytes
* description: number of uncompressed bytes that were produced.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Produce Throughput (compressed bytes)**

* metric: `warpstream_agent_kafka_produce_compressed_bytes_counter`
* type: counter
* unit: bytes
* description: number of compressed bytes that were produced.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Produce Throughput (records)**

* metric: `warpstream_agent_segment_batcher_flush_num_records_counter`
* group\_by:
* type: counter
* unit: records
* description: number of records that were produced.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Fetch Throughput (uncompressed bytes)**

* metric: `warpstream_agent_kafka_fetch_uncompressed_bytes_counter`
* group\_by: `topic` <mark style="color:red;">(requires enabling high-cardinality tags)</mark>
* type: counter
* unit: bytes
* description: number of uncompressed bytes that were fetched.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Fetch Throughput (compressed bytes)**

* metric: `warpstream_agent_kafka_fetch_compressed_bytes_counter`
* group\_by: `topic` <mark style="color:red;">(requires enabling high-cardinality tags)</mark>
* type: counter
* unit: bytes
* description: number of compressed bytes that were fetched.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Consumer Groups Lag**

* metric: `warpstream_consumer_group_lag`
* group\_by: `virtual_cluster_id`, `topic`, `consumer_group`, `partition`
* tags: `virtual_cluster_id`, `topic`, `consumer_group`, `partition`
* type: gauge
* unit: Kafka offsets
* description: consumer group lag measured in *offsets*.
* note that this is not a high-cardinality metric, but you can reduce its cardinality by configuring a list of the tags above to omit via the `disableConsumerGroupsMetricsTags` flag (see [agent configuration documentation](https://docs.warpstream.com/warpstream/kafka/advanced-agent-deployment-options/agent-configuration))
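As a sketch, the flag could be passed at Agent startup like this; the invocation below is hypothetical (other required flags are omitted, and the exact list syntax for the flag is an assumption), so consult the agent configuration documentation for the precise form:

```shell
# Hypothetical invocation; other required Agent flags are omitted.
# Drops the partition tag from warpstream_consumer_group_lag to
# reduce the metric's cardinality. The value syntax is an assumption;
# see the agent configuration documentation.
warpstream agent \
  -disableConsumerGroupsMetricsTags partition
```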

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Kafka Inflight Connections**

* metric: `warpstream_agent_kafka_inflight_connections`
* group\_by:
* type: gauge
* unit: n/a
* description: number of currently inflight / active connections.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Kafka Inflight Requests (per Connection)**

* metric: `warpstream_agent_kafka_inflight_request_per_connection`
* group\_by:
* type: histogram
* unit: n/a
* description: number of currently in-flight Kafka protocol requests for individual connections.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Kafka Request Outcome**

* metric: `warpstream_agent_kafka_request_outcome`
* group\_by: `kafka_key`, `outcome`
* type: counter
* unit: n/a
* description: outcome (success, error, etc) for each Kafka protocol request.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Kafka Request Latency**

* metric: `warpstream_agent_kafka_request_latency`
* group\_by: `kafka_key`
* type: histogram
* unit: seconds
* description: latency for processing each Kafka protocol request.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Max Offset**

* metric: `warpstream_max_offset`
* group\_by: `virtual_cluster_id`, `topic`, `partition`
* tags: `virtual_cluster_id`, `topic`, `partition`
* type: gauge
* unit: Kafka offset
* description: max offset for every topic-partition. Can be useful for monitoring lag for applications that don't use consumer groups and manage offsets externally.
* note that this is not a high-cardinality metric, but you can reduce its cardinality by configuring a list of the tags above to omit via the `disableConsumerGroupsMetricsTags` flag (see [agent configuration documentation](https://docs.warpstream.com/warpstream/kafka/advanced-agent-deployment-options/agent-configuration))

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Kafka Topic Count**

* metric: `warpstream_topics_count`
* group\_by: `virtual_cluster_id`
* type: gauge
* unit: n/a
* description: how many topics are currently in your cluster

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Kafka Topic Limit**

* metric: `warpstream_topics_count_limit`
* group\_by: `virtual_cluster_id`
* type: gauge
* unit: n/a
* description: how many topics are allowed in your cluster. Upgrade your cluster tier or contact the WarpStream team if more are needed

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Kafka Partition Count**

* metric: `warpstream_partitions_count`
* group\_by: `virtual_cluster_id`
* type: gauge
* unit: n/a
* description: how many partitions are currently in your cluster

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Kafka Partition Limit**

* metric: `warpstream_partitions_count_limit`
* group\_by: `virtual_cluster_id`
* type: gauge
* unit: n/a
* description: how many partitions are allowed in your cluster. Upgrade your cluster tier or contact the WarpStream team if more are needed

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Kafka Records Count**

* metric: `warpstream_num_records`
* group\_by: `topic`, `virtual_cluster_id`
* type: gauge
* unit: n/a
* description: how many records are currently in a given topic or cluster
* note this number might not match the number of active keys if you are using compacted topics

## Control Plane

Visualizing WarpStream control plane latency and error rates can be useful for debugging.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Operations Outcome**

* metric: `warpstream_agent_control_plane_operation_counter`
* group\_by: `virtual_cluster_id`, `outcome`, `operation`
* tags: `virtual_cluster_id`, `outcome`, `operation`
* type: counter
* unit: request
* description: count of RPCs between Agents and control plane, groupable by outcome (success/error) and operation (RPC type).

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Operations Latency**

* metric: `warpstream_agent_control_plane_operation_latency`
* group\_by: `virtual_cluster_id`, `outcome`, `operation`
* tags: `virtual_cluster_id`, `outcome`, `operation`
* type: histogram
* unit: seconds
* description: latency of RPCs between Agents and control plane, groupable by outcome (success/error) and operation (RPC type).

## Schema Registry

Metrics and logs associated with the hosted schema registry provide insights into request handling, latency, and throughput.

{% hint style="warning" %}
**Enable Schema Registry Request Logs**

Logging schema registry requests is **disabled by default**.

**To enable request logs:**

* Use the command-line flag `-schemaRegistryEnableLogRequest`.
* Alternatively, set the environment variable `WARPSTREAM_SCHEMA_REGISTRY_ENABLE_LOG_REQUEST=true`.

{% endhint %}
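A minimal sketch of enabling these request logs via the environment-variable form (equivalent to the command-line flag):

```shell
# Enable schema registry request logs (disabled by default).
export WARPSTREAM_SCHEMA_REGISTRY_ENABLE_LOG_REQUEST=true

# The equivalent command-line flag would instead be passed at startup:
# warpstream agent -schemaRegistryEnableLogRequest ...
```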

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Schema Registry Requests**

* metric: `warpstream_agent_schema_registry_outcome`
* group\_by: `schema_registry_operation`, `outcome`
* type: counter
* unit: n/a
* description: outcome (success, error, etc) for each Schema Registry request.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Schema Registry Latency**

* metric: `warpstream_agent_schema_registry_request_latency`
* group\_by: `schema_registry_operation`, `outcome`
* type: histogram
* unit: seconds
* description: latency for processing each Schema Registry request.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Schema Registry Inflight Connections**

* metric: `warpstream_agent_schema_registry_inflight_connections`
* group\_by:
* type: gauge
* unit: n/a
* description: number of currently inflight / active connections.

<mark style="color:green;">**\[logs]**</mark>**&#x20;Schema Registry Request Logs**

* query: `service:sr-agent schema_registry_request`
* group\_by: `outcome`, `request_type`, `schema_id`, `subject`, `version`
* description: every schema registry request emits a log with the `request_type` and `outcome` attributes. Some requests also emit additional attributes such as `schema_id`, `subject`, etc. where applicable. This is disabled by default and requires setting the command-line flag `-schemaRegistryEnableLogRequest` or the environment variable `WARPSTREAM_SCHEMA_REGISTRY_ENABLE_LOG_REQUEST=true`.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Num Invalid Records**

* metric: `warpstream_schema_validation_invalid_record_count`
* group\_by: `topic`, `reason`
* type: counter
* unit: n/a
* description: counter of the number of invalid records that the agent detects when schema validation is enabled.
* note that the `topic` tag is only emitted if high-cardinality tags are enabled.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Schema Linking Number of Source subject versions**

* metric: `warpstream_schema_linking_source_subject_versions_count`
* group\_by: `sync_id`, `config_id`
* type: gauge
* unit: n/a
* description: number of source subject versions that Schema Linking is currently managing

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Schema Linking Number of Newly Migrated Subject Versions**

* metric: `warpstream_schema_linking_newly_migrated_subject_versions`
* group\_by: `sync_id`, `config_id`
* type: gauge
* unit: n/a
* description: number of subject versions newly migrated by the latest sync. This should usually be zero unless new schemas are found. Schemas are only counted as new for one sync, and the sync frequency is configurable with a default of 5m.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Schema Versions Count**

* metric: `warpstream_schema_versions_count`
* group\_by:
* type: gauge
* unit: n/a
* description: number of schemas currently in your registry.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Schema Versions Limit**

* metric: `warpstream_schema_versions_limit`
* group\_by:
* type: gauge
* unit: n/a
* description: number of schemas allowed in your registry.

## Background Jobs

The control plane assigns agents background jobs for things like compaction or retention. These are the metrics and logs on the efficiency and status of background operations, with a focus on compaction processes and the scanning of obsolete files.

<mark style="color:green;">**\[logs]**</mark>**&#x20;Compactions by Status and Level**

* query: `service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info`
* metric: \*
* group\_by: `status`, `@stream_job_input.compaction.compaction_level`
* description: number of compactions by compaction level and status (success, error). Occasional compaction errors are normal and expected, but most compactions should succeed.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Executed Jobs**

* metric: `warpstream_agent_run_and_ack_job_outcome`
* group\_by: `job_type`
* tags: `job_type`, `outcome`
* type: counter
* unit: n/a
* description: outcome (and successful acknowledgement back to the control plane) of all jobs, groupable by job type and outcome. Jobs may fail intermittently, but most jobs should complete successfully.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Compaction Files per Level (Indicator of Compaction Lag)**

* metric: `warpstream_files_count`
* group\_by: `compaction_level`
* type: gauge
* unit: n/a
* description: number of files in the LSM for each compaction level. The number of L2 files may be high for high volume / long retention workloads, but the number of files at L0 and L1 should always be low (< 1000 each).

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;Dead Files Scanner: Checked vs Deleted Files**

* metric: `warpstream_agent_deadscanner_outcomes`
* group\_by: `outcome`
* tags: `outcome`
* type: counter
* unit: n/a
* description: the "deadscanner" is a job type in WarpStream that instructs the Agents to scan the object storage bucket for files that are "dead". A file is considered dead if it exists in the object store, but the WarpStream control plane / metadata store has no record of it, indicating it failed to be committed or was deleted by data expiration / compaction.

<mark style="color:green;">**\[logs]**</mark>**&#x20;P99 Compaction Duration by Level**

* query: `service:warp-agent @stream_job_input.type:COMPACTION_JOB_TYPE status:info`
* metric: `@duration_ms`
* group\_by: `status`, `@stream_job_input.compaction.compaction_level`
* description: duration of compactions by compaction level. L1 and L2 compaction duration will vary based on workload, but L0 compactions should always be fast (<20 seconds).

<mark style="color:green;">**\[logs]**</mark>**&#x20;Compaction File Output Size**

* query: `status:info`
* metric: `@stream_job_output.compaction.file_metadatas.index_offset`
* group\_by: `source`, `@stream_job_input.compaction.compaction_level`
* description: compressed size of files generated by compaction. Useful for understanding the size of different files at different levels, but not something that needs to be closely monitored or paid attention to.

## Object Storage

Metrics and logs on object storage operations' performance and usage patterns, offering insights into data retrieval, storage efficiency, and caching mechanisms.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;S3 Operations (PUT)**

* metric: `warpstream_blob_store_operation_latency`
* filter\_tag: `operation:put_bytes`, `operation:put_stream`
* group\_by: `operation`
* type: histogram
* unit: seconds
* description: latency to perform PUT requests. Spikes of this value can indicate issues with the underlying object store.
* note that because this metric is a histogram, you can count the number of emitted samples to derive the number of operations, even though the recorded values are latencies.

<mark style="color:blue;">**\[metrics]**</mark>**&#x20;S3 Operations (GET)**

* metric: `warpstream_blob_store_operation_latency`
* filter\_tag: `operation:get_stream`, `operation:get_stream_range`
* group\_by: `operation`
* type: histogram
* unit: seconds
* description: latency for time to first byte for GET requests. Spikes of this value can indicate issues with the underlying object store.
* note that because this metric is a histogram, you can count the number of emitted samples to derive the number of operations, even though the recorded values are latencies.
