Diagnostics

Diagnostics are WarpStream's feedback system. They surface actionable insights about the health, cost, and performance of your cluster. Each diagnostic measures a specific condition, evaluates its impact, and suggests how to address any issue it finds.

Dimensions of a Diagnostic

Each diagnostic has the following dimensions:

  • Type: The category of the diagnostic, such as Health or Cost.

  • Name: The component or system that the diagnostic evaluates.

  • Successful: Indicates whether the diagnostic has passed.

  • Severity: The impact level of the diagnostic, ranging from Low to Critical.

  • Muted: Specifies whether the diagnostic is muted and should not generate alerts.
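To make the five dimensions concrete, here is a minimal Python sketch of a diagnostic as a record. It is an illustration only, not WarpStream's data model: the class and field names are hypothetical, and the MEDIUM and HIGH severity levels are assumed, since the text above names only Low and Critical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Only LOW and CRITICAL are named in the text; the middle levels are assumed.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Diagnostic:
    diagnostic_type: str   # category, e.g. "Health" or "Cost"
    name: str              # component or system being evaluated
    successful: bool       # True when the diagnostic has passed
    severity: Severity     # impact level
    muted: bool            # True when alerts should be suppressed

    def needs_attention(self) -> bool:
        """A failing diagnostic that has not been muted."""
        return not self.successful and not self.muted


def most_urgent(diagnostics):
    """Return failing, unmuted diagnostics, highest severity first."""
    active = [d for d in diagnostics if d.needs_attention()]
    return sorted(active, key=lambda d: d.severity, reverse=True)
```

Sorting on severity after filtering out muted and passing diagnostics gives a simple triage order. A real alerting setup would likely add its own routing rules on top.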

Viewing Diagnostics

Diagnostics can be accessed in two ways:

  1. Console UI: Diagnostics are displayed in the WarpStream console, providing a visual overview of the cluster's health and performance.

  2. Agent Metrics: Diagnostics are also exposed as agent metrics, enabling integration with monitoring systems. Each diagnostic generates a warpstream_diagnostic_status metric with the following tags:

    • diagnostic_name: The name of the diagnostic.

    • diagnostic_type: The category of the diagnostic.

    • successful: Whether the diagnostic is successful.

    • muted: Whether the diagnostic is muted.

    These metrics can be used for custom monitoring and alerting setups.
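As a rough sketch of how these tags could drive a custom alert, the Python below scans metrics text and lists the diagnostics that failed and are not muted. Several things here are assumptions for illustration rather than confirmed WarpStream output: that the metrics are scraped in the Prometheus text exposition format, that the successful and muted tags carry the strings "true" and "false", and the sample diagnostic names and sample value of 1.

```python
import re

METRIC = "warpstream_diagnostic_status"
# Matches `metric_name{label="value",...} value` lines.
_LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def failing_diagnostics(exposition_text):
    """Return (diagnostic_type, diagnostic_name) pairs that failed and are not muted."""
    failing = []
    for line in exposition_text.splitlines():
        match = _LINE.match(line.strip())
        if not match or match.group(1) != METRIC:
            continue  # comments, blank lines, and other metrics
        labels = dict(_LABEL.findall(match.group(2)))
        # Assumed tag encoding: "true" / "false" strings.
        if labels.get("successful") == "false" and labels.get("muted") != "true":
            failing.append((labels.get("diagnostic_type"), labels.get("diagnostic_name")))
    return failing


# Hypothetical sample; real tag values may be formatted differently.
SAMPLE = '''\
# TYPE warpstream_diagnostic_status gauge
warpstream_diagnostic_status{diagnostic_name="agent_version",diagnostic_type="health",successful="true",muted="false"} 1
warpstream_diagnostic_status{diagnostic_name="file_sizes",diagnostic_type="cost",successful="false",muted="false"} 1
warpstream_diagnostic_status{diagnostic_name="tls_mismatch",diagnostic_type="health",successful="false",muted="true"} 1
'''

print(failing_diagnostics(SAMPLE))
```

In most monitoring systems the same filter (successful is false, muted is not true) can be written directly as an alert rule on the metric's tags. The script form is mainly useful for ad hoc checks.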

Diagnostics Catalog

Below is a catalog of diagnostics with a brief description of what each check covers.

  • ACL Denied: Detects access denials by ACLs (principal/resource/operation/API); verify privileges.

  • Agent Backpressure: Detects agent backpressure events, which indicate overload and may increase latency or throttling.

  • Agent Binpacked: Flags agents allocated ≤60% of instance vCPUs (bin-packed), which can hurt performance; use dedicated capacity.

  • Agent Connection Limit: Detects agents approaching their maximum connection limits; scale cores/agents or increase limits.

  • Agent Version: Detects agents running older versions; recommends upgrading to stay within supported and optimized releases.

  • Cluster Load per AZ: Checks average and P90 CPU load per Availability Zone to identify imbalanced or overloaded AZs.

  • Cluster Load per Group Role: Checks average and P90 CPU load per agent group/role combination to spot hotspots.

  • Cluster Total Load: Monitors average and P90 agent CPU load to keep usage below recommended thresholds.

  • Compacted topics with finite retention: Detects compacted-only topics configured with finite retention; suggests infinite retention or the compact,delete cleanup policy.

  • Consumer Groups Configs: Detects risky consumer group timeouts (rebalance/session/heartbeat) that can trigger rebalances.

  • Consumer Groups Memory Limit: Warns when consumer group memory usage nears configured limits.

  • Control Plane Connectivity: Detects agents unable to reach the control plane for sustained periods.

  • Control Plane Utilization (Global): Monitors overall control plane load across all request categories; warns when utilization is high.

  • Control Plane Utilization (Produce Requests): Tracks control plane load specifically from produce requests; warns when utilization is high.

  • Cross AZ Kafka Clients: Detects clients connecting across AZs (missing AZ hints), which increases latency and cross-AZ costs.

  • Embedded Pipeline: Detects pipelines running on agents that also run other roles; recommends dedicated pipeline agents.

  • Fetch Requests Size: Detects Kafka clients using fetch sizes that are too small, which increases overhead and reduces throughput.

  • Fetch Requests Size Limits: Detects client fetch limits that exceed agent limits (global or per-partition), causing caps and smaller fetches.

  • Fetch Requests Timeout: Detects client fetch timeouts that are too small, which can cause timeouts with larger fetch sizes.

  • File Sizes: Detects excessively small L0 files given throughput/agents; suggests fewer agents or larger batches.

  • GOMAXPROCS Environment Variable: Detects agents with GOMAXPROCS unset in constrained environments, which can degrade performance.

  • Instance Networking: Detects instance types that are not network-optimized; recommends network-optimized instances for better throughput/latency.

  • Inter-Agent Connectivity: Detects agents that cannot reach peers in the same AZ/group, which harms distributed cache efficiency and increases costs.

  • Invalid Role Request: Detects clients sending produce/fetch requests to agents with the wrong role (role-split deployments).

  • Kafka Client Version: Detects clients using old API versions; upgrade to modern Kafka client versions.

  • Known Bad Agent Versions: Detects agents running versions with known issues; recommends upgrading beyond the affected range(s).

  • Known Bad Kafka Clients: Detects client libraries/versions with known issues or poor idempotent performance.

  • L0 Compaction Lag: Detects excessive L0 file count, indicating compactions are falling behind.

  • L1 Compaction Lag: Detects excessive L1 file count, indicating compactions are falling behind.

  • Mixed Agent Versions: Detects multiple agent versions co-existing for too long; standardize versions across agents.

  • Missing Roles: Ensures at least one agent runs each critical role: jobs, proxy-produce, and proxy-consume.

  • Orbit Invalid Write to Topic: Detects client writes to topics still managed by Orbit; set the irreversible disable flag before producing directly.

  • Partitions Limit: Warns when approaching the cluster partitions limit.

  • Pipeline Not Runnable: Detects running pipelines that have no agents with the pipelines role in the relevant agent group(s).

  • Pipelines Missing Roles: Ensures each running pipeline group has agents with proxy-produce and proxy-consume in corresponding agent groups.

  • Pipelines Shared Rate Limit Exceeded: Detects Bento pipelines exceeding the per-agent shared rate limit for pipelines throughput.

  • Produce Batches: Warns when batches per second are very high, indicating suboptimal client batching/idempotence.

  • Produce Batches vs Throughput: Detects too many batches relative to bytes/s (small average batch size).

  • Produce Records Too Large: Detects produce attempts with records exceeding configured size limits.

  • Storage Bucket Throttled: Detects recent object storage throttling events from the cloud provider.

  • Stuck Consumer: Detects consumers that have not progressed for a while (lagging at a fixed offset).

  • Tableflow Records Skipped: Detects records skipped during Tableflow ingestion due to schema mismatches.

  • Tableflow tables Limit: Warns when approaching the Tableflow tables limit in a Datalake cluster.

  • TLS Mismatch: Detects TLS-enabled clients attempting TLS handshakes against non-TLS agents; indicates a config mismatch.

  • Topics Limit: Warns when approaching the cluster topics limit (active and total/soft-deleted).
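A few catalog entries state conditions precise enough to express as arithmetic. The Python sketch below mirrors three of them (Agent Binpacked, Missing Roles, and Produce Batches vs Throughput) purely to show what each condition means. The function names and input shapes are hypothetical, and the actual diagnostics are evaluated by WarpStream and may use additional signals or thresholds not described here.

```python
# Roles named by the Missing Roles entry.
REQUIRED_ROLES = {"jobs", "proxy-produce", "proxy-consume"}
# Threshold stated by the Agent Binpacked entry (allocated <= 60% of instance vCPUs).
BINPACKED_RATIO = 0.60


def is_binpacked(agent_vcpus, instance_vcpus):
    """True when the agent is allocated at most 60% of the instance's vCPUs."""
    return agent_vcpus / instance_vcpus <= BINPACKED_RATIO


def missing_roles(agent_roles):
    """Given one set of roles per agent, return the critical roles nobody runs."""
    present = set().union(*agent_roles)
    return sorted(REQUIRED_ROLES - present)


def average_batch_bytes(bytes_per_sec, batches_per_sec):
    """Average produce batch size; a small value means many batches per byte."""
    if batches_per_sec == 0:
        return 0.0
    return bytes_per_sec / batches_per_sec


print(is_binpacked(4, 8))                                   # 4 of 8 vCPUs is 50%
print(missing_roles([{"proxy-produce"}, {"proxy-consume"}]))
print(average_batch_bytes(1_000_000, 500))
```

For example, an agent given 4 of an instance's 8 vCPUs sits at 50% and would count as bin-packed under the stated threshold, while a cluster whose agents run only proxy-produce and proxy-consume would be missing the jobs role.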
