Diagnostics
Diagnostics are Warpstream's feedback system, designed to provide actionable insights about the health, cost, and performance of your cluster. Each diagnostic measures specific conditions, evaluates their impact, and provides suggestions to address potential issues.
Dimensions of a Diagnostic
Each diagnostic has the following dimensions:
Type: The category of the diagnostic, such as Health or Cost.
Name: The component or system that the diagnostic evaluates.
Successful: Indicates whether the diagnostic has passed.
Severity: The impact level of the diagnostic, ranging from Low to Critical.
Muted: Specifies whether the diagnostic is muted and should not generate alerts.
Viewing Diagnostics
Diagnostics can be accessed in two ways:
Console UI: The diagnostics are displayed in the Warpstream console interface, providing a visual overview of the cluster's health and performance.
Agent Metrics: Diagnostics are also exposed as agent metrics, enabling integration with monitoring systems. Each diagnostic generates a warpstream_diagnostic_status metric with the following tags:
diagnostic_name: The name of the diagnostic.
diagnostic_type: The category of the diagnostic.
successful: Whether the diagnostic is successful.
muted: Whether the diagnostic is muted.
These metrics can be used for custom monitoring and alerting setups.
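As a rough illustration of a custom alerting setup, the sketch below filters scraped metrics down to diagnostics that failed and are not muted. It assumes the agent metrics are available in Prometheus text format and that the successful and muted tags carry the strings "true" and "false"; neither is confirmed here. The diagnostic names, types, and sample values in SAMPLE are made up.

```python
import re

# Matches one Prometheus text-format sample line for the diagnostic metric.
SAMPLE_RE = re.compile(
    r'^warpstream_diagnostic_status\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
# Matches a single name="value" tag pair inside the braces.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_diagnostics(metrics_text):
    """Return one dict of tags per warpstream_diagnostic_status sample."""
    diagnostics = []
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:
            diagnostics.append(dict(LABEL_RE.findall(match.group("labels"))))
    return diagnostics


def needs_alert(tags):
    """Alert only on diagnostics that failed and are not muted (assumed encoding)."""
    return tags.get("successful") == "false" and tags.get("muted") == "false"


# Hypothetical scrape output; names, types, and values are illustrative only.
SAMPLE = '''
# TYPE warpstream_diagnostic_status gauge
warpstream_diagnostic_status{diagnostic_name="agent_version",diagnostic_type="health",successful="false",muted="false"} 1
warpstream_diagnostic_status{diagnostic_name="file_sizes",diagnostic_type="cost",successful="false",muted="true"} 1
warpstream_diagnostic_status{diagnostic_name="topics_limit",diagnostic_type="health",successful="true",muted="false"} 1
'''

alerts = [d["diagnostic_name"] for d in parse_diagnostics(SAMPLE) if needs_alert(d)]
print(alerts)
```

In a real setup the same filter would more likely be written as an alert rule in the monitoring system itself; the Python version only makes the tag logic explicit.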
Diagnostics Catalog
Below is a catalog of diagnostics with a brief description of what each check covers.
ACL Denied
Detects access denials by ACLs (principal/resource/operation/API); verify privileges.
Agent Backpressure
Detects agent backpressure events indicating overload; may increase latency/throttling.
Agent Binpacked
Flags agents allocated ≤60% of instance vCPUs (bin-packed), which can hurt performance; use dedicated capacity.
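The 60% rule can be checked by hand for a planned deployment. The helper below is a minimal sketch of that arithmetic; the function name and parameters are hypothetical, and how the diagnostic itself measures allocation is not described here.

```python
def is_binpacked(agent_vcpus, instance_vcpus, threshold=0.60):
    """True when the agent is allocated at most `threshold` of the instance's vCPUs."""
    if instance_vcpus <= 0:
        raise ValueError("instance_vcpus must be positive")
    return agent_vcpus / instance_vcpus <= threshold


print(is_binpacked(4, 8))  # agent gets 50% of the instance
print(is_binpacked(7, 8))  # agent gets 87.5% of the instance
```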
Agent Connection Limit
Detects agents approaching their maximum connection limits; scale cores/agents or increase limits.
Agent Version
Detects agents running older versions; recommends upgrading to stay within supported and optimized releases.
Cluster Load per AZ
Checks average and P90 CPU load per Availability Zone to identify imbalanced or overloaded AZs.
Cluster Load per Group Role
Checks average and P90 CPU load per agent group/role combination to spot hotspots.
Cluster Total Load
Monitors average and P90 agent CPU load to keep usage below recommended thresholds.
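The load checks above all rely on an average and a P90 of agent CPU load. The sketch below shows one way to compute both from a list of samples. It uses the nearest-rank percentile method, which is an assumption; the exact method and the recommended thresholds are not stated here. The sample values are invented.

```python
def average(samples):
    """Arithmetic mean of the load samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(samples) / len(samples)


def p90_nearest_rank(samples):
    """90th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = -(-90 * len(ordered) // 100)  # ceil(0.9 * n) using integer math
    return ordered[rank - 1]


# Hypothetical per-agent CPU load samples, expressed as fractions of one core set.
loads = [0.20, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.70, 0.80, 0.90]
print(round(average(loads), 2), p90_nearest_rank(loads))
```

A gap between the two numbers, as in this sample, suggests that a few agents are much busier than the rest even when the average looks comfortable.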
Compacted topics with finite retention
Detects compacted‑only topics configured with finite retention; suggests infinite retention or compact,delete.
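A topic configuration can be checked for this condition before it is applied. The sketch below assumes standard Kafka topic configs (cleanup.policy and retention.ms, where -1 means infinite retention) and falls back to Apache Kafka's defaults when a key is missing; the cluster's own defaults may differ, and the diagnostic's exact logic is not described here.

```python
def compacted_with_finite_retention(topic_config):
    """True for a compact-only topic whose retention.ms is not infinite (-1)."""
    raw_policy = topic_config.get("cleanup.policy", "delete")  # Kafka default
    policies = {p.strip() for p in raw_policy.split(",")}
    if policies != {"compact"}:
        return False  # delete-only or compact,delete topics are not affected
    retention_ms = int(topic_config.get("retention.ms", "604800000"))  # Kafka default: 7 days
    return retention_ms != -1


print(compacted_with_finite_retention({"cleanup.policy": "compact", "retention.ms": "86400000"}))
print(compacted_with_finite_retention({"cleanup.policy": "compact", "retention.ms": "-1"}))
print(compacted_with_finite_retention({"cleanup.policy": "compact,delete", "retention.ms": "86400000"}))
```

Only the first configuration is flagged; the other two correspond to the two suggested fixes.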
Consumer Groups Configs
Detects risky consumer group timeouts (rebalance/session/heartbeat) that can trigger rebalances.
Consumer Groups Memory Limit
Warns when consumer groups memory usage nears configured limits.
Control Plane Connectivity
Detects agents unable to reach the control plane for sustained periods.
Control Plane Utilization (Global)
Monitors overall control plane load across all request categories; warns when utilization is high.
Control Plane Utilization (Produce Requests)
Tracks control plane load specifically from produce requests; warns when utilization is high.
Cross AZ Kafka Clients
Detects clients connecting across AZs (missing AZ hints); increases latency and cross‑AZ costs.
Embedded Pipeline
Detects pipelines running on agents that also run other roles; recommends dedicated pipeline agents.
Fetch Requests Size
Detects Kafka clients using too‑small fetch sizes, which increases overhead and reduces throughput.
Fetch Requests Size Limits
Detects client fetch limits that exceed agent limits (global or per‑partition), causing caps and smaller fetches.
Fetch Requests Timeout
Detects too‑small fetch timeouts on clients, which can cause timeouts under larger fetch sizes.
File Sizes
Detects excessively small L0 files given throughput/agents; suggests fewer agents or larger batches.
GOMAXPROCS Environment Variable
Detects agents with GOMAXPROCS unset in constrained environments, which can degrade performance.
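One way to pick a value in a container is to derive it from the CPU limit. The sketch below parses the contents of a cgroup v2 cpu.max file (a quota and a period, or the word max when unlimited) and rounds the limit up to a whole number of cores. Using this as the GOMAXPROCS value, and rounding up rather than down, are assumptions; the value the diagnostic expects is not stated here.

```python
import math


def gomaxprocs_from_cpu_max(cpu_max_contents, host_cpus):
    """Suggest a GOMAXPROCS value from cgroup v2 cpu.max contents."""
    quota, period = cpu_max_contents.split()
    if quota == "max":
        return host_cpus  # no CPU limit set for this cgroup
    limit = math.ceil(int(quota) / int(period))
    # Never suggest fewer than 1 or more than the host actually has.
    return max(1, min(limit, host_cpus))


print(gomaxprocs_from_cpu_max("200000 100000", 16))  # limit of 2 CPUs
print(gomaxprocs_from_cpu_max("max 100000", 16))     # unlimited
```

The result would then be exported as the GOMAXPROCS environment variable in the agent's container spec or startup script.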
Instance Networking
Detects non network‑optimized instance types; recommends network‑optimized for better throughput/latency.
Inter‑Agent Connectivity
Detects agents that cannot reach peers in the same AZ/group; harms distributed cache efficiency and increases costs.
Invalid Role Request
Detects clients sending produce/fetch to agents with the wrong role (role‑split deployments).
Kafka Client Version
Detects clients using old API versions; upgrade to modern Kafka client versions.
Known Bad Agent Versions
Detects agents running versions with known issues; recommends upgrading beyond affected range(s).
Known Bad Kafka Clients
Detects client libraries/versions with known issues or poor idempotent performance.
L0 Compaction Lag
Detects excessive L0 file count, indicating compactions are falling behind.
L1 Compaction Lag
Detects excessive L1 file count, indicating compactions are falling behind.
Mixed Agent Versions
Detects multiple agent versions co‑existing for too long; standardize versions across agents.
Missing Roles
Ensures at least one agent runs critical roles: jobs, proxy-produce, and proxy-consume.
Orbit Invalid Write to Topic
Detects client writes to topics still managed by Orbit; set irreversible disable flag before producing directly.
Partitions Limit
Warns when approaching the cluster partitions limit.
Pipeline Not Runnable
Pipelines are running but there are no agents with the pipelines role in the relevant agent group(s).
Pipelines Missing Roles
Ensures each running pipeline group has agents with proxy-produce and proxy-consume in corresponding agent groups.
Pipelines Shared Rate Limit Exceeded
Detects Bento pipelines exceeding the per‑agent shared rate limit for pipelines throughput.
Produce Batches
Warns when batches per second are very high, indicating suboptimal client batching/idempotence.
Produce Batches vs Throughput
Detects too many batches relative to bytes/s (small average batch size).
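The ratio behind this check is simply throughput divided by batch rate. The sketch below computes the average batch size and compares it to a floor; MIN_AVG_BATCH_BYTES is a made-up value for illustration, since the diagnostic's actual threshold is not given here.

```python
# Hypothetical floor for illustration; not the diagnostic's real threshold.
MIN_AVG_BATCH_BYTES = 16 * 1024


def average_batch_bytes(bytes_per_sec, batches_per_sec):
    """Average batch size in bytes, or None when nothing is being produced."""
    if batches_per_sec <= 0:
        return None
    return bytes_per_sec / batches_per_sec


def batches_too_small(bytes_per_sec, batches_per_sec, floor=MIN_AVG_BATCH_BYTES):
    """True when the average batch is smaller than the chosen floor."""
    avg = average_batch_bytes(bytes_per_sec, batches_per_sec)
    return avg is not None and avg < floor


# 10 MiB/s spread over 20,000 batches/s is roughly 524 bytes per batch.
print(average_batch_bytes(10 * 1024 * 1024, 20_000))
print(batches_too_small(10 * 1024 * 1024, 20_000))
```

On the client side, the usual levers for raising this number are the producer's batching settings, such as a longer linger time or a larger maximum batch size.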
Produce Records Too Large
Detects produce attempts with records exceeding configured size limits.
Storage Bucket Throttled
Detects recent object storage throttling events from the cloud provider.
Stuck Consumer
Detects consumers that have not progressed for a while (lagging at a fixed offset).
Tableflow Records Skipped
Detects records skipped during Tableflow ingestion due to schema mismatches.
Tableflow tables Limit
Warns when approaching the Tableflow tables limit in a Datalake cluster.
TLS Mismatch
Detects TLS‑enabled clients attempting TLS handshakes against non‑TLS agents; indicates config mismatch.
Topics Limit
Warns when approaching the cluster topics limit (active and total/soft‑deleted).