Set up Monitoring
Diagnostics
First things first: you're not alone! The WarpStream team constantly monitors your cluster, and when we find anomalies we create Diagnostics that alert you proactively in our UI or through our Hosted Prometheus Endpoint. Diagnostics cover the most common problems we have found in operation, and we are constantly adding to the catalog of available diagnostics. If you need more granular detail, keep reading.
Logging
By default, the WarpStream Agent runs with log level info. You can change this with the WARPSTREAM_LOG_LEVEL environment variable. For example, if the info level logs are too noisy for you, set WARPSTREAM_LOG_LEVEL=warn.
The WarpStream Agents have an additional special log level called analytics that can be enabled by setting WARPSTREAM_LOG_LEVEL=analytics. This enables extremely detailed JSON logging that can be loaded into a logging system that supports analytics to slice and dice Agent log events and obtain a deep understanding of the workload. However, this feature emits a lot of logs, so keep that in mind before enabling it.
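As a rough illustration of the "slice and dice" idea, the sketch below counts JSON log events by the value of one field. It assumes the analytics logs arrive as one JSON object per line. The field name `event` and the sample lines are hypothetical and are not taken from the actual Agent log schema, so substitute the fields you see in your own logs.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_field(lines: Iterable[str], field: str) -> Counter:
    """Count JSON log events by the value of a single field."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if isinstance(event, dict) and field in event:
            counts[str(event[field])] += 1
    return counts


# Hypothetical sample lines; real analytics events will have different fields.
sample_lines = [
    '{"level": "analytics", "event": "produce"}',
    '{"level": "analytics", "event": "fetch"}',
    '{"level": "analytics", "event": "produce"}',
    "plain text line that is not JSON",
]

event_counts = count_by_field(sample_lines, "event")
```

For real workloads a logging system with analytics support will scale better than a script, but a helper like this can be handy for a quick look at a captured log file.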
Metrics
The WarpStream Agents expose a standard Prometheus metrics endpoint that most popular tools can scrape. Prometheus metrics are automatically exposed on the Agent "internal port", which by default is port 8080. If you set an explicit port override, you'll need to update the port in your Prometheus scrape configuration as well.
All WarpStream Agent metrics begin with the warpstream_ prefix.
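To see which Agent metrics a scrape returns, one option is to filter the Prometheus text exposition by that prefix. The following is a minimal sketch that works on text already fetched from the metrics endpoint. The metric names in the sample are hypothetical placeholders, not real WarpStream metric names.

```python
import re
from typing import List

# A Prometheus metric name starts each sample line in the text format.
_METRIC_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:]*")


def metric_names_with_prefix(exposition_text: str, prefix: str = "warpstream_") -> List[str]:
    """Return the sorted, de-duplicated metric names that start with prefix."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _METRIC_NAME.match(line)
        if match and match.group(0).startswith(prefix):
            names.add(match.group(0))
    return sorted(names)


# Hypothetical scrape output, for illustration only.
SAMPLE_SCRAPE = """\
# HELP warpstream_example_requests_total Hypothetical counter.
# TYPE warpstream_example_requests_total counter
warpstream_example_requests_total{code="200"} 12
warpstream_example_requests_total{code="500"} 1
warpstream_example_bytes 2048
go_goroutines 42
"""

agent_metric_names = metric_names_with_prefix(SAMPLE_SCRAPE)
```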
Recommended Metrics & Alerting
The WarpStream system is simple by design so there is less to monitor. If you are coming from open source Kafka, this should be a breath of fresh air.
In Important Metrics and Logs you will find everything you need to monitor the Agent and confirm that your cluster is operational. If you would like to see the complete list of metrics, you can fetch it directly from the Agent at $IP:8080/metrics.
Although few alerts are needed, we also provide a Recommended List of Alerts covering the key metrics on which you should configure alerts to detect Agent issues effectively.
Some of the metrics, particularly the consumer group metrics, can become very high cardinality if the cluster contains a lot of topics or partitions. You can learn more in Monitoring Consumer Groups if you need to reduce the cardinality or disable these metrics entirely.
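Before tuning anything, it can help to measure which metrics actually contribute the most series. The sketch below counts sample lines per metric name in a scrape of the metrics endpoint, which approximates the number of label combinations per metric. The metric and label names in the sample are hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple

_SERIES_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:]*")


def top_series_counts(exposition_text: str, limit: int = 5) -> List[Tuple[str, int]]:
    """Return the metric names with the most series, highest count first."""
    counts: Counter = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SERIES_NAME.match(line)
        if match:
            counts[match.group(0)] += 1  # one sample line is one series
    return counts.most_common(limit)


# Hypothetical scrape output: one series per topic/partition pair.
CARDINALITY_SAMPLE = """\
warpstream_example_consumer_lag{topic="orders",partition="0"} 3
warpstream_example_consumer_lag{topic="orders",partition="1"} 0
warpstream_example_consumer_lag{topic="clicks",partition="0"} 7
warpstream_example_uptime_seconds 120
"""

heaviest = top_series_counts(CARDINALITY_SAMPLE, limit=1)
```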
Health Check
The Agent exposes an HTTP health check endpoint at $IP:8080/v1/status. A successful response is the string OK with a 200 status code.
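A small probe for this endpoint might look like the sketch below. It assumes the Agent is reachable over plain HTTP on the default port and tolerates surrounding whitespace in the response body. Both are assumptions of this example, and the function names are illustrative.

```python
import urllib.error
import urllib.request


def is_healthy_response(status_code: int, body: str) -> bool:
    """A response is healthy when it is 200 and the body is the string OK."""
    return status_code == 200 and body.strip() == "OK"


def check_agent_health(host: str, port: int = 8080, timeout: float = 2.0) -> bool:
    """Probe /v1/status; treat any network or HTTP error as unhealthy."""
    url = f"http://{host}:{port}/v1/status"  # assumes plain HTTP
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return is_healthy_response(response.status, body)
    except (urllib.error.URLError, OSError):
        # Covers non-2xx responses, refused connections, and timeouts.
        return False
```

Keeping the response check separate from the network call makes the rule easy to reuse, for example in an orchestrator probe or a test.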