Kubernetes Known Issues
When Running in EKS, the Availability Zone is Unset or Wrong
Symptom
In the WarpStream UI for the cluster, you see warpstream-unset-az set as the availability zone of the agent, and/or errors in the agent logs similar to the following:
{"time":"2025-04-02T22:23:46.467567362Z","level":"ERROR","msg":"failed to determine availability zone","git_commit":"32d51900b2423718b692a0edd29b08b11b7dd74e","git_time":"2025-04-02T18:53:04Z","git_modified":false,"go_os":"linux","go_arch":"arm64","process_generation":"081c0596-25c3-4147-88d5-d4416cb6a998","hostname_fqdn":"warp-agent-default-67d9795854-wrwh8","hostname_short":"warp-agent-default-67d9795854-wrwh8","private_ips":["10.0.115.97"],"num_vcpus":3,"kafka_enabled":true,"virtual_cluster_id":"vci_bc62be92_d3ba_4b0c_90e8_4e7bc621a693","module":"agent_azloader","error":{"message":"awsECSErr: missing metadata uri in environment (ECS_CONTAINER_METADATA_URI_V4), likely not running in ECS\nawsEC2Err: error getting metadata: operation error ec2imds: GetMetadata, canceled, context deadline exceeded\ngcpErr: error getting availablity zone: \nazureErr: error getting location: \nk8sErr: unable to get node information: nodes \"i-025487767185742f1\" is forbidden: User \"system:serviceaccount:warpstream:warpstream0-agent\" cannot get resource \"nodes\" in API group \"\" at the cluster scope"}}
Context
The WarpStream Agents use several methods to determine which availability zone they are running in. When an Agent can't determine its availability zone, it falls back to warpstream-unset-az and logs error messages.
Problem
By default, AWS prevents EKS pods from reaching the EC2 instance metadata service to avoid leaking instance metadata. While this is good security practice for normal instances, it also prevents services running within EKS from querying information about the instance they run on.
Solution
Option A
Use our Helm Chart to deploy WarpStream. With its default configuration, the Helm chart creates a Kubernetes ClusterRole and ClusterRoleBinding that allow the WarpStream pods to look up the node they are running on via the Kubernetes API and read the availability zone from the node's labels.
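For reference, the zone the Agent reports comes from the node's well-known topology label. A correctly labeled node looks roughly like the sketch below (the node name and zone values are illustrative, not taken from your cluster):

# Illustrative Node metadata: the Agent reads the zone from the
# well-known topology.kubernetes.io/zone label.
apiVersion: v1
kind: Node
metadata:
  name: i-025487767185742f1
  labels:
    topology.kubernetes.io/region: us-east-1   # example region
    topology.kubernetes.io/zone: us-east-1a    # value reported as the agent's AZ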
Option B
Create the appropriate ClusterRole, ClusterRoleBinding, and ServiceAccount so the WarpStream Agent can get availability zone information from the Kubernetes API:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: warpstream-agent
  namespace: ${your-namespace}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: warpstream-agent
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - watch
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: warpstream-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: warpstream-agent
subjects:
  - kind: ServiceAccount
    name: warpstream-agent
    namespace: ${your-namespace}
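To confirm the permissions took effect, you can ask the API server whether the service account may read nodes, either with kubectl auth can-i get nodes --as=system:serviceaccount:${your-namespace}:warpstream-agent, or by creating a SubjectAccessReview like the hedged sketch below and checking that the returned status shows allowed: true:

# Verification sketch; substitute ${your-namespace} as above.
# Create with: kubectl create -f sar.yaml -o yaml
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: system:serviceaccount:${your-namespace}:warpstream-agent
  resourceAttributes:
    group: ""        # core API group
    resource: nodes
    verb: get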
Then, on your WarpStream Deployment, set the pod's service account to warpstream-agent.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: warpstream-agent
  namespace: ${your-namespace}
spec:
  selector:
    matchLabels:
      app.kubernetes.io/app: warpstream-agent
  template:
    metadata:
      labels:
        app.kubernetes.io/app: warpstream-agent
    spec:
      containers:
        - args:
            - agent
            ...
          image: public.ecr.aws/warpstream-labs/warpstream_agent:latest
          ...
      serviceAccountName: warpstream-agent
Option C
Modify your EKS node launch template configuration to set http-put-response-hop-limit to 2. This allows pods running on an EKS instance to reach the AWS instance metadata service and look up the availability zone.
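For example, if the node launch template is managed with CloudFormation, the hop limit lives under the launch template's MetadataOptions. The fragment below is a hedged sketch rather than a complete template; the resource name is hypothetical and the rest of LaunchTemplateData stands in for your existing configuration.

# Hypothetical CloudFormation fragment; only MetadataOptions is the relevant part.
Resources:
  WarpStreamNodeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        MetadataOptions:
          HttpEndpoint: enabled
          HttpTokens: required            # keep IMDSv2 enforced
          HttpPutResponseHopLimit: 2      # allow one extra hop so pods can reach IMDS
        # ... the rest of your existing LaunchTemplateData ...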
When running in Kubernetes, WarpStream pods end up in the same zone or on the same node
Symptom
Some or all of your WarpStream pods end up running in the same availability zone or on the same Kubernetes node instead of being evenly spread out.
Context
When running workloads, Kubernetes tries its best to make sure pods from the same deployment are evenly spread across all nodes and availability zones; however, this isn't always possible.
Problem
Depending on your Kubernetes cluster configuration and the other workloads on the cluster, Kubernetes may not deploy WarpStream pods evenly across zones or nodes. Some Kubernetes deployments prioritize bin-packing rather than high availability of workloads. This varies by Kubernetes distribution and is not always configurable.
Solution
Use Kubernetes topologySpreadConstraints and podAntiAffinity to force Kubernetes to spread WarpStream pods evenly across zones and nodes. If you deploy WarpStream with our Helm Chart, you can set the following in your Helm values:
topologySpreadConstraints:
  # Try to spread pods across multiple zones
  - maxSkew: 1 # +/- one pod per zone
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    # minDomains is only available in Kubernetes 1.30+
    # Remove this field if you are on an older Kubernetes
    # version.
    # When possible set to the number of available
    # availability zones in your cluster.
    minDomains: 3
    # Label Selector to select the warpstream deployment
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: warpstream-agent
        app.kubernetes.io/instance: warpstream-agent # Set to your helm release name
affinity:
  # Make sure pods are not scheduled on the same node to prevent bin packing
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      # Label Selector to select the warpstream deployment
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: warpstream-agent
            app.kubernetes.io/instance: warpstream-agent # Set to your helm release name
        topologyKey: kubernetes.io/hostname
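If you are not deploying with the Helm chart, the same constraints can be set directly on the pod template of your Deployment. The sketch below assumes the app.kubernetes.io/app label used in the Deployment example earlier on this page; adjust the selectors to match your own labels.

# Hedged sketch: equivalent settings under spec.template.spec of a plain Deployment.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/app: warpstream-agent
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/app: warpstream-agent
              topologyKey: kubernetes.io/hostname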
When an IP is reused by another agent's pod
Symptom
Kafka requests will either fail or target the wrong Kafka cluster. For instance, a produce request aimed at cluster A could end up being processed by cluster B, and the data will never be visible from cluster A, resulting in data loss.
Context
If you deploy multiple WarpStream Agent Kubernetes deployments in the same VPC, it is entirely possible that the IP of an agent that is going away (for instance during a scale-down) gets reused by another agent that is starting up. This new agent does not necessarily belong to the same Kubernetes deployment, nor is it connected to the same WarpStream cluster.
Let's consider the following scenario, with both clusters A and B deployed in the same VPC:
1. An agent with IP 10.0.104.73, belonging to a Kubernetes deployment connected to cluster A, shuts down.
2. Shortly after, a new pod starts in a Kubernetes deployment connected to cluster B, and Kubernetes allocates the same IP to it.
3. A Kafka application configured to connect to cluster A still has 10.0.104.73 in its DNS cache and opens a new connection to it to send a produce request.
4. The connection is established fine, but the agent receiving the request actually belongs to cluster B. Auto-topic creation is on, so it simply creates the topic, and the produce request is processed and acknowledged.
5. The Kafka application receives an ack and is happy; it keeps the connection open and sends more requests through it.
6. If the same application consumes data from that topic on cluster A, it will never see the produced records.
Solution
Agent-to-agent communication is already protected against this. This kind of communication only happens on the read path (more info in this blog post), and agents reject any request whose target virtual cluster, sent along with each internal HTTP request, does not match their own.
However, there is nothing built into the Kafka protocol for this, and it requires client participation to be totally safe. The most straightforward way to deal with it is to use SASL credentials: since those are unique across clusters, if a Kafka client tries to connect to an IP thinking it still belongs to cluster A, an agent belonging to cluster B will reject the connection, and the client will retry on another IP.
Alternatively, enabling TLS also completely mitigates the issue: the SSL certificate won't match, so the client will get an SSL error, which should cause it to retry.
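As an illustration, a Kafka client configured for SASL over TLS fails fast if it reaches an agent from the wrong cluster, because its credentials won't be accepted there. The snippet below is a hedged sketch using standard Java-client properties stored in a hypothetical ConfigMap; the bootstrap URL, SASL mechanism (PLAIN is assumed), and credentials are placeholders for your own values.

# Hypothetical ConfigMap holding standard Kafka client properties; all values are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-client-config
data:
  client.properties: |
    bootstrap.servers=REPLACE_WITH_BOOTSTRAP_URL:9092
    security.protocol=SASL_SSL
    sasl.mechanism=PLAIN
    # Credentials are unique per cluster, so an agent from another cluster that
    # reuses the IP will reject the authentication and the client will retry.
    sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
      username="REPLACE_WITH_USERNAME" \
      password="REPLACE_WITH_PASSWORD";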