
Service Discovery

This page explains how WarpStream's custom service discovery mechanism works, as well as how it interacts with Kafka's service discovery system.
The "Hacking the Kafka Protocol" blog post is a good complement to this documentation page and provides some more context and background on the design decisions that were taken.

Kafka Service Discovery

Before we discuss how WarpStream's service discovery system works, let's first go over the service discovery mechanism in Kafka. Kafka clients are usually instantiated with a list of URLs that identify a set of "bootstrap" Kafka brokers. The URLs can be raw IP addresses, or they can be hostnames that are resolved by the client to IP addresses via DNS.
Once the client has connected to the bootstrap brokers, it will issue a Metadata request for the cluster. The response to this request will contain a list of which brokers are in the cluster, what their hostname / IP addresses are, what port they're available on, which rack (or AZ) they're running in, and which topic-partitions they're the leader for.
The client will use this metadata to establish connections to all the other Kafka brokers in the cluster that it needs to communicate with. The client will always try to communicate with the leader of a given topic-partition when it's producing data, and one of the replicas (depending on how it's configured) when it's consuming data.
In addition, if the client is using Kafka's consumer group functionality, it will also make a FindCoordinator request to the cluster to determine which Kafka broker is the "group coordinator" for its consumer group, and then the client will establish a connection to that broker for the purposes of participating in the consumer group protocol.
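The metadata-driven flow above can be sketched with a simplified model. The types and field names below are illustrative, not the actual Kafka wire format; they just show how a client maps a topic-partition's leader back to a broker address from a Metadata response:

```python
from dataclasses import dataclass

@dataclass
class Broker:
    node_id: int
    host: str
    port: int
    rack: str  # availability zone, e.g. "us-east-1a"

@dataclass
class PartitionMetadata:
    topic: str
    partition: int
    leader_id: int        # broker a producer must write to
    replica_ids: list     # brokers a consumer may fetch from

def broker_for_produce(brokers, partitions, topic, partition):
    """Find the broker a producer must connect to for a topic-partition."""
    leader_id = next(p.leader_id for p in partitions
                     if p.topic == topic and p.partition == partition)
    return next(b for b in brokers if b.node_id == leader_id)

# Hypothetical Metadata response for a 2-broker cluster.
brokers = [Broker(1, "kafka-1.internal", 9092, "us-east-1a"),
           Broker(2, "kafka-2.internal", 9092, "us-east-1b")]
partitions = [PartitionMetadata("orders", 0, leader_id=2, replica_ids=[2, 1])]

leader = broker_for_produce(brokers, partitions, "orders", 0)
```

A real client repeats this lookup whenever its cached metadata goes stale, which is how it reacts to leadership changes.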

Mapping WarpStream Service Discovery to Kafka Service Discovery

Kafka brokers are inherently stateful. Therefore, the focus of the Kafka service discovery system is to ensure that clients connect to the brokers responsible for the topic-partitions that they want to produce to or fetch from, as well as regularly fetching updated metadata to react to changes in topic-partition leadership.
WarpStream Agents, on the other hand, are completely stateless, and any WarpStream Agent can service any produce or fetch request. Therefore, the focus of the WarpStream service discovery system is different from that of Kafka's. The WarpStream service discovery system has exactly two goals:
  1. Keep traffic / load as evenly balanced across the WarpStream Agents as possible.
  2. Keep communication between Kafka clients and WarpStream Agents zone-local as much as possible to minimize inter-zone networking costs.
WarpStream service discovery begins the moment the Agents are deployed. The Agents will use cloud-specific APIs to try to automatically discover which availability zone they're running in (this currently works in AWS, GCP, Azure, and Fly.io) and then publish this information along with their internal IP address to WarpStream's service discovery system.
A WarpStream cluster evenly distributed across 3 availability zones in us-east-1.
You can override the availability zone the Agents report to the discovery system using the WARPSTREAM_AVAILABILITY_ZONE environment variable.
See the configuration child page for more details on how to override the default values used for service discovery.
Next, the Kafka clients in your applications need to be configured with a new set of bootstrap URLs. In the general case, you can just use one of the zone-specific magic URLs provided by our admin console.
For example, if your client application is running in us-east-1a, then you could use the api-XXXXXXXXX.us-east-1a.discovery.prod-z.us-east-1.warpstream.com:9092 URL to ensure your clients only ever communicate with WarpStream Agents running in us-east-1a.
If you don't care about inter-zone networking costs, then you could use the non-AZ-specific URL instead: api-XXXXXXXXXX.discovery.prod-z.us-east-1.warpstream.com:9092.
In addition to specifying an appropriate WarpStream magic URL, you also need to encode the availability zone of your application into your Kafka client's "client ID". See our documentation on "Configuring your Apache Kafka Client for WarpStream" for more details on how to do that.
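As a rough illustration of the idea, a helper like the one below could build the client ID. The comma-separated `warpstream_az=` tag shown here is illustrative; consult the "Configuring your Apache Kafka Client for WarpStream" page for the exact convention:

```python
def warpstream_client_id(base_id: str, availability_zone: str) -> str:
    """Append the client's availability zone to its Kafka client ID so the
    WarpStream discovery system can route it to a zone-local Agent.

    NOTE: the key=value tag format below is an illustration; the exact
    encoding is defined in the WarpStream client configuration docs.
    """
    return f"{base_id},warpstream_az={availability_zone}"

client_id = warpstream_client_id("billing-service", "us-east-1a")
```

The resulting string is then passed as the client ID when constructing your Kafka producer or consumer.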
Regardless, once you've configured a magic URL and client ID in your Kafka client, service discovery will proceed similarly to how it does in traditional Kafka. First, the client will use DNS to resolve the WarpStream magic URL to an IP address. The WarpStream DNS server will return a WarpStream Agent IP address in the zone specified by the magic URL.
Next, the client will issue a Metadata request to the WarpStream Agent identified in the previous step. The Agent will proxy the Metadata request to the WarpStream discovery service, which will return the hostname / IP address of an Agent running in the same availability zone as your client, provided you've properly encoded your application's availability zone into the Kafka client's client ID. If you haven't, it will return the hostname / IP address of an Agent running in the same availability zone as the Agent that proxied the request. This is how WarpStream "injects" zone awareness into the Kafka protocol, even for messages like Produce which don't support "rack awareness" in the traditional Kafka protocol.
This approach also solves for load balancing. Proxying the Metadata request to the WarpStream service discovery system provides the service discovery system with an opportunity to perform load balancing. Currently, it does this using a naive (but effective) global round-robin balancing algorithm, although we will make the algorithm more nuanced in the future.
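A minimal sketch of what such zone-aware, round-robin Agent selection could look like follows. This is hypothetical, simplified code; the real discovery service is more involved:

```python
class ZoneAwareBalancer:
    """Round-robin over Agents, preferring the caller's availability zone."""

    def __init__(self, agents):
        # agents: list of (hostname, zone) tuples the Agents published
        # to the discovery system.
        self._agents = agents
        self._counters = {}

    def pick(self, client_zone):
        # Prefer Agents in the client's zone; fall back to all Agents
        # if none are available there.
        candidates = [a for a in self._agents if a[1] == client_zone]
        if not candidates:
            candidates = self._agents
        # Naive per-zone round-robin counter.
        i = self._counters.get(client_zone, 0)
        self._counters[client_zone] = i + 1
        return candidates[i % len(candidates)]

agents = [("agent-1.internal", "us-east-1a"),
          ("agent-2.internal", "us-east-1a"),
          ("agent-3.internal", "us-east-1b")]
lb = ZoneAwareBalancer(agents)
first = lb.pick("us-east-1a")
second = lb.pick("us-east-1a")
```

Successive calls cycle through the zone-local Agents, spreading connections evenly across them.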
WarpStream takes a slightly different approach for the FindCoordinator API. The amount of communication between the Kafka clients and the "group coordinator" is minimal and has almost no impact on load balancing or inter-zone networking. Because of that, we simply return a magic URL in the FindCoordinator response and let the Kafka client perform DNS resolution on its own.

Zone Aware Magic URLs

WarpStream's Magic URLs allow you to configure clients that are producing and consuming data to never communicate with WarpStream Agents in a different availability zone. This prevents you from experiencing the high cost of cross-AZ data transfer. Agents store data in object storage, which handles replication for you automatically, so communicating with Agents in another AZ is never required and would incur excessive inter-AZ charges for no additional value.
Magic URLs are used in place of the bootstrap servers for Apache Kafka or the Amazon Kinesis-compatible endpoint. They are provided on the details page for each Agent Pool and look something like this:
api_xyz123.$AVAILABILITY_ZONE.discovery.$CELL.us-east-1.warpstream.com
The above Magic URL is just an example. Do not try to construct the Magic URL from the template above by itself. Please view the details page for your Agent Pool to find your specific Magic URL template.
You should replace $AVAILABILITY_ZONE with the availability zone of your client. Each cloud provider has a metadata endpoint you can query over HTTP to determine your client's availability zone. $CELL will be provided in the template Magic URL on the details page for your Agent Pool.
Here are the APIs you can use to determine your availability zone in a few major cloud providers:
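As a sketch, the well-known instance metadata endpoints for AWS, GCP, and Azure can be queried like this. Note that AWS IMDSv2 additionally requires a session token, omitted here for brevity, and GCP returns a full path (e.g. projects/NNN/zones/us-east1-b) rather than a bare zone name:

```python
import urllib.request

# Well-known instance metadata endpoints for discovering the local
# availability zone. Each entry is (URL, extra request headers).
METADATA_ENDPOINTS = {
    "aws": ("http://169.254.169.254/latest/meta-data/placement/availability-zone",
            {}),
    "gcp": ("http://metadata.google.internal/computeMetadata/v1/instance/zone",
            {"Metadata-Flavor": "Google"}),
    "azure": ("http://169.254.169.254/metadata/instance/compute/zone"
              "?api-version=2021-02-01&format=text",
              {"Metadata": "true"}),
}

def availability_zone(provider: str, timeout: float = 2.0) -> str:
    """Query the provider's metadata service from inside the instance."""
    url, headers = METADATA_ENDPOINTS[provider]
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode().strip()
```

These endpoints are only reachable from inside an instance running on the respective cloud provider.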
Don't forget that you need to encode your client's availability zone in your Kafka Client's "client id" in addition to the magic URL.

Bypassing WarpStream's Magic URLs

If you can't use WarpStream's magic URLs for some reason, you can still use WarpStream. However, if you still want the benefits of zone-aware routing then you'll need to ensure that:
  1. The Agents are able to report their correct availability zone to the WarpStream discovery system (either via their built-in support, or by setting the WARPSTREAM_AVAILABILITY_ZONE environment variable).
  2. Whatever alternative service discovery mechanism you use to enable your clients to discover the WarpStream Agents for bootstrapping purposes is zone aware. This means that it must route the client's initial bootstrap requests to a WarpStream Agent running in the same availability zone as the client. The simplest way to accomplish this is to have a dedicated service discovery endpoint / DNS record for every availability zone.
  3. You must set the WARPSTREAM_DISCOVERY_KAFKA_HOSTNAME_OVERRIDE environment variable on the Agents to a unique value that is routable within your VPC, but also uniquely identifies a specific Agent. For example, the Agent's internal IP address or EC2 hostname's DNS record would work fine. Overriding this value is important, even if the Agents are able to correctly discover their own IP address, because it signals to the WarpStream discovery system that the caller is "managing their own hostnames" and will make the FindCoordinator API return one of the Agent's hostnames instead of a generic WarpStream Magic URL like it normally would.
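Putting those requirements together, the Agent-side configuration could look something like this. The values are illustrative; the hostname override must resolve within your VPC and uniquely identify each Agent:

```shell
# Report the Agent's zone explicitly if automatic detection isn't available.
export WARPSTREAM_AVAILABILITY_ZONE="us-east-1a"

# Advertise a VPC-routable hostname that uniquely identifies this Agent.
# Setting this also makes FindCoordinator return Agent hostnames instead
# of a generic WarpStream Magic URL.
export WARPSTREAM_DISCOVERY_KAFKA_HOSTNAME_OVERRIDE="ip-10-0-12-34.ec2.internal"
```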
Point #2 is subtle, but important. The Kafka protocol has no general-purpose mechanism for a client to announce what availability zone (or "rack" in Kafka terminology) it's running in. Clients can only specify their "rack" when performing Fetch requests, but not Produce requests.
Therefore, the WarpStream discovery system has to "infer" what availability zone the client is running in based on the availability zone of the Agent that it's connected to.
Apache, Apache Kafka, Kafka, and associated open source project names are trademarks of the Apache Software Foundation. Kinesis is a trademark of Amazon Web Services.