Ripcord

Ripcord enables WarpStream Agents to continue ingesting data and accepting Produce requests even when the control plane is unavailable.

Overview

Ripcord is a special mode that can be enabled on the Agents that makes them resilient to control plane unavailability for ingesting data and processing Produce requests.

This means that when the WarpStream Agents are running in Ripcord mode, they can continue ingesting data and processing Produce requests without interruption. Consumers will still become unavailable until control plane availability is restored.

Ripcord behaves very similarly to lightning topics: the Agents skip committing data to the control plane in the critical path of a produce request. Instead, they journal produce requests to object storage, and then commit them to the control plane asynchronously. This reduces Produce request latency as a side-effect, but the primary purpose is to eliminate the control plane as a critical path dependency for Producing.

While Ripcord mode is very similar to just converting all of the topics in a cluster to lightning topics, it also makes a few additional tweaks in the Agent to make them more resilient to control plane unavailability. For example, in Ripcord mode, the Agents will configure their topic metadata caches to more strongly favor availability over consistency. This means that newly created topics may take longer to become available for Produce requests.

Ripcord Agents provide the exact same durability guarantees as regular Agents: any acknowledged Produce request is guaranteed to be durable and (eventually) consumable. However, acknowledged data is not immediately visible to consumers due to the async commit which has a few broader implications. See the caveats sections for more details.

Request Flow

Producing records without Ripcord

Kafka clients connect to the WarpStream Agents in order to produce data to specific topics. When a client produces records, three steps are normally required.

The agent writes the record data to a new file in object storage.
The agent registers the address of the new file with the control plane and the control plane responds with the offsets that were assigned to the new records.
The agent forwards this information to the client in response so the client's Produce request. With this the client can acknowledge that the records have been safely persisted to the topic and at which offset.

For #2 the agent must be able to reach the WarpSteam control plane over the Internet. If that connection goes down, e.g. because of a sustained network failure or due to a potential control plane incident, the agent will reject the Produce request by default.

Producing records with Ripcord

Ripcord is a resilience feature enabling agents to process Produce requests without a working connection to the control plane.

In Ripcord mode, the agents continuously journals files to your object storage bucket, and data is being replayed asynchronously from this journal and placed inside the topics.

It works as follows:

The agent writes the record data to a new file in object storage
In response to the Produce request, the agent notifies the Kafka client that the new records have safely been journaled to object storage, but that it does not know at which offset they will be inserted yet. The unknown offset is represented as offset 0.
Asynchronously, the agent notifies the control plane about the journaled files so that the records can be assigned their respective offsets.

In the case where the control plane is momentarily unreachable, the journal grows bigger and is replayed when the control plane is reachable again.

The Agent "Ripcord mode"

You need at least v748 of the WarpStream agent to enable ripcord.

If you set -enableRipcord in your agent startup flags (or if you set the WARPSTREAM_ENABLE_RIPCORD environment variable to "true") the control plane will not be in the critical path of your produce requests, for all topics, as explained above.

Note that you if you split your agent with agent roles, you can enable ripcord only on the produce agents. However, it's very important that you upgrade all agents in the virtual cluster to at least v748, not only the produce agents.

Caveats

There are a few caveats that you should be aware of when enabling Ripcord on your agents:

You should not use the application bootstrap URL (the one that looks like api-305fe61f-4074-4e5b-3c791-33d17d4be89a.groupdefault.kafka.discoveryv2.prod-z.us-east-1.warpstream.com:9092 or something similar. This DNS address is resolved by our control plane, and if you lose connectivity to the control plane, your clients will not be able to resolve the agents anymore. If you're running in Kubernetes, the easiest solution is just use the Kubernetes service that is automatically created by the WarpStream chart as the bootstrap URL.
New agents cannot start as long as the control plane is unreachable, nor can existing agents restart. Only agents that are already running can keep serving Produce requests, provided that they have Ripcord enabled.
Without connectivity to the control plane, Ripcord agents can keep producing records but cannot process Fetch requests. Consumers will not be able to make progress until control plane availability is restored.Other than Produce, any request that depends on the control plane will fail. For example topics cannot be created and topic retention periods cannot be modified.
Offsets returned from Produce requests will always be 0. Applications cannot rely on the returned offsets in the ProduceResponse. Almost no applications rely on this, but it's good to be aware of it.
The idempotent producer Kafka feature will not work (it must be disabled on all clients producing to ripcord agents).
Kafka transactions will not work (it must be disabled on all clients producing to ripcord agents).
External consistency is no longer guaranteed. I.E if batch A is produced at time T0 and then a successful acknowledgement is received at T1, a batch B that is produced at T2 is not guaranteed to show up in the log after batch A.

Monitoring

Ripcord files are written to the object storage bucket and processed asynchronously. Developers can monitor the size of the unprocessed files backlog.

First, the warpstream.agent_ripcord_replayed_file metric indicates how many files are currently being asynchronously ingested.

Second, warpstream.agent_ripcord_oldest_replay_age and warpstream.agent_ripcord_outstanding_replays_count report the size of the backlog of replays waiting to be ingested. Since the process that registers files after ingestion is slow, the oldest replay age can typically range from a few minutes to 15 minutes, and it's OK to have 500 outstanding replays, but more probably means your agents are not processing the records fast enough.

Also note that these last two metrics are not emitted when the connectivity to the control plane is lost, so you should monitor the number of replayed file too. If this falls to 0 for agents that have -enableRipcord, and that are receiving traffic, something is wrong.

Testing

To see Ripcord in action, visit your cluster's Cluster Settings page in the admin console and click the Reject Ripcord Agent Connections button. Ripcord agents will not be able to connect to the control plane until the setting is disabled.

Use with caution!

Impact on latency

Enabling ripcord on your agents has no noticeable impact on end to end latency. The reason for this is that, although the records are journaled to S3, the agent starts replaying the journal immediately after finishing to write the file.

However, enabling ripcord lowers the produce latency because the agent responds to the client without needing to wait for the control plane to respond.

PreviousClient Configuration Auto-tuning NextOrbit (Cluster Linking)

Last updated 23 days ago

Was this helpful?

Good evening

hashtagOverview

hashtagRequest Flow

hashtagProducing records without Ripcord

hashtagProducing records with Ripcord

hashtagThe Agent "Ripcord mode"

hashtagCaveats

hashtagMonitoring

hashtagTesting

hashtagImpact on latency