Object Storage Configuration

This page describes how to properly configure object storage for BYOC Agent deployments.

We highly recommend running the WarpStream Agent with a dedicated bucket for isolation; however, this is not strictly required because the WarpStream Agent will only read and write data under the warpstream prefix.

The WarpStream Agent manages all data under the warpstream directory in object storage. It is extremely important that the Agent alone manages this directory; never delete files from the warpstream directory manually. Manually deleting files in the warpstream directory will effectively "brick" a virtual cluster and require that it be recreated from scratch.

Bucket URL Construction

The bucketURL flag is the URL of the object storage bucket that the WarpStream Agent should write to. See below for how to construct it for S3 and S3-compatible object stores.

Note that the WarpStream Agents will automatically write all of their data to a top-level warpstream prefix in the bucket. In addition, each cluster will write its data to a cluster-specific prefix (derived from the cluster ID) within the warpstream prefix so multiple WarpStream clusters can co-exist within the same object storage bucket without issue.

Format: s3://$BUCKET_NAME?region=$BUCKET_REGION

Example: s3://my_warpstream_bucket_123?region=us-east-1
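
For example, a minimal sketch using the same warpstream demo command as the MinIO example further down this page, purely to illustrate passing the flag (the bucket name and region are placeholders):

warpstream demo \
    -bucketURL "s3://my_warpstream_bucket_123?region=us-east-1"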

The WarpStream Agent embeds the official AWS Golang SDK V2, so authentication/authorization with the specified S3 bucket can be handled in any of the expected ways, such as a shared credentials file, environment variables, or simply running the Agents in an environment with an IAM role that has Write/Read/Delete/List permissions on the S3 bucket.
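
For example, one option is to export the standard AWS SDK credential environment variables before starting the Agent; the values below are placeholders, and an IAM role attached to the instance or pod works just as well with no environment variables at all:

# Standard AWS SDK credential environment variables (placeholder values).
export AWS_ACCESS_KEY_ID="AKIAXXXXXXXXXXXXXXXX"
export AWS_SECRET_ACCESS_KEY="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
export AWS_REGION="us-east-1"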

If you want to use an AssumeRole provider to authenticate, you can add the WARPSTREAM_BUCKET_ASSUME_ROLE_ARN_DEFAULT environment variable to your Agent. For example:

WARPSTREAM_BUCKET_ASSUME_ROLE_ARN_DEFAULT=arn:aws:iam::103069001423:role/YourRoleName
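
Concretely, setting it in the shell environment the Agent is started from might look like the sketch below. Note that the base credentials the Agent starts with must be allowed to call sts:AssumeRole on this role, and the role itself needs the bucket permissions described later on this page:

# Placeholder role ARN from the example above; substitute your own role.
export WARPSTREAM_BUCKET_ASSUME_ROLE_ARN_DEFAULT="arn:aws:iam::103069001423:role/YourRoleName"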

S3-compatible Object Stores (MinIO, R2, Oracle Cloud, Tigris, etc.)

If you're using an "S3-compatible" object store that is not actually S3, like MinIO, R2, or Oracle Cloud Object Storage, then you'll need to provide credentials manually as environment variables and force the S3 client to construct the URL using the "path style".

If you have a MinIO docker container running locally on your machine on port 9000, you can run the Agent like this after creating an Access Key in the MinIO UI:

AWS_ACCESS_KEY_ID="wKghTMkQrFqszshHJcop" \
AWS_SECRET_ACCESS_KEY="MpMO9GFMaoIFFYd8cZi5gyk5SAjwleEbkZOSxIXv" \
warpstream demo \
-bucketURL "s3://warpstream-minio-bucket?s3ForcePathStyle=true&endpoint=http://127.0.0.1:9000"

The MinIO team has a more detailed integration guide on their website as well.
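
If you prefer MinIO's mc command-line client over the UI, a quick sketch for registering the local server and creating the bucket used above might look like the following (the alias name and the default minioadmin credentials are assumptions; substitute your own):

# Register the local MinIO server under an alias, then create the bucket.
mc alias set local http://127.0.0.1:9000 minioadmin minioadmin
mc mb local/warpstream-minio-bucket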

Using a Bucket Prefix

If you want the WarpStream Agents to store data in a specific prefix in the bucket, you can add the prefix as a query argument to the bucket URL. The prefix must terminate with a "/". For example:

s3://my_warpstream_bucket_123?region=us-east-1&prefix=my_prefix/
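
As a quick sketch, passing this to the -bucketURL flag works the same way as in the earlier examples (the bucket name, region, and prefix are placeholders):

warpstream demo \
    -bucketURL "s3://my_warpstream_bucket_123?region=us-east-1&prefix=my_prefix/"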

Bucket Configuration

The WarpStream bucket should not have a configured object retention policy. WarpStream will manage the lifecycle of the objects, including deleting objects that have been compacted or have expired due to retention. If you must configure a retention policy on your bucket, make sure it is significantly longer than the longest retention of any topic/stream in any of your Virtual Clusters to avoid data loss.
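
If you want to verify what is currently configured on an existing bucket, the AWS CLI can print its lifecycle rules; below is a quick sketch with a placeholder bucket name (the command returns an error if the bucket has no lifecycle configuration at all):

# Look for Expiration rules; ideally the only rule present is the
# abort-incomplete-multipart-upload rule described below.
aws s3api get-bucket-lifecycle-configuration \
    --bucket my-warpstream-bucket-123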

We recommend configuring a lifecycle policy for cleaning up aborted multi-part uploads. This will prevent failed file uploads from the WarpStream Agent from accumulating in the bucket forever and increasing your storage costs. Below is a sample Terraform configuration for an AWS S3 bucket:

resource "aws_s3_bucket" "warpstream_bucket" {
  bucket = "my-warpstream-bucket-123"

  tags = {
    Name        = "my-warpstream-bucket-123"
    Environment = "staging"
  }
}

resource "aws_s3_bucket_metric" "warpstream_bucket_metrics" {
 bucket = aws_s3_bucket.warpstream_bucket.id
 name   = "EntireBucket"
}

resource "aws_s3_bucket_lifecycle_configuration" "warpstream_bucket_lifecycle" {
  bucket = aws_s3_bucket.warpstream_bucket.id

  # Automatically cancel all multi-part uploads after 7d so we don't accumulate an infinite
  # number of partial uploads.
  rule {
    id     = "7d multi-part"
    status = "Enabled"
    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
  
  # No other lifecycle policy is needed. The WarpStream Agent will automatically clean up and
  # delete expired files.
}

Bucket Permissions

In addition to configuring the WarpStream buckets, you'll also need to make sure the Agent containers have the appropriate permissions to interact with the bucket.

Specifically, the Agents need permission to perform the following operations:

  • PutObject

    • To create new files.

  • GetObject

    • To read existing files.

  • DeleteObject

    • So the Agents can enforce retention and cleanup of pre-compaction files.

  • ListBucket

    • So the Agents can enforce retention and cleanup of pre-compaction files.

Below is an example Terraform configuration for an AWS IAM policy document that provides WarpStream with the appropriate permissions to access a dedicated S3 bucket:

data "aws_iam_policy_document" "warpstream_s3_policy_document" {
  statement {
    sid     = "AllowS3"
    effect  = "Allow"
    actions = [
      "s3:PutObject",
      "s3:GetObject",
      "s3:DeleteObject",
      "s3:ListBucket"
    ]
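    # ListBucket applies to the bucket ARN itself, while the object-level
    # actions (Put/Get/DeleteObject) apply to the objects within it, which is
    # why both resources are listed below.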
    resources = [
      "arn:aws:s3:::my-warpstream-bucket-123",
      "arn:aws:s3:::my-warpstream-bucket-123/*"
    ]
  }
}

Migrating Between Object Storage Buckets

If you need to migrate a WarpStream cluster from one object storage bucket to another, follow these steps:

  1. Make sure that the Agents have permission to perform operations on both the old bucket and the new bucket.

  2. Deploy the Agents with the bucketURL flag set to the new bucket instead of the old one. This will cause the Agents to write all new files (both for ingestion and compaction) to the new bucket while still allowing them to read historical data from the old bucket.

  3. Wait until there are no more data files under the warpstream prefix in the old bucket (one way to check this is shown below).
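
For example, you can list whatever remains under the warpstream prefix in the old bucket with the AWS CLI (the bucket name is a placeholder):

# Once compaction and retention have cycled through all historical data,
# this listing should eventually stop returning data files.
aws s3 ls "s3://old-warpstream-bucket/warpstream/" --recursive | head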
