Security and Privacy Considerations

This page describes the various security and privacy considerations for WarpStream's BYOC deployment model.

Overview

WarpStream's BYOC products address security and privacy considerations by ensuring that raw data written to your data plane clusters never leaves your VPC or object storage buckets. WarpStream does not need access to the raw data, so you receive a sovereign solution with regard to any personal information processed in your data plane.

WarpStream receives metadata about your workloads (described in more detail in the sections below). WarpStream ensures that metadata for your workloads is stored only in the control plane region that you select when your WarpStream cluster is created, and is never replicated or stored in any other region unless you explicitly opt in to a multi-region control plane or transfer your cluster to another control plane.

In the interest of transparency, WarpStream maintains a compliance portal that includes information about our security and compliance practices, including certification reports and detailed information regarding the controls that we have implemented.

In addition to following the best practices and controls documented on our compliance portal, WarpStream also supports Kafka ACLs, as well as SASL/PLAIN and SASL/SCRAM-SHA-512 authentication.
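
For reference, a minimal client-side sketch of connecting with SASL/SCRAM-SHA-512 might look like the following. The bootstrap address, credentials, and the choice of the confluent-kafka Python client are illustrative placeholders, not WarpStream specifics:

```python
# Minimal sketch: connect a Kafka producer to a WarpStream Agent using
# SASL/SCRAM-SHA-512. Requires the confluent-kafka package.
from confluent_kafka import Producer

producer = Producer({
    # Placeholder address: point this at your own Agents' Kafka listener.
    "bootstrap.servers": "warpstream-agent.internal:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-512",
    # Placeholder credentials for a user created on your cluster.
    "sasl.username": "your-username",
    "sasl.password": "your-password",
})

producer.produce("example-topic", value=b"hello")
producer.flush()
```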

Bring Your Own Cloud (BYOC) clusters

Raw data written to WarpStream clusters never leaves your VPC or object storage buckets. The only data that ever leaves your VPC is metadata about your Kafka workloads that is required for the correct functioning of your clusters, which includes the following:

  1. Topic names

  2. Topic metadata (partition counts, configuration, etc.)

  3. File metadata (object store bucket name, compressed size, uncompressed size, etc.)

  4. Record timestamps and offsets (but never record keys or record contents)

  5. Consumer group names, configuration, and offsets

  6. Kafka client IDs

  7. Producer IDs, epochs, and sequence numbers

WarpStream Schema Registry clusters

Raw schemas registered in the BYOC schema registry never leave your VPC or object storage buckets. The only data that leaves your VPC is metadata about your schemas that is necessary for the correct functioning of your schema registry clusters, which includes the following:

  1. Schema metadata: schema data format, schema ID

  2. Schema subject names

  3. Schema subject metadata: schema context name, compatibility rule, subject version, schema ID, and soft-delete status

  4. File metadata: object store bucket name, schema size

  5. Schema reference metadata: subject, subject version

  6. Schema context name

    1. Global configuration: default compatibility rule

WarpStream Tableflow

WarpStream Tableflow stores each table's schema, along with metadata for every file in your data lake, in the control plane. This metadata includes:

  1. Bucket name where files are stored

  2. Fully-qualified path to each file

  3. File size, number of rows, and similar numeric metrics

  4. On an opt-in basis for specific columns, the minimum and maximum value of that column in each file. This is an optional part of the Iceberg and Delta Lake table formats, but it accelerates many kinds of queries against your table; depending on your query patterns, tables are commonly sorted by a timestamp or an opaque internal identifier. (See the sketch after this list.)

  5. For partitioned tables, on an opt-in basis, the partition values. Most tables are partitioned by date, so this is usually not sensitive information.
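
To illustrate why per-column min/max statistics accelerate queries (item 4 above), here is a small, hypothetical sketch of the file pruning a query engine can do with them. The data structures and names are illustrative, not Tableflow's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    min_ts: int  # minimum timestamp value in the file
    max_ts: int  # maximum timestamp value in the file

files = [
    FileStats("s3://bucket/table/f1.parquet", 100, 199),
    FileStats("s3://bucket/table/f2.parquet", 200, 299),
    FileStats("s3://bucket/table/f3.parquet", 300, 399),
]

def prune(files, lo, hi):
    # Keep only files whose [min_ts, max_ts] range overlaps the query
    # range; the rest are skipped without ever being read from object
    # storage.
    return [f for f in files if f.max_ts >= lo and f.min_ts <= hi]

# A query for timestamps 250..320 only has to read f2 and f3.
print([f.path for f in prune(files, 250, 320)])
```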

Tableflow also stores some metadata about the source Kafka cluster:

  1. Topic names

  2. Minimum and maximum offset for each partition

  3. Source data format (Avro, JSON, etc.)

Additionally, Tableflow depends on and stores the Agent Cluster Metadata described below.

Agent Cluster Metadata

  1. Agent Metadata (stored ephemerally in memory, never persisted to disk)

    1. Number of connections (for load balancing)

    2. Number of vCPUs (for determining how many concurrent jobs an Agent can run) and CPU utilization.

    3. Internal/private IP addresses. These addresses are not routable from the internet and are required so that the Agents can cluster with each other within a single availability zone.

    4. Availability zone.

  2. A small sample of the Agent's logs, so that we can help diagnose and debug issues remotely. This can be disabled by setting the -disableLogsCollection flag or the WARPSTREAM_DISABLE_LOGS_COLLECTION=true environment variable; see the sketch after this list. These logs never contain raw data; they only contain things like error messages and high-level statistics.

  3. The Agent's profiling data, so that we can investigate performance degradations remotely. This can be disabled with the -disableProfileForwarding flag or the WARPSTREAM_DISABLE_PROFILE_FORWARDING environment variable. These profiles only contain information about program execution.
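
As a concrete sketch of disabling both of these, the following assumes the Agent is started from a Python wrapper script. The flag and environment variable names come from the list above; the invocation itself is a placeholder for however you normally launch your Agents:

```python
import os
import subprocess

# Disable remote log collection and profile forwarding via environment
# variables (equivalently, pass the -disableLogsCollection and
# -disableProfileForwarding flags to the Agent directly).
env = dict(
    os.environ,
    WARPSTREAM_DISABLE_LOGS_COLLECTION="true",
    WARPSTREAM_DISABLE_PROFILE_FORWARDING="true",
)

# Placeholder invocation: substitute your own Agent launch command
# and its usual flags.
subprocess.run(["warpstream", "agent"], env=env, check=True)
```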
