Small Files

What It Is

This diagnostic monitors the file sizes generated by agents to ensure they align with ingestion throughput.

Generating too many small files can significantly increase storage costs and operational overhead due to:

- Increased PUT and GET Operations: Small files lead to a higher number of storage requests. Since many object storage cost models are based on the number of requests rather than data throughput, this can quickly become expensive.
- Excessive Metadata Management: Each small file generates additional metadata, adding overhead for tracking and managing file information, which increases storage cost in the control plane.
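
To make the request-cost point concrete, here is a rough sketch of the arithmetic. The function name, the 30-day month, and the price of 0.005 per 1,000 PUT requests are illustrative assumptions, not values taken from this diagnostic; substitute your own provider's pricing.

```python
def monthly_put_cost(throughput_bytes_per_s: float,
                     avg_file_bytes: float,
                     price_per_1000_puts: float) -> float:
    """Estimate the monthly PUT request cost for a steady ingestion rate.

    Hypothetical helper: it assumes one PUT request per file and a 30-day month.
    """
    seconds_per_month = 30 * 24 * 3600
    files_per_month = throughput_bytes_per_s * seconds_per_month / avg_file_bytes
    return files_per_month / 1000 * price_per_1000_puts


KIB = 1024
MIB = 1024 * KIB

# Same 10 MiB/s of ingestion, written as 256 KiB files versus 4 MiB files.
# The price per 1,000 PUTs is an assumed example value.
small_files_cost = monthly_put_cost(10 * MIB, 256 * KIB, 0.005)
large_files_cost = monthly_put_cost(10 * MIB, 4 * MIB, 0.005)

print(f"256 KiB files: {small_files_cost:.2f} per month")
print(f"4 MiB files:   {large_files_cost:.2f} per month")
```

Under these assumptions the request cost scales inversely with file size: files that are 16 times smaller produce 16 times as many PUT requests for the same throughput.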

The diagnostic calculates the ideal file size based on:
- Ingestion throughput (data rate).
- Average ingestion file sizes.
- Number of availability zones.
- Number of agents.

Based on all these data points, we check a set of specific scenarios. A precise and complete diagnostic is very hard to implement, because bigger files are more cost-effective but also increase latency. Our strategy is therefore to detect only the most extreme and clear-cut cases.
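
The relationship between throughput, agent count, and file size can be sketched as follows. This is not the diagnostic's actual logic: it assumes traffic is spread evenly across agents and that each agent writes one file per flush interval. The flush interval, the threshold, and all function names are hypothetical, and the number of availability zones is left out for simplicity.

```python
def expected_file_bytes(throughput_bytes_per_s: float,
                        num_agents: int,
                        flush_interval_s: float) -> float:
    """Expected size of each file, assuming evenly spread traffic and
    one file written per agent per flush interval (both assumptions)."""
    per_agent_rate = throughput_bytes_per_s / num_agents
    return per_agent_rate * flush_interval_s


def is_extreme_small_files(throughput_bytes_per_s: float,
                           num_agents: int,
                           flush_interval_s: float,
                           threshold_bytes: float) -> bool:
    """Flag only clear-cut cases: expected file size strictly below a threshold."""
    size = expected_file_bytes(throughput_bytes_per_s, num_agents, flush_interval_s)
    return size < threshold_bytes


KIB = 1024
MIB = 1024 * KIB

# 1 MiB/s spread over 8 agents, flushing every 0.25 s (assumed values).
size_8_agents = expected_file_bytes(1 * MIB, 8, 0.25)
flagged_8 = is_extreme_small_files(1 * MIB, 8, 0.25, threshold_bytes=128 * KIB)

# The same load on 2 agents yields files that are four times larger.
size_2_agents = expected_file_bytes(1 * MIB, 2, 0.25)
flagged_2 = is_extreme_small_files(1 * MIB, 2, 0.25, threshold_bytes=128 * KIB)

print(size_8_agents, flagged_8)
print(size_2_agents, flagged_2)
```

In this simplified model, running fewer agents for the same throughput produces larger files, which is why the trade-off against agent load and latency matters.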

Increase agent sizes or separate roles to dedicate agents to specific tasks.
Sometimes you can't reduce the number of agents because they are already under high load. This usually happens because you are using small instance types, or because the agents are also performing other expensive operations such as serving consumers or running background jobs. In such cases, you can choose a bigger instance type or split the agents into different roles, so that the agents handling produce requests can use as much CPU as possible.

Last updated 3 months ago
