Regardless of whether you choose COPY or Snowpipe, certain foundational best practices will dramatically improve the performance and cost-efficiency of your data ingestion pipelines. Neglecting these principles is one of the most common pitfalls we observe.
File Sizing: The 100-250MB Sweet Spot
File size has a direct impact on loading performance. Snowflake achieves its impressive ingestion speed by parallelizing the process. However, this parallelism is limited by the number of files you are loading.
- Problem with Small Files: Ingesting thousands of tiny files (e.g., a few KB each) creates significant per-file overhead. Snowflake must track metadata and manage a transaction for every file, and Snowpipe additionally bills a small per-file charge on top of compute, which can make this approach surprisingly costly.
- Problem with Huge Files: A single massive file (e.g., 5GB) cannot be split across threads. Snowflake processes it on only a few threads while the rest of the warehouse sits idle, creating a bottleneck.
The optimal size, as recommended in Snowflake’s official data loading guidelines, is between 100MB and 250MB per file (compressed). This range is large enough to minimize overhead but small enough for Snowflake to break the work into parallel threads effectively.
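When Snowflake itself produces the files (for example, unloading a table to a stage for a downstream pipeline), you can target this sweet spot directly with the COPY command’s MAX_FILE_SIZE option. A minimal sketch, with hypothetical stage and table names:

```sql
-- Hypothetical names: unload a table to a stage in roughly 200MB chunks.
-- MAX_FILE_SIZE is an upper bound in bytes; with SINGLE = FALSE (the
-- default), Snowflake splits the output into multiple files.
COPY INTO @my_stage/export/
FROM my_source_table
FILE_FORMAT = (TYPE = PARQUET COMPRESSION = SNAPPY)
MAX_FILE_SIZE = 200000000;  -- ~200MB, inside the 100-250MB range
```

Note that MAX_FILE_SIZE is a ceiling, not a target, so actual file sizes will vary somewhat with the data.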
File Format & Compression: Choose Parquet with Snappy
The format of your files is just as important as their size. For analytics workloads, columnar file formats are vastly superior to row-based formats like CSV or JSON.
We recommend using Apache Parquet. Because it stores data in columns, Snowflake’s query engine can scan only the columns needed for a query, drastically reducing I/O and accelerating performance. Snowflake’s vectorized Parquet scanner is highly optimized for this format. You can learn more about its structure at the Apache Parquet official site.
For compression, Snappy or ZSTD offer the best balance between a good compression ratio and fast decompression speed. While Gzip provides higher compression, the computational cost to decompress it can sometimes slow down the ingestion process.
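Putting the two recommendations together, a typical setup defines a reusable named file format and references it at load time. A sketch with hypothetical object names:

```sql
-- Hypothetical names: define a reusable Parquet + Snappy file format once...
CREATE OR REPLACE FILE FORMAT parquet_snappy_fmt
  TYPE = PARQUET
  COMPRESSION = SNAPPY;

-- ...then reference it when loading from a stage. MATCH_BY_COLUMN_NAME
-- maps Parquet columns to table columns by name rather than position.
COPY INTO analytics.events
FROM @raw_stage/events/
FILE_FORMAT = (FORMAT_NAME = 'parquet_snappy_fmt')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```

A named file format also keeps the configuration in one place, so every pipeline loading this data stays consistent.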
Leveraging Parallelism with a Right-Sized Warehouse
When using the COPY command, the size of your virtual warehouse directly determines the level of parallelism. A larger warehouse has more available threads for processing files.
Here’s a simple rule of thumb: the number of files you load in a batch should be at least (and ideally a multiple of) the number of parallel load threads in your warehouse. Snowflake can load roughly 8 files in parallel per node, so an X-Large warehouse (16 nodes) offers about 128 load threads. To maximize its utilization, you should aim to load at least 128 files (each 100-250MB) in a single COPY run. Loading just a few large files on an X-Large warehouse is a common mistake that leads to wasted credits, as most of the available compute power sits idle.
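In practice, this often means resizing the warehouse to match the batch, running the load, and suspending the warehouse so you stop paying the moment the work is done. A hedged sketch, with hypothetical warehouse, stage, and table names:

```sql
-- Hypothetical sketch: scale up for a large batch load...
ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- ...load enough files to keep every thread busy (PATTERN selects
-- all Parquet files under the stage path)...
COPY INTO analytics.events
FROM @raw_stage/events/
PATTERN = '.*[.]parquet'
FILE_FORMAT = (TYPE = PARQUET);

-- ...then suspend so no credits are burned while the warehouse is idle.
ALTER WAREHOUSE load_wh SUSPEND;
```

Auto-suspend on the warehouse achieves the same goal without the explicit SUSPEND, but an explicit suspend avoids paying for the auto-suspend grace period after a one-off batch.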