CockroachDB’s IMPORT INTO statement is designed to be the fastest way to get data into your cluster, but it’s not magic.
Let’s see it in action. Imagine you have a CSV file, users.csv, with millions of rows you want to load into a users table.
id,username,email
1,alice,alice@example.com
2,bob,bob@example.com
...
Here’s how you’d typically run the import:
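IMPORT INTO users (id, username, email) CSV DATA ('nodelocal://1/path/to/your/users.csv') WITH skip = '1';
This command tells CockroachDB to read the specified CSV file and load its rows into the existing users table. The file location is a URL rather than a bare filesystem path: a nodelocal://1/ URL is resolved relative to the extern directory of node 1, and cloud storage URLs such as s3:// or gs:// work as well. The skip option drops the header row shown in the sample above. The import uses multiple nodes in your cluster to parallelize ingestion, making it significantly faster than row-by-row INSERT statements.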
The core problem IMPORT INTO solves is efficient bulk data loading. When rows are inserted one statement at a time, each INSERT runs as its own implicit transaction and has to be parsed, planned, and committed separately. This incurs substantial overhead for large datasets. IMPORT INTO, on the other hand, runs the whole load as a single bulk job, minimizing per-row transactional overhead and maximizing throughput.
Internally, IMPORT INTO works by splitting the input data into chunks. These chunks are then distributed to various nodes in your cluster. Each node processes its assigned chunks, generating key-value pairs that are written directly to the cluster’s storage layer. CockroachDB’s distributed nature means these writes can happen in parallel across multiple nodes and disks, dramatically accelerating the load.
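The flow described above can be pictured with a toy model in Python. This is only an illustration: the round-robin assignment, the /table/id key format, and the function names are assumptions made for the sketch, not CockroachDB’s actual scheduling or key encoding, and the rows beyond alice and bob are made-up sample data.

```python
from collections import defaultdict


def chunk(rows, chunk_size):
    """Yield consecutive slices of rows, each with at most chunk_size entries."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def to_kv(table, row):
    """Encode one row as a (key, value) pair keyed by table name and primary key.

    The key format is hypothetical and chosen only for readability.
    """
    key = f"/{table}/{row['id']}"
    value = (row["username"], row["email"])
    return key, value


def simulate_import(table, rows, node_ids, chunk_size):
    """Assign chunks to nodes round-robin; return the pairs each node would write."""
    written = defaultdict(list)
    for i, rows_in_chunk in enumerate(chunk(rows, chunk_size)):
        node = node_ids[i % len(node_ids)]
        written[node].extend(to_kv(table, r) for r in rows_in_chunk)
    return dict(written)


rows = [
    {"id": 1, "username": "alice", "email": "alice@example.com"},
    {"id": 2, "username": "bob", "email": "bob@example.com"},
    {"id": 3, "username": "carol", "email": "carol@example.com"},
    {"id": 4, "username": "dave", "email": "dave@example.com"},
    {"id": 5, "username": "erin", "email": "erin@example.com"},
]
result = simulate_import("users", rows, node_ids=[1, 2, 3], chunk_size=2)
for node, pairs in sorted(result.items()):
    print(node, [key for key, _ in pairs])
```

With five rows, chunks of two, and three nodes, each node ends up with a different slice of the data, which is the property that lets the real import write in parallel.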
The primary levers you control are the IMPORT INTO statement itself and the configuration of your CockroachDB cluster. The statement lets you specify the target table, the source data format (CSV, Avro, and others), and options such as the delimiter for CSV files. On the cluster side, the number of nodes, the performance of their disks, the network bandwidth between nodes, and the --max-sql-memory flag each node is started with all affect import speed.
To get an import working correctly, and then quickly, you need to understand the IMPORT INTO options. They are passed in a WITH clause as name = value pairs. For CSV, these include delimiter, comment, nullif, and skip. For example, if your CSV uses a semicolon as a delimiter:
IMPORT INTO users (id, username, email)
CSV DATA ('nodelocal://1/path/to/your/users.csv') WITH delimiter = ';';
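If you generate import statements from a script, a small helper keeps the option syntax consistent. The sketch below is one way to do it in Python, assuming the CSV DATA (...) WITH name = value form that CockroachDB documents. The function names build_import_into and sql_literal are hypothetical, and the table and column names are assumed to be trusted identifiers, so they are not escaped.

```python
def sql_literal(value):
    """Quote a string as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_import_into(table, columns, urls, options=None):
    """Return an IMPORT INTO ... CSV DATA statement as a string.

    table and columns are inserted as-is (assumed trusted); URLs and option
    values are quoted as string literals.
    """
    if not urls:
        raise ValueError("at least one file URL is required")
    stmt = "IMPORT INTO {} ({}) CSV DATA ({})".format(
        table,
        ", ".join(columns),
        ", ".join(sql_literal(u) for u in urls),
    )
    if options:
        # Options are emitted in the order given, as name = 'value' pairs.
        opts = ", ".join(
            f"{name} = {sql_literal(val)}" for name, val in options.items()
        )
        stmt += " WITH " + opts
    return stmt + ";"


statement = build_import_into(
    "users",
    ["id", "username", "email"],
    ["nodelocal://1/path/to/your/users.csv"],
    {"delimiter": ";", "skip": "1"},
)
print(statement)
```

The helper only builds the string. Running it against a cluster is left to whatever SQL client you already use.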
You can also influence how the import is processed. By default, IMPORT INTO spreads the work across the available nodes. If you want more parallelism, split your input into multiple smaller files and list them all in a single IMPORT INTO statement, for example CSV DATA ('nodelocal://1/part1.csv', 'nodelocal://1/part2.csv'), so that different nodes can read different files at the same time. Running several IMPORT INTO statements concurrently against the same table is generally not a workable alternative, because the table is taken offline while an import job is running.
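Splitting the input is easy to script. The following Python sketch breaks a CSV into header-less part files and prints a single IMPORT INTO statement that lists all of them. The split_csv function, the users_part_NNNN.csv naming, and the nodelocal://1/ location are assumptions for illustration; in practice the parts would need to be copied to wherever your cluster can read them, and the third sample row is made up.

```python
import csv
import os
import tempfile


def split_csv(src_path, out_dir, rows_per_file):
    """Split src_path into header-less part files; return (header, part_paths)."""
    if rows_per_file < 1:
        raise ValueError("rows_per_file must be positive")
    part_paths = []
    out = None
    writer = None
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        # The header is dropped from the parts, so no skip option is needed.
        header = next(reader, None)
        for i, row in enumerate(reader):
            if i % rows_per_file == 0:
                if out is not None:
                    out.close()
                name = f"users_part_{len(part_paths):04d}.csv"
                path = os.path.join(out_dir, name)
                part_paths.append(path)
                out = open(path, "w", newline="")
                writer = csv.writer(out, lineterminator="\n")
            writer.writerow(row)
    if out is not None:
        out.close()
    return header, part_paths


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "users.csv")
    with open(src, "w", newline="") as f:
        f.write(
            "id,username,email\n"
            "1,alice,alice@example.com\n"
            "2,bob,bob@example.com\n"
            "3,carol,carol@example.com\n"
        )
    header, parts = split_csv(src, tmp, rows_per_file=2)

    # Count the rows in each part to confirm nothing was lost.
    counts = []
    for p in parts:
        with open(p, newline="") as f:
            counts.append(sum(1 for _ in csv.reader(f)))

    urls = ", ".join(f"'nodelocal://1/{os.path.basename(p)}'" for p in parts)
    statement = f"IMPORT INTO users ({', '.join(header)}) CSV DATA ({urls});"
    print(statement)
```

A reasonable starting point is to produce at least as many part files as there are nodes in the cluster, then adjust based on what you observe.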
The most surprising thing about IMPORT INTO is how much of the usual SQL execution path it bypasses. It doesn’t parse and plan a SQL statement for each row, it doesn’t take row locks the way ordinary writes do (the target table is taken offline for the duration of the import instead), and it batches writes at a very low level. This is why it is so much faster than repeated INSERT statements. It behaves less like a standard SQL query and more like a data loading utility that uses the CockroachDB cluster as its backend storage.
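When you’re importing large amounts of data, especially if the data is already sorted or has a predictable key distribution, CockroachDB will try to optimize the placement of that data. It splits ranges as they grow and rebalances them across nodes automatically, but if an import concentrates a massive amount of data in a narrow part of the keyspace, the distribution can take a while to even out. After the import completes, you can inspect the result with SHOW RANGES FROM TABLE users and, if the data is still unevenly placed, ask the cluster to redistribute it with ALTER TABLE users SCATTER so that query performance does not suffer.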