Efficient imports of data into MySQL (and MariaDB)

There are several ways to import big record sets into MySQL tables. In this article we look at three approaches and at the effect that configuring indices before or after the import has on the total duration.

First the three methods are discussed, then the adding of indices and the swapping of tables, and the article concludes with some benchmark results.

Single record inserts

The naive approach would be to perform a single insert call for each record to be inserted.

INSERT INTO
    person (... columns ...)
VALUES
    (... row ...)

The steps are:

  1. process record (for example: read from CSV)
  2. perform insert query
  3. wait for insert to finish and go to 1

When working with just a few records this method is fine. With more records the individual round trips to the database server start hurting performance. On top of the networking penalty of these round trips the database engine will also lock/unlock the table for each insert and update indices if these are configured.
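
To make those round trips concrete, a minimal sketch of this approach in PHP with PDO could look like the code below. The connection settings, the people.csv file and the person columns are placeholders, not the actual benchmark code.

<?php
// Minimal sketch of single record inserts with PDO; connection settings,
// file name and columns are hypothetical placeholders.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=test;charset=utf8mb4', 'user', 'secret');

$stmt = $pdo->prepare('INSERT INTO person (first_name, last_name) VALUES (?, ?)');

$handle = fopen('people.csv', 'r');

while (($row = fgetcsv($handle)) !== false) {
    $stmt->execute($row); // one round trip to the server per record
}

fclose($handle);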

In the benchmark this method proves to be significantly slower than the other methods discussed in this article.

Batch record inserts

An obvious improvement over the naive approach is to insert multiple records with each insert query.

For example one could construct a single insert query with multiple records:

INSERT INTO
    person (... columns ...)
VALUES
    (... row 1 ...), (... row 2 ...)

How many rows one should include in a single batch depends on the average size of the records. Too few rows and you will end up with too many round trips, while too many rows might make the process run out of memory.
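
As a rough sketch (again with placeholder connection settings, file name and columns, not the benchmark code itself) the batching could look like this in PHP with PDO:

<?php
// Rough sketch of batched inserts with PDO; batch size, file name and
// columns are hypothetical placeholders.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=test;charset=utf8mb4', 'user', 'secret');

$batchSize = 1000; // tune to the average record size and available memory

$flush = function (array $rows) use ($pdo) {
    // Build a "(?, ?), (?, ?), ..." placeholder list for the buffered rows.
    $placeholders = implode(', ', array_fill(0, count($rows), '(?, ?)'));
    $stmt = $pdo->prepare("INSERT INTO person (first_name, last_name) VALUES $placeholders");
    $stmt->execute(array_merge(...$rows)); // flatten all rows into one parameter list
};

$handle = fopen('people.csv', 'r');
$rows = [];

while (($row = fgetcsv($handle)) !== false) {
    $rows[] = $row;

    if (count($rows) === $batchSize) {
        $flush($rows);
        $rows = [];
    }
}

if ($rows !== []) {
    $flush($rows); // insert the remaining records
}

fclose($handle);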

This method scores much better than the single record inserts and is only slightly slower than the following bulk import method.

Bulk import

MySQL can process rows directly from files and insert them. Instead of doing many round trips for individual or batch inserts, we tell MySQL which file to load and what format the data is in.

You can load data either from a file on the database server (same physical system) or from the system your process is running on. The former is faster (it bypasses the network), but in general we don't have direct access to the file system of our database servers.

The bulk import is performed with the LOAD DATA statement:

LOAD DATA LOCAL INFILE '/path/to/people.csv'
INTO TABLE person
FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n'

The LOCAL keyword tells MySQL to read the file from the client (the machine the import process runs on). We also provide some details about our CSV file format.

There are more options; see the official documentation for all configuration parameters: dev.mysql.com/doc/refman/8.0/en/load-data.h..
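
One practical note: the LOCAL variant has to be allowed on both sides of the connection. A sketch of running the statement from PHP with PDO could look like the code below; the connection settings and file path are placeholders and it assumes local_infile is enabled on the server.

<?php
// Sketch of running LOAD DATA LOCAL INFILE through PDO; connection settings
// and file path are hypothetical. Assumes local_infile is enabled server side.
$pdo = new PDO(
    'mysql:host=127.0.0.1;dbname=test;charset=utf8mb4',
    'user',
    'secret',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true] // allow the client to send local files
);

$pdo->exec(
    "LOAD DATA LOCAL INFILE '/path/to/people.csv'
     INTO TABLE person
     FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\\n'"
);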

The benchmarks show that this is the most efficient method (of the three discussed here) to import big record sets.

Indices

Inserting data into a table that already has indices is slower than first inserting the data and adding the indices once that is done. Whenever possible, for example when importing into empty tables, try to add the indices after the import is finished.
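
For example, assuming a hypothetical person_import table with a last_name column, the indices could be added right after the import has finished:

<?php
// Sketch: add indices only after the data has been imported; connection
// settings, table, column and index names are hypothetical placeholders.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=test;charset=utf8mb4', 'user', 'secret');

$pdo->exec('ALTER TABLE person_import ADD INDEX idx_last_name (last_name)');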

Swapping of tables

When you are completely replacing the contents of a table with a new import, it pays off to import the data into a temporary import table first, add the indices, and then use the RENAME TABLE statement to swap the tables in one go.

RENAME TABLE
    person TO person_old,
    person_import TO person;

MySQL performs this in one go, locking and unlocking the tables involved, so other processes accessing the person table are not affected. In general the swap is instantaneous.

Benchmarks

To test the three discussed import methods I wrote some PHP scripts that import 2 million randomly generated person records. The project uses Docker, and the benchmarks are executed by running the run.sh script in the root folder of the project.

The project can be found at: github.com/rbruinier/MySQLBulkImportBenchmark

MySQL Results

METHOD                            INSERT (SEC)   ADDING INDICES (SEC)   TOTAL (SEC)
bulk (indices after import)       10.1           16.4                   26.5
bulk (indices before import)      111.5          -                      111.5
batch (indices after import)      23.4           14.7                   38.1
batch (indices before import)     142.1          -                      142.1
single* (indices after import)    1624.5         10.9                   1635.5
single* (indices before import)   1655.2         -                      1655.2

MariaDB Results

METHOD                            INSERT (SEC)   ADDING INDICES (SEC)   TOTAL (SEC)
bulk (indices after import)       8.9            16.1                   25.1
bulk (indices before import)      113.7          -                      113.7
batch (indices after import)      23.9           14.2                   38.0
batch (indices before import)     144.0          -                      144.0
single* (indices after import)    1518.2         11.5                   1529.8
single* (indices before import)   1592.5         -                      1592.5

* The single insert benchmarks are only performed on a subset of the records and the results are scaled to the full record set. This explains why the adding of indices is slightly faster for these benchmarks.

From the benchmarks it can be concluded that the bulk import (load data) is the most efficient way to import big record sets. It also shows the impact of having indices already in place before importing.

So even if the record set is not delivered as a CSV file (or another format that MySQL can load) it is worth converting it to CSV first and then using the bulk import functionality.

Conclusion

The ideal steps for an efficient import (of replaceable data) would look like the following:

  1. Parse data into CSV format (if not yet in that format)
  2. Create temporary import table without indices
  3. Perform bulk import into temporary table (use LOAD DATA statement)
  4. Add indices
  5. Perform swap (use RENAME TABLE statement)
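
Put together, a minimal sketch of steps 2 to 5 in PHP could look like the code below. All table, column, index and file names are placeholders and it assumes local_infile is enabled on the server.

<?php
// End-to-end sketch of steps 2 to 5; all table, column, index and file names
// are hypothetical placeholders.
$pdo = new PDO(
    'mysql:host=127.0.0.1;dbname=test;charset=utf8mb4',
    'user',
    'secret',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true]
);

$statements = [
    // 2. create the import table without indices
    'CREATE TABLE person_import (
        first_name VARCHAR(100) NOT NULL,
        last_name VARCHAR(100) NOT NULL
    )',
    // 3. bulk import the CSV data
    "LOAD DATA LOCAL INFILE '/path/to/people.csv'
     INTO TABLE person_import
     FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\\n'",
    // 4. add the indices after the data is in place
    'ALTER TABLE person_import ADD INDEX idx_last_name (last_name)',
    // 5. swap the tables in one go
    'RENAME TABLE person TO person_old, person_import TO person',
];

foreach ($statements as $sql) {
    $pdo->exec($sql);
}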