SQLite insert speed slows as the number of records increases due to an index

Original question

Background

It is well known that SQLite needs to be tuned to achieve insert speeds on the order of 50,000 inserts/s. There are many questions here about slow insert speeds, and a wealth of advice and benchmarks.

There are also claims that SQLite can handle large amounts of data, with reports of 50+ GB databases causing no problems with the right settings.

I have followed the tips here and elsewhere to achieve these speeds, and I'm happy with 35k-45k inserts/s. The problem I have is that all of the benchmarks only demonstrate fast insert speeds with <1m records. What I am seeing is that insert speed seems to be inversely proportional to table size.
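For reference, the tuning those tips boil down to looks roughly like the following Perl/DBI sketch: batched transactions, a prepared statement, and relaxed pragmas. The links table, its columns, and the next_tuple() data source are illustrative assumptions, not code from my project.

    #!/usr/bin/perl
    # Sketch of the usual SQLite insert tuning: batched transactions,
    # a prepared statement, and relaxed pragmas. "links", its columns and
    # next_tuple() are illustrative assumptions.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=links.db", "", "",
                           { RaiseError => 1, AutoCommit => 1 });

    $dbh->do("PRAGMA synchronous = OFF");     # skip fsync; risks corruption on power loss
    $dbh->do("PRAGMA journal_mode = DELETE");
    $dbh->do("PRAGMA cache_size = 10000");
    $dbh->do("PRAGMA temp_store = MEMORY");

    my $sth = $dbh->prepare(
        "INSERT INTO links (x_id, y_id, z_id) VALUES (?, ?, ?)");

    my $batch = 50_000;
    my $n     = 0;
    $dbh->begin_work;                         # one big transaction per batch
    while (my $row = next_tuple()) {          # next_tuple(): hypothetical data source
        $sth->execute(@$row);
        if (++$n % $batch == 0) {
            $dbh->commit;
            $dbh->begin_work;
        }
    }
    $dbh->commit;

Committing in large batches is what matters most; with the default per-statement transactions, insert speed is bounded by fsync latency rather than by SQLite itself.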

Use case

In my use case, I need to store 500m to 1b tuples ([x_id, y_id, z_id]) over a few years (1m rows/day) in a link table. The values are all integer IDs from 1 to 2,000,000. There is a single index on z_id.

Performance is great for the first 10m rows, ~35k inserts/s, but by the time the table has ~20m rows, performance starts to suffer. I'm now seeing about 100 inserts/s.

The size of the table is not particularly large: at 20m rows the on-disk size is around 500 MB.

The project is written in Perl.

Question

Is this just the reality of large tables in SQLite, or are there any secrets to sustaining high insert rates for tables with more than 10m rows?

Known workarounds I'd like to avoid if possible

  • Drop the index, add the records, and re-index: this is fine as a workaround, but it doesn't work when the DB still needs to be usable during updates; it won't do to make the database completely inaccessible for x minutes a day.
  • Break the table into smaller subtables/files: this will work in the short term, and I have already experimented with it. The problem is that I need to be able to retrieve data from the entire history on request, which means that eventually I'll hit the 62-table attachment limit. Attaching, gathering results into a temp table, and detaching hundreds of times per request seems like a lot of work and overhead (a sketch of that pattern follows this list), but I'll try it if there are no other alternatives.
  • Set SQLITE_FCNTL_CHUNK_SIZE: I don't know C (?!), so I'd rather not learn it just for this. I can't see any way to set this parameter from Perl, though.
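For what it's worth, the attach / gather-into-a-temp-table / detach pattern mentioned in the second item above looks roughly like this; the shards/links_*.db file layout, the links table, and the z_id lookup are all illustrative assumptions.

    # Rough sketch of the attach / gather / detach pattern across shard files.
    use strict;
    use warnings;
    use DBI;

    my $dbh  = DBI->connect("dbi:SQLite:dbname=current.db", "", "",
                            { RaiseError => 1 });
    my $z_id = 123456;    # example lookup key

    $dbh->do("CREATE TEMP TABLE results (x_id INTEGER, y_id INTEGER, z_id INTEGER)");

    for my $shard (glob "shards/links_*.db") {       # assumed shard layout
        $dbh->do("ATTACH DATABASE " . $dbh->quote($shard) . " AS shard");
        $dbh->do("INSERT INTO results
                  SELECT x_id, y_id, z_id FROM shard.links WHERE z_id = ?",
                 undef, $z_id);
        $dbh->do("DETACH DATABASE shard");
    }

    # read the gathered rows back out of the temp table
    my $rows = $dbh->selectall_arrayref("SELECT * FROM results");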

UPDATE

Following Tim's suggestion that an index, rather than SQLite's ability to handle large data sets as such, was causing the ever slower insert times, I ran a benchmark comparison with the following settings:

  • inserted rows: 14 million
  • record batch size: 50,000 records
  • cache_size pragma: 10,000
  • page_size pragma: 4,096
  • temp_store pragma: memory
  • journal_mode pragma: delete
  • synchronous pragma: off

In my project, as in the benchmark results below, a file-based temporary table is created and SQLite's built-in support for importing CSV data is used. The temporary table is then attached to the receiving database and sets of 50,000 rows are inserted with an insert-select statement. Therefore, the insert times do not reflect file-to-database insert speeds, but rather table-to-table insert speed. Taking the CSV import time into account would reduce the speeds by 25-50% (a very rough estimate; it doesn't take long to import the CSV data).
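That workflow looks roughly like the sketch below. It assumes staging.db has already been populated from the CSV file (for example with the sqlite3 shell's .import command) and that the 50,000-row batches are taken by rowid range; the table names and batching scheme are assumptions for illustration, not details from the original setup.

    # Copy rows from an attached staging database into the target database
    # in 50,000-row INSERT ... SELECT batches.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=target.db", "", "",
                           { RaiseError => 1 });
    $dbh->do("PRAGMA synchronous = OFF");
    $dbh->do("PRAGMA cache_size = 10000");

    $dbh->do("ATTACH DATABASE 'staging.db' AS staging");

    my ($max) = $dbh->selectrow_array("SELECT max(rowid) FROM staging.links");
    my $batch = 50_000;

    for (my $lo = 1; $lo <= $max; $lo += $batch) {
        my $hi = $lo + $batch - 1;
        # each do() runs in its own implicit transaction
        $dbh->do("INSERT INTO main.links (x_id, y_id, z_id)
                  SELECT x_id, y_id, z_id
                  FROM staging.links
                  WHERE rowid BETWEEN ? AND ?", undef, $lo, $hi);
    }

    $dbh->do("DETACH DATABASE staging");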

Clearly, the presence of the index slows down the insert speed as the table size increases.

[Plot: SQLite insert speed vs. table size]

From the data above, it's clear that the correct answer can be attributed to Tim's answer rather than to the assertions that SQLite simply can't handle it. Clearly it can handle large data sets if indexing that data set is not part of your use case. I had been using SQLite for just that, as a backend for a logging system, which for now does not need to be indexed, so I was quite surprised at the slowdown I experienced.

Conclusion

If anyone finds themselves wanting to store a large amount of data with SQLite and have it indexed, using shards may be the answer. I eventually settled on using the first three characters of an MD5 hash of the z column to determine assignment to one of 4,096 databases. Since my use case is primarily archival in nature, the schema will not change and queries will never require shard walking. There is a limit to how large the databases can grow, since extremely old data will be reduced and eventually discarded, so this combination of sharding, pragma settings, and even some denormalization gives me a nice balance which, based on the benchmarking above, maintains an insert speed of at least 10k inserts per second.
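A minimal sketch of that shard routing, assuming a shards/links_XXX.db file naming scheme and a pre-created links table in each shard (both assumptions for illustration):

    # Route a record to one of 16^3 = 4,096 database files based on the
    # first three hex characters of the MD5 digest of its z value.
    use strict;
    use warnings;
    use DBI;
    use Digest::MD5 qw(md5_hex);

    sub shard_dbh {
        my ($z) = @_;
        my $prefix = substr(md5_hex($z), 0, 3);    # e.g. "a3f" -> shard a3f
        return DBI->connect("dbi:SQLite:dbname=shards/links_$prefix.db",
                            "", "", { RaiseError => 1 });
    }

    my $dbh = shard_dbh(123456);                   # example z value
    $dbh->do("INSERT INTO links (x_id, y_id, z_id) VALUES (?, ?, ?)",
             undef, 1, 2, 123456);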

+42
optimization database insert sqlite sqlite3
Apr 03 '13 at 4:12
5 answers

If you need to find a specific z_id and the x_ids and y_ids linked to it (as distinct from quickly selecting a range of z_ids), you could look into an unindexed hash-table nested-relational database, which would allow you to instantly find your way to a particular z_id in order to get its y_ids and x_ids, without the indexing overhead and the concomitant degraded performance during inserts as the index grows. In order to avoid clumping (bucket collisions), choose a key hashing algorithm that puts the greatest weight on the digits of z_id with the greatest variation (right-weighted).
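As a toy illustration of right-weighted hashing (not code from any particular product): folding the digits in from the right gives the rightmost, fastest-varying digits the largest polynomial weight, so mostly sequential z_id values still spread evenly across buckets.

    # Toy right-weighted key hash: the rightmost digit is folded in first,
    # so it ends up with the largest multiplier in the polynomial hash.
    sub right_weighted_bucket {
        my ($z_id, $n_buckets) = @_;
        my $h = 0;
        $h = $h * 31 + $_ for reverse split //, $z_id;
        return $h % $n_buckets;
    }

    my $bucket = right_weighted_bucket(1_234_567, 4096);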

P.S. A database that uses a b-tree may, for example, initially appear faster than a db that uses linear hashing, but insert performance will remain level on the linear hash as performance on the b-tree starts to degrade.

P.P.S. To answer kawing-chiu's question: the core feature relevant here is that such a database relies on so-called "sparse" tables, in which the physical location of a record is determined by a hashing algorithm that takes the record key as input. This approach permits a seek directly to the record's location in the table without the intermediary of an index. Since there is no need to traverse indexes or to rebalance them, insert times remain constant as the table becomes more densely populated. With a b-tree, by contrast, insert times degrade as the index tree grows. OLTP applications with large numbers of concurrent inserts can benefit from such a sparse-table approach. The records are scattered throughout the table. The downside of records being scattered across the "tundra" of the sparse table is that gathering large sets of records that share a common value, such as a postal code, can be slower. The hashed sparse-table approach is optimized for inserting and retrieving individual records, and for retrieving networks of related records, not large sets of records that have some field value in common.

A nested-relational database is one that permits tuples within a column of a row.

+10
Apr 04 '13 at 11:25

Great question and very interesting follow-up!

I would just like to make a brief remark: you mentioned that breaking the table into smaller subtables/files and attaching them later is not an option, because you would quickly reach the hard limit of 62 attached databases. While this is completely true, I don't think you have considered a mid-way option: sharding the data into several tables but keeping them in the same single database (file).




I made a very crude benchmark just to make sure my suggestion really has an impact on performance.

Scheme:

 CREATE TABLE IF NOT EXISTS "test_$i" ( "i" integer NOT NULL, "md5" text(32) NOT NULL ); 

Data - 2 million rows:

  • i = 1..2,000,000
  • md5 = MD5 hex digest of i

Each transaction = 50,000 INSERTs.
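The benchmark driver itself is not included in the answer; a reconstruction under the stated parameters (2 million rows of (i, md5_hex(i)), 50,000 INSERTs per transaction, and, for the third run, one test_$i table per 100,000 rows) might look like this Perl/DBI sketch:

    # Reconstructed benchmark driver; the original code is not shown and may
    # have been written in another language. Parameters as stated above.
    use strict;
    use warnings;
    use DBI;
    use Digest::MD5 qw(md5_hex);
    use Time::HiRes qw(time);

    my $dbh = DBI->connect("dbi:SQLite:dbname=bench.db", "", "",
                           { RaiseError => 1 });

    my $total     = 2_000_000;
    my $batch     = 50_000;     # rows per transaction
    my $per_table = 100_000;    # third run: 20 tables; set to $total for one table
    my $use_index = 1;          # second and third runs; 0 for the index-free run

    for (my $start = 0; $start < $total; $start += $batch) {
        my $table = "test_" . int($start / $per_table);
        $dbh->do(qq{CREATE TABLE IF NOT EXISTS "$table"
                    ("i" integer NOT NULL, "md5" text(32) NOT NULL)});
        $dbh->do(qq{CREATE INDEX IF NOT EXISTS "${table}_md5" ON "$table" ("md5")})
            if $use_index;

        my $sth = $dbh->prepare(qq{INSERT INTO "$table" ("i", "md5") VALUES (?, ?)});
        my $t0  = time();
        $dbh->begin_work;
        for my $i ($start + 1 .. $start + $batch) {
            $sth->execute($i, md5_hex($i));
        }
        $dbh->commit;
        printf "%d..%d records inserted in %.2f seconds\n",
               $start, $start + $batch, time() - $t0;
    }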




Databases: 1; Tables: 1; Indexes: 0

 0..50000 records inserted in 1.87 seconds
 50000..100000 records inserted in 1.92 seconds
 100000..150000 records inserted in 1.97 seconds
 150000..200000 records inserted in 1.99 seconds
 200000..250000 records inserted in 2.19 seconds
 250000..300000 records inserted in 1.94 seconds
 300000..350000 records inserted in 1.94 seconds
 350000..400000 records inserted in 1.94 seconds
 400000..450000 records inserted in 1.94 seconds
 450000..500000 records inserted in 2.50 seconds
 500000..550000 records inserted in 1.94 seconds
 550000..600000 records inserted in 1.94 seconds
 600000..650000 records inserted in 1.93 seconds
 650000..700000 records inserted in 1.94 seconds
 700000..750000 records inserted in 1.94 seconds
 750000..800000 records inserted in 1.94 seconds
 800000..850000 records inserted in 1.93 seconds
 850000..900000 records inserted in 1.95 seconds
 900000..950000 records inserted in 1.94 seconds
 950000..1000000 records inserted in 1.94 seconds
 1000000..1050000 records inserted in 1.95 seconds
 1050000..1100000 records inserted in 1.95 seconds
 1100000..1150000 records inserted in 1.95 seconds
 1150000..1200000 records inserted in 1.95 seconds
 1200000..1250000 records inserted in 1.96 seconds
 1250000..1300000 records inserted in 1.98 seconds
 1300000..1350000 records inserted in 1.95 seconds
 1350000..1400000 records inserted in 1.95 seconds
 1400000..1450000 records inserted in 1.95 seconds
 1450000..1500000 records inserted in 1.95 seconds
 1500000..1550000 records inserted in 1.95 seconds
 1550000..1600000 records inserted in 1.95 seconds
 1600000..1650000 records inserted in 1.95 seconds
 1650000..1700000 records inserted in 1.96 seconds
 1700000..1750000 records inserted in 1.95 seconds
 1750000..1800000 records inserted in 1.95 seconds
 1800000..1850000 records inserted in 1.94 seconds
 1850000..1900000 records inserted in 1.95 seconds
 1900000..1950000 records inserted in 1.95 seconds
 1950000..2000000 records inserted in 1.95 seconds

Database file size: 89.2 MiB.




Databases: 1; Tables: 1; Indexes: 1 (md5)

 0..50000 records inserted in 2.90 seconds
 50000..100000 records inserted in 11.64 seconds
 100000..150000 records inserted in 10.85 seconds
 150000..200000 records inserted in 10.62 seconds
 200000..250000 records inserted in 11.28 seconds
 250000..300000 records inserted in 12.09 seconds
 300000..350000 records inserted in 10.60 seconds
 350000..400000 records inserted in 12.25 seconds
 400000..450000 records inserted in 13.83 seconds
 450000..500000 records inserted in 14.48 seconds
 500000..550000 records inserted in 11.08 seconds
 550000..600000 records inserted in 10.72 seconds
 600000..650000 records inserted in 14.99 seconds
 650000..700000 records inserted in 10.85 seconds
 700000..750000 records inserted in 11.25 seconds
 750000..800000 records inserted in 17.68 seconds
 800000..850000 records inserted in 14.44 seconds
 850000..900000 records inserted in 19.46 seconds
 900000..950000 records inserted in 16.41 seconds
 950000..1000000 records inserted in 22.41 seconds
 1000000..1050000 records inserted in 24.68 seconds
 1050000..1100000 records inserted in 28.12 seconds
 1100000..1150000 records inserted in 26.85 seconds
 1150000..1200000 records inserted in 28.57 seconds
 1200000..1250000 records inserted in 29.17 seconds
 1250000..1300000 records inserted in 36.99 seconds
 1300000..1350000 records inserted in 30.66 seconds
 1350000..1400000 records inserted in 32.06 seconds
 1400000..1450000 records inserted in 33.14 seconds
 1450000..1500000 records inserted in 47.74 seconds
 1500000..1550000 records inserted in 34.51 seconds
 1550000..1600000 records inserted in 39.16 seconds
 1600000..1650000 records inserted in 37.69 seconds
 1650000..1700000 records inserted in 37.82 seconds
 1700000..1750000 records inserted in 41.43 seconds
 1750000..1800000 records inserted in 49.58 seconds
 1800000..1850000 records inserted in 44.08 seconds
 1850000..1900000 records inserted in 57.17 seconds
 1900000..1950000 records inserted in 50.04 seconds
 1950000..2000000 records inserted in 42.15 seconds

Database file size: 181.1 MiB.




Databases: 1; Tables: 20 (one per 100,000 entries); Indexes: 1 (md5)

 0..50000 records inserted in 2.91 seconds
 50000..100000 records inserted in 10.30 seconds
 100000..150000 records inserted in 10.85 seconds
 150000..200000 records inserted in 10.45 seconds
 200000..250000 records inserted in 10.11 seconds
 250000..300000 records inserted in 11.04 seconds
 300000..350000 records inserted in 10.25 seconds
 350000..400000 records inserted in 10.36 seconds
 400000..450000 records inserted in 11.48 seconds
 450000..500000 records inserted in 10.97 seconds
 500000..550000 records inserted in 10.86 seconds
 550000..600000 records inserted in 10.35 seconds
 600000..650000 records inserted in 10.77 seconds
 650000..700000 records inserted in 10.62 seconds
 700000..750000 records inserted in 10.57 seconds
 750000..800000 records inserted in 11.13 seconds
 800000..850000 records inserted in 10.44 seconds
 850000..900000 records inserted in 10.40 seconds
 900000..950000 records inserted in 10.70 seconds
 950000..1000000 records inserted in 10.53 seconds
 1000000..1050000 records inserted in 10.98 seconds
 1050000..1100000 records inserted in 11.56 seconds
 1100000..1150000 records inserted in 10.66 seconds
 1150000..1200000 records inserted in 10.38 seconds
 1200000..1250000 records inserted in 10.24 seconds
 1250000..1300000 records inserted in 10.80 seconds
 1300000..1350000 records inserted in 10.85 seconds
 1350000..1400000 records inserted in 10.46 seconds
 1400000..1450000 records inserted in 10.25 seconds
 1450000..1500000 records inserted in 10.98 seconds
 1500000..1550000 records inserted in 10.15 seconds
 1550000..1600000 records inserted in 11.81 seconds
 1600000..1650000 records inserted in 10.80 seconds
 1650000..1700000 records inserted in 11.06 seconds
 1700000..1750000 records inserted in 10.24 seconds
 1750000..1800000 records inserted in 10.57 seconds
 1800000..1850000 records inserted in 11.54 seconds
 1850000..1900000 records inserted in 10.80 seconds
 1900000..1950000 records inserted in 11.07 seconds
 1950000..2000000 records inserted in 13.27 seconds

Database file size: 180.1 MiB.




As you can see, the insert speed remains fairly constant once you shard the data into several tables.

+5
Jun 14 '13 at 13:47

Unfortunately, I would say this is a limitation of large tables in SQLite. It is not designed to operate on large-scale or large-volume data. While I understand it may drastically increase project complexity, you are probably better off researching more sophisticated database solutions appropriate to your needs.

From everything you link, it looks like table size versus access speed is a direct trade-off. You can't have both.

+2
Apr 03 '13 at 4:28

In my project I could not shard the database, as it is indexed on different columns. To speed up the inserts, I put the database during creation on /dev/shm (= Linux ramdisk) and then copied it to the local disk. That obviously only works well for a write-once, read-many database.
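A minimal sketch of that approach (the paths are illustrative):

    # Build the database on a tmpfs mount, then copy the finished file to disk.
    use strict;
    use warnings;
    use DBI;
    use File::Copy qw(copy);

    my $tmp   = "/dev/shm/build.db";    # Linux ramdisk: writes never touch the disk
    my $final = "/var/data/final.db";   # permanent location (example path)

    my $dbh = DBI->connect("dbi:SQLite:dbname=$tmp", "", "", { RaiseError => 1 });
    # ... create tables, bulk insert, build indexes here ...
    $dbh->disconnect;                   # make sure everything is flushed to $tmp

    copy($tmp, $final) or die "copy failed: $!";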

+1
Jul 08 '15 at 16:21

I suspect that hash-index collisions are what cause the slow insert speed.

When there are very many rows in one table, collisions of the hashed value of the indexed column will occur more frequently. That means the SQLite engine has to compute the hash value two or three times, or maybe even four times, before it gets a distinct value.

So I think this is the root cause of SQLite slowing down when there are many rows in a table.

That would also explain why sharding avoids the problem. Could a true expert in the SQLite domain confirm or refute this point?

0
Mar 11 '16 at 15:29