Original question
Background
It is well known that SQLite needs to be tuned to achieve insert speeds on the order of 50,000 inserts/s. There are many questions here about slow insert speeds and a wealth of advice and benchmarks.
There are also claims that SQLite can handle large amounts of data, with reports of 50+ GB not causing any problems with the right settings.
I have followed the advice here and elsewhere to achieve these speeds, and I am happy with 35k-45k inserts/s. The problem is that all of the benchmarks only demonstrate fast insert speeds with < 1m records. What I am seeing is that insert speed appears to be inversely proportional to table size.
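For context, the tips in question mostly boil down to wrapping many inserts in one transaction, reusing a prepared statement, and relaxing the journal/synchronous pragmas. A minimal Perl/DBI sketch of that pattern follows; the table, pragma values, and batch size are only illustrative, not the exact settings used later.

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "dbi:SQLite:dbname=test.db", "", "",
    { RaiseError => 1, AutoCommit => 1 } );

# Trade durability for throughput: no fsync, journal kept in memory.
$dbh->do("PRAGMA synchronous = OFF");
$dbh->do("PRAGMA journal_mode = MEMORY");

$dbh->do("CREATE TABLE IF NOT EXISTS t (a INTEGER, b INTEGER, c INTEGER)");
my $sth = $dbh->prepare("INSERT INTO t (a, b, c) VALUES (?, ?, ?)");

# One explicit transaction per 50,000 rows instead of SQLite's implicit
# transaction per INSERT, which is the single biggest win.
$dbh->begin_work;
for my $i ( 1 .. 100_000 ) {
    $sth->execute( $i, $i * 2, $i * 3 );
    if ( $i % 50_000 == 0 ) {
        $dbh->commit;
        $dbh->begin_work;
    }
}
$dbh->commit;
```

Committing in large batches avoids the per-row journal and fsync overhead that makes unbatched inserts so slow, while keeping each transaction a manageable size.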
Use case
I need to store 500m to 1b tuples ( [x_id, y_id, z_id] ) in a link table over several years (1m rows/day). The values are integer IDs ranging from 1 to 2,000,000. There is a single index on z_id .
Performance is great for the first 10m rows, ~35k inserts/s, but by the time the table has ~20m rows, performance starts to suffer. I'm now seeing about 100 inserts/s.
The size of the table is not particularly large: at 20m rows the on-disk size is around 500MB.
The project is written in Perl.
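For reference, the question does not show the actual DDL, but the link table described above might look roughly like this via DBI/DBD::SQLite; the table and index names and column types are assumptions.

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "dbi:SQLite:dbname=links.db", "", "",
    { RaiseError => 1, AutoCommit => 1 } );

# Three integer IDs per row, as described above.
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS link (
        x_id INTEGER NOT NULL,
        y_id INTEGER NOT NULL,
        z_id INTEGER NOT NULL
    )
});

# The single index on z_id. Every insert has to update this B-tree as well,
# and that maintenance cost grows as the table grows.
$dbh->do("CREATE INDEX IF NOT EXISTS idx_link_z ON link (z_id)");
```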
Question
Is this just the reality of large tables in SQLite, or are there any secrets to maintaining high insert rates for tables with > 10m rows?
Known workarounds I'd like to avoid if possible
- Drop the index, add the records, and re-index: this is fine as a workaround, but it doesn't work when the database still needs to be usable during updates; making the database completely inaccessible for x minutes/day is not an option (see the sketch after this list).
- Split the table into smaller subtables/files: this will work in the short term, and I have already experimented with it. The problem is that I need to be able to retrieve data across the entire history on request, which means that eventually I'd hit the 62-table ATTACH limit. Attaching, collecting results in a temp table, and detaching hundreds of times per request seems like a lot of work and overhead, but I'll try it if there are no other alternatives.
- Set SQLITE_FCNTL_CHUNK_SIZE : I don't know C (?!), so I'd rather not learn it just to get this done, and I can't see any way to set this parameter from Perl anyway.
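For completeness, the drop/reload/re-index workaround from the first bullet would look something like the following sketch; the table, index, and the pending_rows() data source are hypothetical placeholders.

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "dbi:SQLite:dbname=links.db", "", "",
    { RaiseError => 1, AutoCommit => 1 } );

# 1. Drop the index so inserts no longer pay the B-tree maintenance cost.
$dbh->do("DROP INDEX IF EXISTS idx_link_z");

# 2. Bulk-load inside one large transaction (pending_rows() stands in for
#    whatever produces the [x_id, y_id, z_id] tuples).
my $sth = $dbh->prepare("INSERT INTO link (x_id, y_id, z_id) VALUES (?, ?, ?)");
$dbh->begin_work;
$sth->execute(@$_) for @{ pending_rows() };
$dbh->commit;

# 3. Rebuild the index in one pass. Queries on z_id fall back to full table
#    scans until this finishes, which is why the workaround is ruled out above.
$dbh->do("CREATE INDEX idx_link_z ON link (z_id)");

sub pending_rows { [ [ 1, 2, 3 ] ] }    # placeholder data source
```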
UPDATE
Following Tim's suggestion that the index, rather than SQLite's claimed ability to handle large data sets, was causing the increasingly slow insert times, I ran a benchmark comparison with the following settings:
- inserted rows: 14 million
- record batch size: 50,000 records
- cache_size pragma: 10,000
- page_size pragma: 4,096
- temp_store pragma: memory
- journal_mode pragma: delete
- synchronous pragma: off
In my project, as in the benchmark results below, a file-based temporary table is created and SQLite's built-in support for importing CSV data is used. The temporary table is then attached to the receiving database and sets of 50,000 rows are inserted with insert-select. Therefore, the insert times do not reflect file-to-database insert speed, but rather table-to-table insert speed. Taking the CSV import time into account would reduce the speeds by 25-50% (a very rough estimate; it does not take long to import the CSV data).
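A rough sketch of that workflow, assuming the sqlite3 shell is used for the CSV import (header-less CSV) and rowid ranges are used for the 50,000-row slices; file names, table names, and batching details are assumptions, not the actual benchmark code.

```perl
use strict;
use warnings;
use DBI;
use Time::HiRes qw(time);

# 1. Import the CSV into a file-based staging table using the sqlite3 shell's
#    built-in CSV support.
open my $sqlite3, '|-', 'sqlite3', 'staging.db' or die "sqlite3: $!";
print {$sqlite3} "CREATE TABLE IF NOT EXISTS staged (x_id INTEGER, y_id INTEGER, z_id INTEGER);\n";
print {$sqlite3} ".mode csv\n";
print {$sqlite3} ".import rows.csv staged\n";
close $sqlite3 or die "CSV import failed";

# 2. Attach the staging database to the receiving database and apply the
#    pragmas listed above (page_size only takes effect on a fresh or
#    vacuumed database).
my $dbh = DBI->connect( "dbi:SQLite:dbname=links.db", "", "",
    { RaiseError => 1, AutoCommit => 1 } );
$dbh->do("PRAGMA cache_size = 10000");
$dbh->do("PRAGMA page_size = 4096");
$dbh->do("PRAGMA temp_store = memory");
$dbh->do("PRAGMA journal_mode = delete");
$dbh->do("PRAGMA synchronous = off");
$dbh->do("ATTACH DATABASE 'staging.db' AS staging");

# 3. Copy rows across in 50,000-row slices with insert-select, timing each
#    slice to measure table-to-table insert speed.
my $max = $dbh->selectrow_array("SELECT max(rowid) FROM staging.staged") // 0;
for ( my $lo = 0 ; $lo < $max ; $lo += 50_000 ) {
    my $t0 = time;
    $dbh->do( "INSERT INTO link (x_id, y_id, z_id)
               SELECT x_id, y_id, z_id FROM staging.staged
               WHERE rowid > ? AND rowid <= ?", undef, $lo, $lo + 50_000 );
    printf "rows %d-%d: %.1fs\n", $lo + 1, $lo + 50_000, time - $t0;
}
$dbh->do("DETACH DATABASE staging");
```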
Clearly, having the index slows down insert speed as the table size grows.

From the data above it is clear that the correct answer can be assigned to Tim's answer rather than to the assertions that SQLite simply can't handle it. Clearly it can handle large data sets, as long as indexing that data set is not part of your use case. I have been using SQLite for just that, as the backend of a logging system that does not need to be indexed, for a while now, so I was quite surprised at the slowdown I experienced.
Conclusion
If anyone finds themselves wanting to store a large amount of data with SQLite and have it indexed, sharding may be the answer. I eventually settled on using the first three characters of an MD5 hash of the z column to determine assignment to one of 4,096 databases. Since my use case is primarily archival in nature, the schema will not change and queries will never require shard walking. There is a limit to how large the databases can grow, since extremely old data will be reduced and eventually discarded, so this combination of sharding, pragma settings, and even some denormalisation gives me a nice balance that, based on the benchmarking above, maintains an insert speed of at least 10k inserts/second.
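A sketch of how that shard assignment might be implemented; the exact key being hashed and the file-naming scheme are assumptions for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Map a z value to one of 4,096 shard databases using the first three hex
# characters of its MD5 hash ("000" .. "fff").
sub shard_for {
    my ($z_key) = @_;
    my $prefix = substr( md5_hex($z_key), 0, 3 );
    return "shard_$prefix.db";
}

# The same z value always maps to the same file, so inserts and queries for
# a given z never require walking the shards.
print shard_for(12345), "\n";    # prints something like "shard_xxx.db"
```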