SQLite or MySQL for large datasets

I work with large data sets (10 million records, sometimes 100 million records) and want to use a database program that communicates well with R. I am trying to decide between MySQL and SQLite. The data is static, but I need to run many queries against it.

This SQLite help page says:

"With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (2^41 bytes). And even if it could handle larger databases, SQLite stores the entire database in a single disk file, and many file systems limit the maximum size of files to something less than this. So if you are contemplating databases of this magnitude, you would do well to consider using a client/server database engine that spreads its content across multiple disk files, and perhaps across multiple volumes."

I'm not sure what that means. When I experimented with MySQL and SQLite, MySQL seemed faster, but I have not constructed rigorous speed tests. I am wondering whether MySQL is a better choice for me than SQLite because of the size of my dataset. The description above seems to suggest this may be the case, but my data is nowhere near 2 TB.
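As a side note, a minimal sketch of where that limit comes from, assuming an existing SQLite file named data.sqlite (a placeholder name): the theoretical ceiling is page_size × max_page_count, both of which SQLite exposes as pragmas.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "data.sqlite")   # hypothetical file name

page_size      <- dbGetQuery(con, "PRAGMA page_size;")[[1]]
max_page_count <- dbGetQuery(con, "PRAGMA max_page_count;")[[1]]

# Theoretical ceiling for this database file, in terabytes
cat("max database size (TB):", page_size * max_page_count / 2^40, "\n")

dbDisconnect(con)
```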

There was a discussion on Stack Overflow that touched on this and pointed to the same SQLite information page, but it did not fully resolve the question.

I would be grateful for any insight into this maximum-file-size limitation imposed by the file system, and into how it can affect the speed of indexing tables and running queries. That would really help me decide which database to use for my analysis.
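For context, this is roughly what "communicating well with R" looks like in either case through the DBI package; a minimal sketch in which the file name, table, column, and MySQL credentials are placeholders, and my_data_frame stands in for the 10-100 million row data set.

```r
library(DBI)

# SQLite: the whole database is one local file
con <- dbConnect(RSQLite::SQLite(), "analysis.sqlite")

# MySQL (via RMariaDB; RMySQL works similarly): a client/server engine
# con <- dbConnect(RMariaDB::MariaDB(), host = "localhost", user = "analyst",
#                  password = "secret", dbname = "analysis")

# Either way the DBI workflow is the same: load once, index, then query
dbWriteTable(con, "records", my_data_frame)
dbExecute(con, "CREATE INDEX idx_records_id ON records (id)")
res <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM records WHERE id BETWEEN 1 AND 1000")

dbDisconnect(con)
```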

+6
4 answers

The SQLite database engine stores the entire database in a single file. This may not be very efficient for incredibly large databases (the SQLite limit is 2 TB, as you found in the documentation). In addition, SQLite is limited to one user at a time. If your application is web-based or might be multithreaded (like an AsyncTask on Android), MySQL is probably the way to go.

Personally, since you have already run tests and MySQL was faster, I would just go with MySQL. It will be more scalable going forward and will let you do more.

+6

"I'm not sure what that means. When I experimented with MySQL and SQLite, MySQL seemed faster, but I have not constructed rigorous speed tests."

The short version:

  • If your application needs to fit on a phone or some other embedded system, use SQLite. That is what it was designed for.

  • If your application may ever need more than one concurrent connection, do not use SQLite. Use PostgreSQL, MySQL with InnoDB, etc.

+5

It seems (in R, at least) that SQLite is great for ad hoc analysis. With RSQLite or sqldf it is really easy to load data and get started. But for data that you will use over and over again, it seems to me that MySQL (or SQL Server) is the way to go, because it offers many more options for modifying your database (for example, adding or changing keys).
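A rough illustration of both workflows (the data frame, table, and column names here are made up): sqldf runs SQL directly against a data frame for quick ad hoc exploration, while pushing the data into a database through DBI gives you something persistent that you can keep tuning with indexes and keys.

```r
library(sqldf)   # ad hoc: SQL straight against an in-memory data frame
hits <- sqldf("SELECT grp, AVG(value) AS mean_value FROM my_df GROUP BY grp")

library(DBI)     # reusable: persist the data, then modify the schema later
con <- dbConnect(RSQLite::SQLite(), "reusable.sqlite")
dbWriteTable(con, "measurements", my_df)
dbExecute(con, "CREATE INDEX idx_measurements_grp ON measurements (grp)")
dbDisconnect(con)
```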

+3

MySQL, if you are mainly using this as a web service. SQLite, if you want it to be able to work offline.

SQLite is usually much faster, since most (or ALL) of the data and indexes will be cached in memory. However, in the case of SQLite, if the data is split across several tables, or even across several SQLite database files, then in my experience so far, even with millions of records (I have yet to reach 100 million), it is much more efficient than MySQL (which has to make up for network latency, etc.). This holds, however, only when the records are split across different tables and each query targets just those specific tables (rather than querying all of them).
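A sketch of the "several SQLite database files" idea using SQLite's ATTACH DATABASE (all file, table, and column names are hypothetical): each shard lives in its own file, and a query only ever touches the shard it needs.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "items_part1.sqlite")

# Attach a second shard file under its own schema name
dbExecute(con, "ATTACH DATABASE 'items_part2.sqlite' AS part2")

# Hit only the shard/table that holds the UID we care about
row <- dbGetQuery(con, "SELECT * FROM part2.items WHERE uid = 123456")

dbExecute(con, "DETACH DATABASE part2")
dbDisconnect(con)
```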

An example is an item database used in a simple game. While this may not sound like much, a UID is issued for even minor variations, so the generator quickly works its way up to over a million sets of "stats" with variations. However, this was mainly because every 1000 sets of records was split among different tables (since we mostly retrieve records via their UID). Although the performance of this splitting was never properly measured, the queries we ran were roughly 10 times faster than MySQL (mainly due to network latency).

Amusingly, though, we ended up reducing the database to a few thousand records, with [pre-fix]/[suf-fix] parameters determining the variations (like Diablo, only that it was hidden), which turned out to be much faster at the end of the day.

On a side note, though, my case was mainly one where queries were issued one after another (each waiting for the one in front of it). If, however, you can open multiple connections and send multiple queries to the server at the same time, the performance drop on the MySQL side is more than made up for on yours, assuming those queries do not branch off or depend on one another (e.g. issue this query if the result of that one is X, else issue another).
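A sketch of that last point in R (server details, table, and UID ranges are placeholders): each worker opens its own MySQL connection and sends an independent query, so the requests overlap instead of queuing up behind one another. mclapply forks, so this sketch assumes a Unix-alike system.

```r
library(parallel)

uid_ranges <- list(c(1L, 1000L), c(1001L, 2000L), c(2001L, 3000L), c(3001L, 4000L))

results <- mclapply(uid_ranges, function(rng) {
  # One connection per worker; independent queries never wait on each other
  con <- DBI::dbConnect(RMariaDB::MariaDB(), host = "db.example.com",
                        user = "analyst", password = "secret", dbname = "game")
  on.exit(DBI::dbDisconnect(con))
  DBI::dbGetQuery(con, sprintf(
    "SELECT * FROM items WHERE uid BETWEEN %d AND %d", rng[1], rng[2]))
}, mc.cores = 4)
```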

+1

Source: https://habr.com/ru/post/890288/