A storage engine for large volumes of continuously inserted data that must be available instantly

Our server (several Java applications on Debian) processes incoming data (GNSS observations), which should be:

  • delivered to other applications immediately (with a delay under 200 ms),
  • archived for future use.

Sometimes (several times a day) about a million archived records have to be retrieved from the database. Each record holds about 12 double-precision fields plus a timestamp and some identifiers. There are no UPDATEs; DELETEs are very rare but massive. The incoming stream runs at up to one hundred records per second. Given all this, I need to choose a data storage mechanism.

I tried using MySQL (InnoDB): one application inserts, and the others continuously poll the last record identifier and, when it changes, retrieve the new records (the setup is sketched after the list below). This part works fine. But I ran into the following problems:

  • Records are quite large (about 200-240 bytes per record).
  • Retrieving millions of archived records is unacceptably slow (tens of minutes or more).
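
For concreteness, the current setup looks roughly like this. This is a hypothetical reconstruction; the real table, column names, and types differ:

    -- Current InnoDB table: about 12 DOUBLE fields plus a timestamp
    -- and a couple of identifiers (only two DOUBLEs shown here).
    CREATE TABLE observations (
        seq  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        id   SMALLINT UNSIGNED NOT NULL,   -- data-source identifier
        time INT UNSIGNED NOT NULL,        -- observation timestamp
        f1   DOUBLE NOT NULL,
        f2   DOUBLE NOT NULL,
        KEY idx_id_time (id, time)
    ) ENGINE=InnoDB;

    -- Readers poll for new data by the last seen identifier:
    SELECT * FROM observations WHERE seq > @last_seen ORDER BY seq;

    -- Occasional archive retrieval (around a million rows):
    SELECT * FROM observations WHERE id = 1 AND time BETWEEN 2000 AND 3000;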

Plain file storage would be very simple to implement (there are no inserts into the middle of the data, and retrieval is mostly of the form "WHERE ID = 1 AND TIME BETWEEN 2000 AND 3000"), but it has other problems:

  • Detecting newly arrived data may not be so easy.
  • Other data, such as logs and configs, live in the same database, and I would prefer to keep one database for everything.

Can you advise a suitable database engine (SQL is preferred but not required)? Or perhaps MySQL can be configured to reduce record size and speed up retrieval of contiguous ranges of data?

MongoDB is unacceptable because database size is limited on 32-bit machines. Any engine that does not provide quick access to recently inserted data is also unacceptable.

2 answers

There is only so much you can do about the time it takes to pull millions of records off disk. Your 32-bit requirement also limits how much RAM you can use for in-memory data structures. But if you want to stay with MySQL, you can get good performance by combining several table types.

If you need really fast, non-blocking inserts, you can use the BLACKHOLE table type together with replication: the server receiving the inserts has a BLACKHOLE table, which replicates to another server hosting the same table as InnoDB or MyISAM.
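
A minimal sketch of that setup, assuming replication between the two servers is already configured and using a hypothetical observations table:

    -- On the ingest server (replication master): BLACKHOLE discards rows
    -- locally but still writes every insert to the binary log.
    CREATE TABLE observations (
        station_id SMALLINT UNSIGNED NOT NULL,
        ts         INT UNSIGNED NOT NULL,
        f1         DOUBLE NOT NULL,
        f2         DOUBLE NOT NULL
    ) ENGINE=BLACKHOLE;

    -- On the replica: the same definition with a real engine, so the
    -- replicated rows are actually stored (and indexed) here.
    CREATE TABLE observations (
        station_id SMALLINT UNSIGNED NOT NULL,
        ts         INT UNSIGNED NOT NULL,
        f1         DOUBLE NOT NULL,
        f2         DOUBLE NOT NULL,
        KEY idx_station_time (station_id, ts)
    ) ENGINE=MyISAM;

With this arrangement, an insert on the ingest server costs little more than a binary-log write, so it never blocks on index maintenance.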

Since you are not doing UPDATEs, I think MyISAM would suit this scenario better than InnoDB. With MyISAM you can use the MERGE table type (not available for InnoDB). I am not sure what your data set looks like, but you could have one table per day (or hour, or week), with a MERGE table as a superset of those tables. Assuming you want to delete old data by the day, just redefine the MERGE table to exclude the old tables; that operation is instant. Dropping the old tables is also very fast.
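
A rough sketch of that scheme, with hypothetical table and column names:

    -- One MyISAM table per day, all with identical structure:
    CREATE TABLE obs_20120101 (
        seq        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        station_id SMALLINT UNSIGNED NOT NULL,
        ts         INT UNSIGNED NOT NULL,
        f1         DOUBLE NOT NULL,
        f2         DOUBLE NOT NULL,
        KEY (seq),
        KEY (station_id, ts)
    ) ENGINE=MyISAM;
    CREATE TABLE obs_20120102 LIKE obs_20120101;

    -- The MERGE table is the superset; inserts go to the last listed table:
    CREATE TABLE obs_all (
        seq        BIGINT UNSIGNED NOT NULL,
        station_id SMALLINT UNSIGNED NOT NULL,
        ts         INT UNSIGNED NOT NULL,
        f1         DOUBLE NOT NULL,
        f2         DOUBLE NOT NULL,
        KEY (seq),
        KEY (station_id, ts)
    ) ENGINE=MERGE UNION=(obs_20120101, obs_20120102) INSERT_METHOD=LAST;

    -- Expiring a day is instant: redefine the union, then drop the table.
    ALTER TABLE obs_all UNION=(obs_20120102);
    DROP TABLE obs_20120101;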

To check for new data, you can query the current day's table directly rather than go through the MERGE table.
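
For example, using the hypothetical seq column from the sketch above as the last-record identifier:

    -- Poll only today's table; @last_seen is the highest seq value
    -- already delivered to the consuming applications.
    SELECT * FROM obs_20120102 WHERE seq > @last_seen ORDER BY seq;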


I would recommend the TokuDB storage engine for MySQL. It is free for up to 50 GB of user data, and the pricing model beyond that is not scary either, which makes it a great choice for storing large amounts of data.

Its insertion speed is higher than InnoDB's or MyISAM's, and the advantage grows as the data set grows (InnoDB tends to degrade once the working data set no longer fits in RAM, leaving its performance at the mercy of the disk I/O subsystem).

It is also ACID-compliant and supports multiple clustered indexes (which would be a great fit for the massive DELETEs you are planning). Hot schema changes are supported as well: ALTER TABLE does not lock the table, and changes on large tables are quick; gigabyte-sized tables can be altered in a matter of seconds.
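
For example, a clustering secondary index matching the query pattern from the question might look like this (a sketch using TokuDB's CLUSTERING extension to MySQL syntax; the table and column names are hypothetical):

    CREATE TABLE observations (
        seq        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        station_id SMALLINT UNSIGNED NOT NULL,
        ts         INT UNSIGNED NOT NULL,
        f1         DOUBLE NOT NULL,
        f2         DOUBLE NOT NULL,
        -- A clustering index stores a full copy of each row in index order,
        -- so "WHERE station_id = ? AND ts BETWEEN ? AND ?" is one range scan.
        CLUSTERING KEY idx_station_time (station_id, ts)
    ) ENGINE=TokuDB;

    -- A hot schema change (non-blocking in TokuDB):
    ALTER TABLE observations ADD COLUMN f13 DOUBLE NULL;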

In my personal use I have seen roughly 5-10x lower disk usage thanks to TokuDB's compression, and it is much, much faster than MyISAM or InnoDB. It may sound like I am advertising this product, but I am not; it is simply that good, since it lets you keep a monolithic datastore without expensive scaling schemes such as splitting the data across nodes to scale writes.


Source: https://habr.com/ru/post/1387710/

