Choosing a Database for Big Data

I have many text files totaling about 300 GB to 400 GB. They are all in this format:

 key1 value_a
 key1 value_b
 key1 value_c
 key2 value_d
 key3 value_e
 ....

Each row consists of a key and a value. I want to create a database that lets me query all the values of a key. For example, when I query key1, the values value_a, value_b, and value_c should be returned.

First of all, inserting all of these files into a database is a big problem. I tried inserting blocks of several GB into a MySQL MyISAM table with the LOAD DATA INFILE syntax, but MySQL does not seem to use multiple cores for the insert, and it is as slow as hell. So I think MySQL is not a good choice for this many records.

In addition, I need to update or recreate the database periodically, weekly or even daily if possible, so insertion speed is important to me.

A single node cannot perform the computation and insertion efficiently, so I believe it is better to perform the insertion on different nodes in parallel.

For instance,

 node1 -> compute and store 0-99999.txt
 node2 -> compute and store 100000-199999.txt
 node3 -> compute and store 200000-299999.txt
 ....

So, here is the first criterion.

Criterion 1. Fast insertion speed in distributed batch mode.

Then, as you can see in the example text file, the same key must be allowed to appear with several different values, just as key1 maps to value_a / value_b / value_c in this example.

Criterion 2. Duplicate keys (multiple values per key) allowed.

Then I will need to query keys in the database. No relational or complex join queries are required; all I need are simple key/value lookups. The important part is retrieving the multiple values stored under the same key.

Criterion 3. Simple and fast key lookup.

I know there are HBase / Cassandra / MongoDB / Redis .... and so on, but I am not familiar with all of them and I don't know which one suits me. So the question is: which database should I use? If none of them meets my needs, I would even plan to build my own, but that takes effort :/

Thanks.

+6
6 answers

There are probably many systems that would suit your needs. Your requirements make the problem easier to deal with in several ways:

  • Since you do not need any cross-key operations, you can shard the keys across several databases using a hash or a range. This is an easy way to work around the lack of parallelism that you observed with MySQL and would likely observe with many other database systems (see the sketch after this list).
  • Since you never do any online updates, you can simply build an immutable database in bulk and then query it for the rest of the day / week. I expect you would get much better performance this way.
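
A minimal sketch of the hash-sharding idea from the first bullet, written in C++ for illustration; the helper name and the shard counts are assumptions of this sketch, not something the answer prescribes:

 // Route each key to one of N independent databases (or table files) by
 // hashing it. Any stable hash works here, because no cross-key queries
 // (joins, cross-shard range scans) are ever needed.
 #include <cstddef>
 #include <functional>
 #include <string>

 std::size_t ShardFor(const std::string& key, std::size_t num_shards) {
   return std::hash<std::string>{}(key) % num_shards;
 }

 // Usage: with 16 shards, every writer node can bulk-load its own subset of
 // input files into its own shard(s) in parallel, and a query for "key1"
 // only has to consult shard ShardFor("key1", 16).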

I would be inclined to build a sharded set of LevelDB tables. That is, I would not use the actual leveldb::DB, which maintains a more complex data structure (a stack of tables plus a log) so that it can support online updates; instead, I would use the leveldb::Table and leveldb::TableBuilder objects directly (no log, just one table for a given set of keys). This is a very efficient format for queries, and if your input files are already sorted, as in your example, building the tables is also extremely efficient.

You can achieve the desired parallelism by increasing the number of shards: if you are using a single 16-core, 16-disk machine to build the database, use at least 16 shards, all generated in parallel. If you are using 16 machines with 16 cores and 16 disks each, use at least 256 shards. If you have far fewer disks than cores, as many machines do these days, experiment; you may find that fewer shards work better to avoid thrashing the disks. If you are careful, I think you can basically saturate disk throughput while building the tables, and that is saying a lot, since I would expect the tables to be noticeably smaller than your input files thanks to key prefix compression (and perhaps Snappy block compression).

You will mostly avoid seeks because, apart from the relatively small index that you can usually keep buffered in RAM, the keys in a LevelDB table are stored in the same order you read them from the input files, assuming your input files are already sorted. If they are not, you may want enough shards that you can sort each shard in RAM and then write it out, perhaps processing the shards more sequentially.
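
A minimal sketch of this approach using the public leveldb::TableBuilder / leveldb::Table headers; it is not the answerer's code. The shard file names, the example input file, and the '\0' + sequence-number suffix (added because the raw table builder requires strictly increasing keys, so duplicate keys must be disambiguated and lookups become prefix scans) are all assumptions of this sketch:

 #include <cstdint>
 #include <cstdio>
 #include <fstream>
 #include <string>

 #include "leveldb/env.h"
 #include "leveldb/iterator.h"
 #include "leveldb/options.h"
 #include "leveldb/table.h"
 #include "leveldb/table_builder.h"

 int main() {
   leveldb::Env* env = leveldb::Env::Default();
   leveldb::Options options;  // default comparator; block compression optional

   // Build phase: stream sorted "key value" lines into one immutable table file.
   leveldb::WritableFile* out = nullptr;
   if (!env->NewWritableFile("shard-000.ldb", &out).ok()) return 1;
   leveldb::TableBuilder builder(options, out);

   std::ifstream in("0-99999.txt");  // one pre-sorted input shard (hypothetical name)
   std::string key, value, last_key;
   unsigned long long seq = 0;
   while (in >> key >> value) {
     // TableBuilder requires strictly increasing keys, so duplicates get a
     // '\0'-separated sequence suffix; the query below scans by prefix instead.
     seq = (key == last_key) ? seq + 1 : 0;
     last_key = key;
     char suffix[24];
     std::snprintf(suffix, sizeof(suffix), "%012llu", seq);
     builder.Add(key + '\0' + suffix, value);
   }
   if (!builder.Finish().ok()) return 1;
   uint64_t table_size = builder.FileSize();
   out->Close();
   delete out;

   // Query phase: open the table and collect every value stored under "key1".
   leveldb::RandomAccessFile* in_file = nullptr;
   leveldb::Table* table = nullptr;
   if (!env->NewRandomAccessFile("shard-000.ldb", &in_file).ok()) return 1;
   if (!leveldb::Table::Open(options, in_file, table_size, &table).ok()) return 1;

   const std::string prefix = std::string("key1") + '\0';
   leveldb::Iterator* it = table->NewIterator(leveldb::ReadOptions());
   for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix); it->Next()) {
     std::printf("key1 -> %s\n", it->value().ToString().c_str());
   }
   delete it;
   delete table;
   delete in_file;
   return 0;
 }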

+3

I would advise you to use SSDB ( https://github.com/ideawu/ssdb ), a LevelDB-based server that is well suited to storing collections of data.

You can store the data in hash maps:

 ssdb->hset(key1, value1)
 ssdb->hset(key1, value2)
 ...
 list = ssdb->hscan(key1, 1000);
 // now list = [value1, value2, ...]

SSDB is fast (about half the speed of Redis, around 30,000 inserts per second). It is a network wrapper around LevelDB, with one-line installation and startup. Client libraries exist for PHP, C++, Python, Java, Lua, ...

+1

The traditional answer would be to use Oracle if you have a lot of money, or PostgreSQL if you do not. However, I would advise you to also look at solutions like MongoDB, which seems to me to be developing rapidly and also covers the scenario where your schema is not fixed and may change with your data.

0

Since you are already familiar with MySQL, I suggest trying all the MySQL options before migrating to a new system. Many big-data systems are tuned for very specific problems but do not perform well in areas that are taken for granted in an RDBMS. Moreover, most applications need regular RDBMS features alongside the big-data ones, so moving to a new system can create new problems.

Also consider the software ecosystem, community support, and knowledge base available for the system of your choice.

Coming back to the solution: how many rows will there be in the database? This is an important metric. I assume more than 100 million.

Try partitioning. It can help a lot. The fact that your selection criteria are simple and you do not need joins only improves the situation.

Postgres has a good way of handling partitions. It takes more code to get up and running, but gives amazing control. Unlike MySQL, Postgres has no hard limit on the number of partitions. Partitions in Postgres are regular tables, which gives you much more control over indexing, searching, backup, restore, parallel data access, and so on.
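
A hedged sketch of the partitioning suggestion. The answer's remark about extra code most likely refers to the older inheritance/trigger-based partitioning; the sketch below instead uses declarative hash partitioning (PostgreSQL 11+) driven from C++ via libpq, and the connection string, table name, and partition count are all illustrative assumptions:

 #include <cstdio>
 #include <string>

 #include <libpq-fe.h>

 // Run one DDL statement, printing the server error if it fails.
 static void exec_sql(PGconn* conn, const std::string& sql) {
   PGresult* res = PQexec(conn, sql.c_str());
   if (PQresultStatus(res) != PGRES_COMMAND_OK) {
     std::fprintf(stderr, "%s\n", PQerrorMessage(conn));
   }
   PQclear(res);
 }

 int main() {
   PGconn* conn = PQconnectdb("dbname=kvstore");  // hypothetical database
   if (PQstatus(conn) != CONNECTION_OK) { PQfinish(conn); return 1; }

   // Parent table partitioned by hash of the key; duplicate keys are fine.
   exec_sql(conn,
       "CREATE TABLE kv (key text NOT NULL, value text NOT NULL) "
       "PARTITION BY HASH (key)");

   // Create the hash partitions; parallel loaders can then COPY into kv
   // concurrently and rows are routed to their partitions automatically.
   for (int i = 0; i < 16; ++i) {
     exec_sql(conn,
         "CREATE TABLE kv_p" + std::to_string(i) +
         " PARTITION OF kv FOR VALUES WITH (MODULUS 16, REMAINDER " +
         std::to_string(i) + ")");
   }

   // An index on key (cascaded to every partition) keeps the simple lookup fast.
   exec_sql(conn, "CREATE INDEX ON kv (key)");

   PQfinish(conn);
   return 0;
 }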

0

Take a look at HBase. You can store multiple values per key using columns. Unlike an RDBMS, you do not need a fixed set of columns in each row; a row can have an arbitrary number of columns. Since you query data by key (the row key in HBase), you can retrieve all the values for a given key by reading all the columns of that row.

HBase also has the concept of a retention period (TTL), so you can decide how long your columns live, and data can be cleaned up automatically when needed. There are several interesting techniques people have built around retention periods.

HBase is quite scalable and supports very fast reads and writes.

0

InfoBright might be a good choice.

0

Source: https://habr.com/ru/post/912466/

