I have many text files totaling roughly 300-400 GB. They are all in this format:
key1 value_a
key1 value_b
key1 value_c
key2 value_d
key3 value_e
....
Each row consists of a key and a value. I want to create a database that allows me to query all values for a given key. For example, when I query key1, the returned values are value_a, value_b, and value_c.
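To make the expected lookup semantics concrete, here is a minimal in-memory sketch (the file name and parsing are just illustrative assumptions, not a proposed solution):

```python
from collections import defaultdict

# Build a multimap: one key maps to every value seen for it.
index = defaultdict(list)

# "input.txt" is a hypothetical file in the format shown above.
with open("input.txt") as f:
    for line in f:
        key, value = line.split()
        index[key].append(value)

# Querying key1 should return all of its values.
print(index["key1"])  # ['value_a', 'value_b', 'value_c']
```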
First of all, inserting all of these files into the database is a big problem. I am trying to insert blocks of several GB into a MySQL MyISAM table with the LOAD DATA INFILE syntax, but MySQL does not seem to use multiple cores for the insert, so it is painfully slow. I therefore suspect MySQL is not a good choice for this many records.
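For reference, this is roughly what my current import looks like (a sketch only; the kv table, column layout, credentials, and file path are assumptions, and it assumes the MySQL server can read the file directly):

```python
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="loader", password="secret", database="kvstore"
)
cur = conn.cursor()

# kv is a hypothetical MyISAM table:
#   CREATE TABLE kv (k VARCHAR(64), v VARCHAR(64)) ENGINE=MyISAM;
cur.execute(
    "LOAD DATA INFILE '/data/chunk_000.txt' "
    "INTO TABLE kv FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\\n' (k, v)"
)
conn.commit()
conn.close()
```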
In addition, I need to update or rebuild the database periodically (weekly, or even daily if possible), so insertion speed is important to me.
A single node cannot do the computation and insertion fast enough, so I believe it is better to run the insertion on several nodes in parallel.
For instance,
node1 -> compute and store 0-99999.txt
node2 -> compute and store 100000-199999.txt
node3 -> compute and store 200000-299999.txt
....
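A sketch of the partitioning I have in mind, using Python's multiprocessing on one machine just to illustrate it (the file naming and the load_file worker are hypothetical):

```python
from multiprocessing import Pool

def load_file(path):
    """Hypothetical worker: parse one chunk of files and bulk-insert it."""
    # e.g., run LOAD DATA INFILE for `path`, or write to whichever store is chosen
    print(f"loading {path}")

if __name__ == "__main__":
    # Each range of input files goes to its own worker/node.
    chunks = [f"{i}-{i + 99999}.txt" for i in range(0, 300000, 100000)]
    with Pool(processes=3) as pool:
        pool.map(load_file, chunks)
```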
So, here is the first criterion.
Criterion 1. Fast insertion speed in distributed batch mode.
Next, as you can see in the example text file, the same key needs to map to several different values, just as key1 maps to value_a / value_b / value_c above.
Criterion 2. Duplicate keys allowed (multiple values per key).
Then I will need to query keys in the database. No relational or complex join queries are required; all I need is a simple key/value lookup. The important part is retrieving all values stored under the same key quickly.
Criterion 3. Simple and fast key lookup.
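To illustrate the access pattern, this is the shape it would take if, say, Redis sets were used via redis-py (just an assumed example; I have not settled on Redis):

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Criterion 2: the same key can hold multiple values.
r.sadd("key1", "value_a", "value_b", "value_c")

# Criterion 3: a simple, fast lookup of everything under one key.
print(r.smembers("key1"))  # {'value_a', 'value_b', 'value_c'}
```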
I know there are HBase / Cassandra / MongoDB / Redis .... and so on, but I am not familiar with all of them and I don't know which one suits my needs. So the question is: which database should I use? If none of them meets my needs, I even plan to build my own, but that takes effort :/
Thanks.