Is Hadoop a good candidate for use as a key-value store?

Question

Would Hadoop be a good candidate for the following use case:

  • Simple key-value store (only GET and SET by key are needed; see the interface sketch after this list)
  • Very small entries (32-byte key-value pairs)
  • Heavy deletes
  • Heavy writes
  • On the order of 100 million to 1 billion key-value pairs
  • Most of the data should live on SSDs (solid-state drives) rather than in RAM.
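
For concreteness, the entire access pattern we need fits a tiny interface like the following (a hypothetical sketch, not tied to any particular store):

    // Hypothetical interface capturing the only operations we need.
    // Keys and values are tiny byte strings (a pair totals ~32 bytes).
    public interface TinyKeyValueStore {
        byte[] get(byte[] key);             // null if the key is absent
        void set(byte[] key, byte[] value); // writes are very frequent
        void delete(byte[] key);            // deletes are very frequent
    }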

More details

I ask because I keep seeing references to the Hadoop file system and to Hadoop being used as the foundation for many other database implementations that are not necessarily designed for MapReduce.

We currently store this data in Redis. Redis works great, but since it keeps all of its data in RAM, we have to use expensive machines with 128 GB of memory. It would be nice to use a system that relies on SSDs instead; that would give us the freedom to build much larger hash tables.
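
To give a rough sense of the scale, here is my own back-of-envelope arithmetic, assuming each pair carries about 32 bytes of payload and guessing roughly 80 bytes of per-entry overhead for an in-memory store; both numbers are assumptions, not measurements:

    // Back-of-envelope sizing for 1 billion ~32-byte key-value pairs.
    public class SizingEstimate {
        public static void main(String[] args) {
            long pairs = 1_000_000_000L;       // upper end of our estimate
            long payload = pairs * 32L;        // raw key + value bytes
            System.out.printf("raw payload:   ~%d GB%n", payload >> 30);
            // In-memory stores add per-entry overhead (pointers, hash buckets,
            // allocator slack); 80 bytes is a guess, not a measured figure.
            long withOverhead = pairs * (32L + 80L);
            System.out.printf("with overhead: ~%d GB%n", withOverhead >> 30);
        }
    }

That is roughly why the 128 GB machines are needed today, and why an SSD-backed store looks attractive.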

We have also stored this data in Cassandra, but Cassandra tends to break down when deletes become too heavy.

2 answers

Hadoop, contrary to popular media opinion, is not a database. What you are describing is a database, so Hadoop is not a good candidate for you. Also, the rest of this answer is opinionated, so feel free to prove me wrong with benchmarks.

If you mean the "NoSQL" databases built on top of Hadoop:

  • HBase is well suited to heavy writes, but handles massive deletes poorly.
  • Cassandra: the same story, though it is not as fast at writing as HBase.
  • Accumulo may be useful for very frequent updates, but will also struggle with deletes.

None of them makes "real" use of SSDs; I don't think any of them gets a huge speedup from them.

They all suffer from expensive compactions once you start fragmenting your tablets (in BigTable speak), so deletes are a fairly obvious limiting factor.

What you can do to work around the deletion problem is to simply overwrite with a constant "deleted" value, which sidesteps compaction. However, your table then only grows, which can be expensive on SSDs. You will also need to filter out those values on reads, which can affect read latency.
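
A minimal sketch of that workaround, written against the hypothetical TinyKeyValueStore interface from the question rather than any real client API:

    import java.util.Arrays;

    // "Overwrite instead of delete": a delete becomes a write of a fixed
    // sentinel value, so no tombstones are created and compaction is avoided,
    // but rows never disappear and every read must filter out the sentinel.
    public final class SentinelDeleteStore {
        private static final byte[] DELETED = new byte[0]; // empty value as sentinel

        private final TinyKeyValueStore backend;

        public SentinelDeleteStore(TinyKeyValueStore backend) {
            this.backend = backend;
        }

        public void set(byte[] key, byte[] value) {
            backend.set(key, value);
        }

        // "Delete" by overwriting; the table only grows as a result.
        public void delete(byte[] key) {
            backend.set(key, DELETED);
        }

        // Reads filter the sentinel, adding a small cost to every lookup.
        public byte[] get(byte[] key) {
            byte[] value = backend.get(key);
            if (value == null || Arrays.equals(value, DELETED)) {
                return null;
            }
            return value;
        }
    }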

From what you describe, Amazon's DynamoDB architecture sounds like the best candidate here. Although deletes are expensive there as well, they may not be nearly as costly as in the alternatives above.

BTW: the recommended way to delete many rows from the tables in any of the databases above is to simply drop the table entirely. If you can fit your design into this paradigm, any of them will do.
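
If your deletes follow time, for example data that expires after N days, one way to fit that paradigm is to bucket keys into per-day tables and drop whole tables as they age out. A hypothetical sketch; createTable and dropTable stand in for whatever admin API the chosen database actually provides:

    import java.time.LocalDate;
    import java.util.ArrayDeque;
    import java.util.Deque;

    // Time-bucketed tables: bulk deletion becomes "drop the oldest table".
    public final class RotatingTables {
        private final Deque<String> liveTables = new ArrayDeque<>();
        private final int maxTables;

        public RotatingTables(int maxTables) {
            this.maxTables = maxTables;
        }

        // Call once per day (or per whatever period your deletes follow).
        public void rotate(LocalDate today) {
            String newest = "kv_" + today;              // e.g. kv_2024-01-31
            createTable(newest);
            liveTables.addFirst(newest);
            while (liveTables.size() > maxTables) {
                dropTable(liveTables.removeLast());     // the whole "delete"
            }
        }

        private void createTable(String name) { /* hypothetical admin call */ }
        private void dropTable(String name)   { /* hypothetical admin call */ }
    }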


Although this is not a direct answer to your question, it is relevant in the context of what you said:

It would be nice to use a system that relies on SSDs instead; that would give us the freedom to build much larger hash tables.

You might consider Project Voldemort. In particular, as a Cassandra user, I know what you mean when you say it is the compaction and the tombstones that are the problem; I have run into TombstoneOverwhelmingException myself several times and hit dead ends.

Maybe you should take a look at this LinkedIn article. It says:

Memcached is all in memory, so you need to fit all of your data into memory to be able to serve it (which can be an expensive proposition if the generated data set is large).

And finally

all we do is mmap the entire data set into the process address space and access it there. This provides the lowest overhead caching possible, and makes use of the very efficient lookup structures in the operating system.
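
As an illustration of that idea (my own sketch, not Voldemort's actual code), here is what mmap-based lookup can look like in Java: the file holds fixed-size records sorted by key, and the operating system's page cache does all the caching:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // mmap-based read-only lookup over sorted fixed-size records
    // (16-byte key + 16-byte value). A single mapping is limited to 2 GB,
    // so a real store would map the file in chunks.
    public final class MmapLookup {
        private static final int KEY_LEN = 16, VAL_LEN = 16, REC_LEN = KEY_LEN + VAL_LEN;

        private final MappedByteBuffer data;
        private final long records;

        public MmapLookup(Path file) throws IOException {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                this.records = ch.size() / REC_LEN;
                this.data = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            }
        }

        // Binary search over the sorted records; returns the value or null.
        public byte[] get(byte[] key) {
            long lo = 0, hi = records - 1;
            byte[] candidate = new byte[KEY_LEN];
            while (lo <= hi) {
                long mid = (lo + hi) >>> 1;
                data.position((int) (mid * REC_LEN));
                data.get(candidate);
                int cmp = compareKeys(candidate, key);
                if (cmp == 0) {
                    byte[] value = new byte[VAL_LEN];
                    data.get(value);          // the value follows the key on disk
                    return value;
                } else if (cmp < 0) {
                    lo = mid + 1;
                } else {
                    hi = mid - 1;
                }
            }
            return null;
        }

        private static int compareKeys(byte[] a, byte[] b) {
            for (int i = 0; i < KEY_LEN; i++) {
                int d = (a[i] & 0xff) - (b[i] & 0xff);
                if (d != 0) return d;
            }
            return 0;
        }
    }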

I do not know whether this fits your use case, but you might consider evaluating Voldemort! Good luck.

