Elasticsearch, Couchbase, or something else?

Background: I have a huge data stream, up to 1,000,000 records per hour, with a TTL of 3 hours. Each "document" has about 20 properties, and I need to search on up to 15 of them simultaneously using "==", "IN" and "BETWEEN".

Since in most cases there are no unindexed properties, there is no reason to store each document twice (in Couchbase AND in the Elasticsearch index), so I think it's better to store it only in Elasticsearch. Am I right?

Or maybe someone can recommend a better database for this task? I need easy horizontal scaling in the future (custom MySQL sharding is not an option). This data is a kind of cache, so eventual consistency and low durability are fine.

In terms of the CAP theorem, I mostly need A and P (availability and partition tolerance).

2 answers

As for performance, with appropriately sized hardware you should have no problem indexing 1M documents per hour. I have run Elasticsearch at much higher rates without issue. Here is a detailed write-up that may be useful for benchmarking and sizing a large Elasticsearch cluster:

Set up ElasticSearch for a large cluster with heavy aggregations

For an ephemeral caching system with a TTL of just 3 hours, I agree that it would be wasteful to store the data in multiple stores. You could keep the data in Couchbase and replicate it to Elasticsearch in (near) real time, but why bother? It's not clear what benefit you would get from having the data in both places.

For performance questions about your specific use case, I would strongly recommend benchmarking. One strength of Elasticsearch (and Solr) that I have found is their (to me) surprisingly good performance when searching across multiple non-text fields. ES is usually thought of as a text-search engine (where it excels), but it is also a decent general-purpose database. In particular, I have found it performs well when filtering on several parameters at once, compared to some other NoSQL solutions.
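To make the question's requirements concrete, the "==", "IN" and "BETWEEN" conditions map naturally onto Elasticsearch's `term`, `terms` and `range` clauses inside a `bool` filter. A minimal sketch, with the field names (`status`, `region`, `price`) made up for illustration:

```python
# Sketch of a multi-field filter query body. Field names are hypothetical;
# the shape matches the Elasticsearch bool/filter Query DSL.

def build_filter_query(eq, isin, between):
    """Combine "==", "IN" and "BETWEEN" conditions into one bool filter."""
    filters = []
    for field, value in eq.items():          # "=="      -> term clause
        filters.append({"term": {field: value}})
    for field, values in isin.items():       # "IN"      -> terms clause
        filters.append({"terms": {field: values}})
    for field, (lo, hi) in between.items():  # "BETWEEN" -> range clause
        filters.append({"range": {field: {"gte": lo, "lte": hi}}})
    return {"query": {"bool": {"filter": filters}}}

query = build_filter_query(
    eq={"status": "active"},
    isin={"region": ["eu", "us"]},
    between={"price": (10, 100)},
)
```

The resulting dict is what you would send as the request body of a `_search` call; because everything is in the filter context, the clauses are cacheable and no relevance scoring is done.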

When benchmarking ES for this case, I would look at several different indexing strategies. ES supports a TTL on documents, so automatic cache expiry is straightforward:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-ttl-field.html

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html
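Per the linked docs (which describe the Elasticsearch 1.x-era `_ttl` mapping field; it was deprecated and removed in later versions), per-document expiry is enabled in the type mapping. A minimal sketch, where the type name `event` and the 3-hour default are assumptions matching the question:

```python
# Mapping fragment enabling the legacy _ttl field (ES 1.x era, as in the
# linked docs). Type name "event" and the 3h default are assumptions.
ttl_mapping = {
    "event": {
        "_ttl": {
            "enabled": True,   # turn on per-document expiry
            "default": "3h",   # matches the question's 3-hour TTL
        }
    }
}
```

A TTL could also be set per document at index time via the `ttl` request parameter, and expired documents were purged by a background process, which is exactly the delete pattern discussed next.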

However, you could also experiment with a separate index per hour. One thing about ES (because it uses Lucene to index and store data) is that it handles deletes differently than most databases: documents are only marked as deleted, not removed, and periodically the underlying files (called segments) are merged, at which point new segments are written without the deleted documents. For a use case with a high volume of deletes in a single index, this can cause a significant amount of extra disk activity. The way around it is to create a new index for each hour and then drop an entire index once its data is more than 3 hours old.
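The hour-per-index rotation described above could be sketched like this; the `cache-YYYYMMDD-HH` naming pattern and the helper functions are assumptions for illustration, not an Elasticsearch API:

```python
from datetime import datetime, timedelta

def hourly_index(ts):
    """Name of the hourly index a document with timestamp ts belongs to."""
    return ts.strftime("cache-%Y%m%d-%H")

def expired_indices(existing, now, keep_hours=3):
    """Indices whose hour is outside the retention window. Each can be
    removed with one cheap delete-index call, instead of per-document
    deletes that leave tombstones in Lucene segments until merge."""
    keep = {hourly_index(now - timedelta(hours=h)) for h in range(keep_hours)}
    return sorted(i for i in existing if i not in keep)

now = datetime(2014, 6, 1, 12, 30)
existing = [hourly_index(now - timedelta(hours=h)) for h in range(5)]
# hours 12..08 exist; only 12, 11 and 10 are within the 3-hour window
stale = expired_indices(existing, now)
```

Writers always index into `hourly_index(now)`, searches go against an alias or wildcard over the live indices, and a periodic job deletes whatever `expired_indices` returns.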

You may find this earlier discussion of TTLs and time-series indices in Elasticsearch useful: Performance issues using Elasticsearch as a time window repository

Finally, Elasticsearch's relatively easy horizontal scaling is a good fit here: you add a new node with the right cluster name, and ES takes care of everything else, automatically moving shards onto the new node. In your use case you could also play with the replica count, since more replicas spread across more nodes are an easy way to improve query performance.
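Joining a new node is mostly a matter of the cluster name. A minimal sketch of the relevant `elasticsearch.yml` lines (the cluster and node names are made-up examples):

```yaml
# elasticsearch.yml on the new node. With the default discovery in ES 1.x,
# a matching cluster.name is enough for the node to join the cluster, after
# which shards are rebalanced onto it automatically.
cluster.name: stream-cache
node.name: node-3
```

Raising the replica count afterwards is a live settings change via the update-settings API, e.g. a `PUT` to `/_settings` with a body of `{"index": {"number_of_replicas": 2}}`.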


For a cache-like use case, I think Elasticsearch will only give you problems down the road. I assume you do not really need full-text indexing at all, since you are not after those kinds of search features.

I have not used Couchbase myself, but I have heard good things about it. I have heard of setups that use Couchbase for filtering-style lookups and Elasticsearch for full-text relevance search (something Couchbase cannot do).

For scalability, as far as I can tell, both look similar at a very high level. Both offer easy configuration and replication, with shard rebalancing and promotion of a replica to primary when a node in the cluster goes down. The specifics may differ.

But honestly, you will have to try it yourself with production-like traffic. I have worked with Elasticsearch, and I know you cannot always tell up front whether it is the right choice for your use case: how it behaves for one application in production can differ, performance-wise, from how it behaves for another.

But I think you're on the right track.


Source: https://habr.com/ru/post/972673/
