Performance issues using Elasticsearch as a temporary window store

We use Elasticsearch almost like a cache, holding the documents seen within a time window. We constantly insert many documents of varying sizes, and then we search Elasticsearch using text queries combined with a date filter, so that the current stream does not receive documents it has already seen. Something like this:

"((word1 And word 2) OR (word3 And word4)) And insertedDate> 1389000"

We keep data in Elasticsearch for 30 minutes using the TTL feature. Today we have at least three machines, each inserting new documents via bulk requests every minute and continuously searching with queries like the one above.

We are having a lot of trouble indexing and retrieving these documents; the throughput of documents indexed and returned by ES is poor. We cannot even get 200 documents indexed per second.

We believe the problem lies in the concurrent searching, inserting, and TTL deleting. We do not need to keep old data in Elasticsearch; we just need a small window of documents indexed at any given time. What should we do to improve our throughput?

Thank you in advance

Machine type:

  • Mid-range Amazon EC2 instances (3.7 GB RAM)

EDIT

Sorry for the long delay in giving you some feedback. Things were hectic at our company, and I decided to wait for calmer times to give a more detailed account of how we solved our problem. We still need to run some benchmarks to measure the actual improvements, but the point is that we solved the problem :)

First of all, I believe the indexing performance issues were caused by a usage error on our side. As I said, we used Elasticsearch as a kind of cache to search for documents within a 30-minute time window. We looked for documents in Elasticsearch whose content matched some query and whose insertion date fell within a certain range; Elasticsearch would then return the full JSON document to us (which contains a lot of data besides the indexed content). Our configuration mistakenly indexed that JSON document field (besides the content and insertDate fields), which, in our opinion, was the main cause of the indexing performance issues.

However, we also made a number of changes, as suggested in the answers here, which we believe also improved performance:

  • Now we are no longer using the TTL feature; instead we use two “sliding indexes” under a common alias. When an index gets too old, we create a new one, switch the alias to it, and delete the old one (see the sketch just after this list).

  • Our application fires a huge number of queries per second, and we believe this was hurting indexing performance (since we use only one node for Elasticsearch). We used 10 shards on the node, so every query we fired at Elasticsearch was translated into 10 requests, one per shard. Since we can discard the data in Elasticsearch at any time (changing the number of shards is therefore not a problem for us), we simply changed the number of shards to 1, which significantly reduced the number of requests hitting our Elasticsearch node (the shard setting also appears in the sketch below).

  • We had 9 mappings in our index, and each query was run against one specific mapping. About 90% of the inserted documents went into two of those 9 mappings, so we created a separate rolling index for each of those two mappings and kept the remaining 7 in the same index.

  • This is not really a modification, but we installed SPM (Scalable Performance Monitoring) from Sematext, which allowed us to closely monitor Elasticsearch and track important metrics, such as the number of queries fired → sematext.com/spm/index.html
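
As an illustration, here is a minimal sketch of the rolling-index mechanics described above, using only the standard index and alias APIs; the index names, the docs alias, and the host are hypothetical:

    # Create the next rolling index with a single shard (names are made up;
    # replicas set to 0 because we assume a single node).
    curl -XPUT 'http://localhost:9200/docs-201309120130' -d '{
      "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
    }'

    # Atomically switch the write alias over to the new index.
    curl -XPOST 'http://localhost:9200/_aliases' -d '{
      "actions": [
        { "remove": { "index": "docs-201309120100", "alias": "docs" } },
        { "add":    { "index": "docs-201309120130", "alias": "docs" } }
      ]
    }'

    # Delete the expired index outright instead of relying on per-document TTL.
    curl -XDELETE 'http://localhost:9200/docs-201309120030'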

Our usage numbers are relatively small: about 100 documents per second to index, with peaks of 400 documents per second. As for searches, we run about 1,500 queries per minute (15,000 before changing the number of shards). Before these changes we were running into the performance issues described above, but not anymore.

+3
3 answers

Time-series indexes instead of TTL

You should use time-series indexes rather than the TTL feature. Given that you care about the last 30 minutes of documents, create a new index every 30 minutes using a date/time-based naming convention, e.g. docs-201309120000, docs-201309120030, docs-201309120100, docs-201309120130, etc. (Note the 30-minute increments in the naming convention.)

Using the Elasticsearch index aliases feature ( http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/ ), you can add a docs alias to the most recently created index, so that you always index against the alias docs while the documents actually land in, for example, docs-201309120130.

When querying, you must filter on the datetime field to ensure that only the last 30 minutes of documents are returned, and you will need to query the two most recently created indexes to be sure you cover a full 30 minutes of documents; you can create another alias here pointing at the two indexes, or simply query the two index names directly.
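
A minimal sketch of such a search alias, reusing the hypothetical names from the convention above: each time a new index is created, the oldest is removed from the alias and the newest added, so the alias always spans the two most recent indexes:

    # docs-search now covers docs-201309120100 and docs-201309120130; queries
    # must still apply the date filter, since the older index also contains
    # documents from before the 30-minute window.
    curl -XPOST 'http://localhost:9200/_aliases' -d '{
      "actions": [
        { "remove": { "index": "docs-201309120030", "alias": "docs-search" } },
        { "add":    { "index": "docs-201309120130", "alias": "docs-search" } }
      ]
    }'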

With this model you do not pay the overhead of TTL deletes, and you can simply drop the old, unused indexes once they are more than an hour old.

There are other ways to improve indexing and query speed, but I think removing TTL will be the biggest gain: your indexes will only ever hold a limited amount of data to filter/query, which should give a good speedup.

Server settings (memory, etc.)

Here are some settings that I usually tune on ES servers: http://pastebin.com/mNUGQCLY . Note that they are for a 1 GB VPS, so you will need to adjust them.

Node roles

The "vs" and "client" ES node types can help you - http://www.elasticsearch.org/guide/reference/modules/node/

Index Settings

When doing bulk inserts, consider changing the index.refresh_interval and index.merge.policy.merge_factor values. I see that you changed refresh_interval to 5s, but think about setting it to -1 before the bulk indexing operation and restoring the desired interval afterwards. Alternatively, just call the manual _refresh API after your bulk operation; since you only do bulk inserts every minute, this is a controlled environment in this case.

Raising index.merge.policy.merge_factor to a higher value reduces the amount of segment merging ES performs in the background; setting it back to the default after the bulk operation restores normal behavior. A value of 30 is usually recommended for bulk inserts, while the default is 10.
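
A sketch of that settings dance against the hypothetical docs alias; note that merge_factor only exists on older ES versions (it was removed when the tiered merge policy became the only option):

    # Before the bulk insert: disable refresh, reduce background merging.
    curl -XPUT 'http://localhost:9200/docs/_settings' -d '{
      "index": { "refresh_interval": "-1", "merge.policy.merge_factor": 30 }
    }'

    # ... run the bulk insert ...

    # After the bulk insert: refresh once manually, then restore the defaults.
    curl -XPOST 'http://localhost:9200/docs/_refresh'
    curl -XPUT 'http://localhost:9200/docs/_settings' -d '{
      "index": { "refresh_interval": "5s", "merge.policy.merge_factor": 10 }
    }'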

+11

Some other ways to improve Elasticsearch performance:

  • increase the index refresh interval. Going from 1 second to 10 or 30 seconds can make a significant difference in performance.
  • throttle merging if it is too aggressive. You can also reduce the number of concurrent merges by lowering index.merge.policy.max_merge_at_once and index.merge.policy.max_merge_at_once_explicit; lowering index.merge.scheduler.max_thread_count may help as well (see the sketch below).
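
A sketch of those knobs as a dynamic settings update; the values are illustrative, not recommendations, and depending on the ES version some of them may only be settable at index creation time:

    # Lower merge concurrency; the right values depend on your hardware.
    curl -XPUT 'http://localhost:9200/docs/_settings' -d '{
      "index": {
        "refresh_interval": "30s",
        "merge.policy.max_merge_at_once": 5,
        "merge.policy.max_merge_at_once_explicit": 5,
        "merge.scheduler.max_thread_count": 1
      }
    }'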

Good to see that you are using SPM. The URL in your EDIT was not hyperlinked; it is http://sematext.com/spm . The "Indexing" graphs there show how changing the merge-related parameters affects performance.

+1

I would run an additional ES instance and form a cluster with the current node, then split the work between the two machines: use one for indexing and the other for queries. See how that works for you; you may need to scale out further depending on your specific usage patterns.

0

Source: https://habr.com/ru/post/972676/