How to reduce Elasticsearch scroll response time?

I have a query that returns ~200K hits from 7 different indexes distributed across our cluster. I process the results like this:

    while (true) {
        scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
                .setScroll(new TimeValue(600000))
                .execute().actionGet();
        for (SearchHit hit : scrollResp.getHits()) {
            // process hit
        }
        // Break condition: no hits are returned
        if (scrollResp.getHits().getHits().length == 0) {
            break;
        }
    }

I notice that the client.prepareSearchScroll line can hang for quite some time before returning the next set of search hits. This seems to get worse the longer the code runs.

My setup for searching:

    SearchRequestBuilder searchBuilder = client.prepareSearch(index_names)
            .setSearchType(SearchType.SCAN)
            .setScroll(new TimeValue(60000)) // TimeValue?
            .setQuery(qb)
            .setFrom(0) // ?
            .setSize(5000); // number of JSONs to get in each search; what should it be? I have no idea.
    SearchResponse scrollResp = searchBuilder.execute().actionGet();

Is it expected that scanning and scrolling will take a long time when retrieving many results? I am very new to Elasticsearch, so keep in mind that I may be missing something very obvious.

My query:

 QueryBuilder qb = QueryBuilders.boolQuery().must(QueryBuilders.termsQuery("tweet", interesting_words)); 
+4
4 answers

.setSize(5000) means that every call to client.prepareSearchScroll will retrieve 5000 records per shard. You are requesting the source back, and if your records are large, assembling 5000 records in memory may take some time. I would suggest trying a smaller number. Try 100 and 10 to see if you get better performance.

.setFrom(0) is not required.
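
To make the per-shard effect concrete, here is a minimal sketch of the arithmetic. The shard count is an assumption (7 indexes with 5 primary shards each, which was the default in older Elasticsearch versions); the real number depends on how the indexes were created. The class and method names are made up for illustration.

```java
// Sketch: upper bound on hits returned by one scroll call when the
// requested size applies per shard, as it does with SearchType.SCAN.
final class ScrollBatchMath {

    /** Maximum number of hits a single scroll round trip can return. */
    static long maxHitsPerCall(int sizePerShard, int shardCount) {
        return (long) sizePerShard * shardCount;
    }

    /** Scroll round trips needed to drain totalHits at that batch size (ceiling division). */
    static long roundTrips(long totalHits, long hitsPerCall) {
        return (totalHits + hitsPerCall - 1) / hitsPerCall;
    }

    public static void main(String[] args) {
        int shardCount = 7 * 5;     // ASSUMPTION: 7 indexes x 5 primary shards each
        long totalHits = 200_000L;  // roughly the result size from the question

        for (int size : new int[] {5000, 100, 10}) {
            long perCall = maxHitsPerCall(size, shardCount);
            System.out.printf("size=%d -> up to %d hits per call, about %d calls%n",
                    size, perCall, roundTrips(totalHits, perCall));
        }
    }
}
```

Under these assumptions, size 5000 lets a single call pull back most of the result set at once, which would go some way toward explaining the long pauses.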

+6

I am going to add another answer here because I was very puzzled by this behavior, and it took me a long time to find the answer in @AaronM's comment.

This applies to ES 1.7.2 using the Java API.

I was scanning/scrolling an index of 500 million records, with a query that returns about 400 thousand rows.

I started with a scroll size of 1000, which seemed like a reasonable trade-off between network and CPU.

The query ran terribly slowly, taking about 30 minutes to complete, with very long pauses between fetches from the cursor.

I was worried that it was simply the query I was running, and I did not believe that reducing the scroll size would help, since 1000 already seemed tiny.

However, after seeing AaronM's comment above, I tried a scroll size of 10.

The whole job completed in 30 seconds (and this held whether or not I restarted ES, so presumably it has nothing to do with caching). That is a speedup of about 60x!

So, if you have performance issues with scrolling / scanning, I highly recommend reducing the scroll size. I could not find much about it on the Internet, so I posted it here.

+1

You can read the documentation for SearchScrollRequestBuilder.

I think the TimeValue is how long to keep the scroll alive:

 setScroll(TimeValue keepAlive) 

If set, will enable scrolling of the search request for the specified timeout.

You can read more in the Scrolling section of the documentation.
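
Since the keep-alive is renewed on every scroll call, it only has to cover the time spent processing one batch, not the whole result set. A rough sketch of that estimate follows; the batch size, per-hit processing time and safety factor are illustrative assumptions, not measurements, and the class name is made up.

```java
import java.time.Duration;

// Sketch: estimate a scroll keep-alive from the time needed to process one batch.
final class ScrollKeepAlive {

    /** Time to process one batch, multiplied by a safety factor. */
    static Duration estimate(long hitsPerBatch, Duration perHit, int safetyFactor) {
        return perHit.multipliedBy(hitsPerBatch).multipliedBy(safetyFactor);
    }

    public static void main(String[] args) {
        // ASSUMPTIONS: 3500 hits per batch, 2 ms of processing per hit, 3x safety margin.
        Duration keepAlive = estimate(3_500, Duration.ofMillis(2), 3);
        // The millisecond value is what you would pass to new TimeValue(...).
        System.out.println("keep-alive: " + keepAlive.toMillis() + " ms");
    }
}
```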

0
  • Use the transport protocol rather than HTTP.
  • Send requests to a data node, not a client node or master node.
  • Select only the fields you need using the filter_path property.
  • Set the scroll size according to the size of your documents. There is no magic rule; you have to try a value, measure, and adjust.
  • Watch the network bandwidth.
  • If this is still not enough, parallelize the work with several threads:

Think of an Elasticsearch index as being composed of several shards. This design means that you can parallelize the work.

Say your index has 3 shards and your cluster has 3 nodes (it is good practice to have more nodes than shards per index).

You can run 3 Java "workers", each in its own thread and each scrolling a different shard and node, and use a queue to "centralize" the results.

This way you will have good performance!

This is what the elasticsearch-hadoop library does.

To get information about the shards and nodes of an index, use the search shards API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-shards.html
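
The worker-and-queue idea can be sketched with plain java.util.concurrent. This is only a skeleton: ShardReader is a hypothetical stand-in for "scroll one shard to the end", and routing each worker's search to a single shard and node is left out. It is not how elasticsearch-hadoop is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

final class ParallelShardDrain {

    /** HYPOTHETICAL: a real implementation would run a scroll loop limited to one shard. */
    interface ShardReader {
        List<String> readAll(int shardId);
    }

    /** Runs one worker per shard and collects every hit through a shared queue. */
    static List<String> drain(int shardCount, ShardReader reader) {
        BlockingQueue<Optional<String>> queue = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(shardCount);
        for (int shard = 0; shard < shardCount; shard++) {
            final int shardId = shard;
            pool.execute(() -> {
                try {
                    for (String hit : reader.readAll(shardId)) {
                        queue.add(Optional.of(hit));
                    }
                } finally {
                    // An empty Optional marks "this worker is done", even if it failed.
                    queue.add(Optional.empty());
                }
            });
        }
        pool.shutdown(); // no new tasks; running workers finish normally

        List<String> results = new ArrayList<>();
        int finishedWorkers = 0;
        try {
            while (finishedWorkers < shardCount) {
                Optional<String> item = queue.take();
                if (item.isPresent()) {
                    results.add(item.get());
                } else {
                    finishedWorkers++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while collecting hits", e);
        }
        return results;
    }

    public static void main(String[] args) {
        // Fake reader for demonstration: every shard yields three hits.
        ShardReader fake = shardId -> {
            List<String> hits = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                hits.add("shard" + shardId + "-hit" + i);
            }
            return hits;
        };
        System.out.println(drain(3, fake).size() + " hits collected");
    }
}
```

The consumer stops once it has seen one end marker per worker, so it does not need to know in advance how many hits each shard holds.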

0

Source: https://habr.com/ru/post/1447110/

