ArangoDB Search Performance

We are evaluating the performance of ArangoDB for faceted search. There are many other products that support this through a special API or query language:

  • MarkLogic Facets
  • Elasticsearch Aggregations
  • Solr Faceting, etc.

We understand that in Arango there is no special API for computing facets. But this is not really necessary: thanks to the comprehensive AQL, it is easy to achieve with a simple query, for example:

FOR a IN Asset
  COLLECT attr = a.attribute1 INTO g
  RETURN { value: attr, count: LENGTH(g) }

This query computes the facets of attribute1 along with their frequencies:

 [
   { "value": "test-attr1-1", "count": 2000000 },
   { "value": "test-attr1-2", "count": 2000000 },
   { "value": "test-attr1-3", "count": 3000000 }
 ]

This tells us that across my collection, attribute1 takes three distinct values (test-attr1-1, test-attr1-2 and test-attr1-3), with the corresponding counts. Essentially, we run a DISTINCT query combined with aggregate counts.
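Conceptually, the COLLECT above is just a group-by-count. A minimal illustration of the same semantics in plain Java 8 over in-memory data (class and method names are hypothetical; this is not the ArangoDB driver, just the idea):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FacetSemantics {
    // Group-by-count over an in-memory list, mirroring
    // COLLECT attr = a.attribute1 INTO g ... LENGTH(g)
    public static Map<String, Long> facet(List<String> values) {
        return values.stream()
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> attribute1 = Arrays.asList(
                "test-attr1-1", "test-attr1-2", "test-attr1-1", "test-attr1-3");
        // Yields counts {test-attr1-1=2, test-attr1-2=1, test-attr1-3=1}
        // (map iteration order may vary).
        System.out.println(facet(attribute1));
    }
}
```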

It looks simple and clean, with only one really big problem: performance.

The above query takes 31 seconds (!) on a test collection of only 8M documents. We experimented with different index types and storage engines (with and without RocksDB) and studied the explain plans, all to no avail. The documents used in this test are very short, with only three short attributes each.

We would be grateful for any input at this point. Either we are doing something wrong, or ArangoDB is simply not designed for this particular use case.

btw, the ultimate goal would be to run something like the following in sub-second time:

 LET docs = (
   FOR a IN Asset
     FILTER a.name LIKE 'test-asset-%'
     SORT a.name
     RETURN a
 )
 LET attribute1 = (
   FOR a IN docs
     COLLECT attr = a.attribute1 INTO g
     RETURN { value: attr, count: LENGTH(g[*]) }
 )
 LET attribute2 = (
   FOR a IN docs
     COLLECT attr = a.attribute2 INTO g
     RETURN { value: attr, count: LENGTH(g[*]) }
 )
 LET attribute3 = (
   FOR a IN docs
     COLLECT attr = a.attribute3 INTO g
     RETURN { value: attr, count: LENGTH(g[*]) }
 )
 LET attribute4 = (
   FOR a IN docs
     COLLECT attr = a.attribute4 INTO g
     RETURN { value: attr, count: LENGTH(g[*]) }
 )
 RETURN {
   counts: (RETURN {
     total: LENGTH(docs),
     offset: 2,
     to: 4,
     facets: {
       attribute1: { from: 0, to: 5, total: LENGTH(attribute1) },
       attribute2: { from: 5, to: 10, total: LENGTH(attribute2) },
       attribute3: { from: 0, to: 1000, total: LENGTH(attribute3) },
       attribute4: { from: 0, to: 1000, total: LENGTH(attribute4) }
     }
   }),
   items: (FOR a IN docs LIMIT 2, 4 RETURN { id: a._id, name: a.name }),
   facets: {
     attribute1: (FOR a IN attribute1 SORT a.count LIMIT 0, 5 RETURN a),
     attribute2: (FOR a IN attribute2 SORT a.value LIMIT 5, 10 RETURN a),
     attribute3: (FOR a IN attribute3 LIMIT 0, 1000 RETURN a),
     attribute4: (FOR a IN attribute4 SORT a.count, a.value LIMIT 0, 1000 RETURN a)
   }
 }

Thanks!

1 answer

It turns out the main discussion took place in the ArangoDB Google Group. Here is the link to the full discussion.

Here is a summary of the current solution:

  • Run a custom Arango build from a specific feature branch where several performance improvements have been made (hopefully these will land in the main release soon)
  • Facet calculations do not require indexes
  • MMFiles is the preferred storage engine
  • AQL should be rewritten to use "COLLECT attr = a.attributeX WITH COUNT INTO length" instead of "count: LENGTH(g)"
  • AQL should be split into smaller pieces and run in parallel (we use Java 8 Fork/Join to fan out the AQL queries and then join the partial results into the final answer)
  • One AQL query handles filtering/sorting and retrieves the main objects (adding a matching skiplist index for the sorted/filtered attributes where necessary)
  • The rest are small AQL queries, one per facet, each returning value/frequency pairs
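The fan-out/join step above can be sketched as follows. This is a minimal illustration, not our production code: the class name is hypothetical, and runFacetQuery is a stub standing in for an ArangoDB driver call that would execute one small AQL query per facet, e.g. "FOR a IN Asset COLLECT attr = a.attribute1 WITH COUNT INTO length RETURN { value: attr, count: length }".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;

public class ParallelFacets {
    // Stub standing in for the ArangoDB driver call that runs one small
    // facet query and returns value -> frequency pairs.
    static Map<String, Long> runFacetQuery(String attribute) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put(attribute + "-value-1", 42L);
        return counts;
    }

    // Submit one task per facet attribute to the common Fork/Join pool,
    // then join the partial results into one map keyed by attribute.
    public static Map<String, Map<String, Long>> computeFacets(List<String> attributes) {
        ForkJoinPool pool = ForkJoinPool.commonPool();
        Map<String, ForkJoinTask<Map<String, Long>>> tasks = new LinkedHashMap<>();
        for (String attr : attributes) {
            tasks.put(attr, pool.submit(() -> runFacetQuery(attr)));
        }
        Map<String, Map<String, Long>> result = new LinkedHashMap<>();
        for (Map.Entry<String, ForkJoinTask<Map<String, Long>>> e : tasks.entrySet()) {
            result.put(e.getKey(), e.getValue().join());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(computeFacets(Arrays.asList("attribute1", "attribute2")));
    }
}
```

The separate filter/sort query for the main objects would run alongside these tasks in the same way.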

As a result, we got a performance gain of more than 10x compared to the original AQL shown above.


Source: https://habr.com/ru/post/1271799/
