We are evaluating ArangoDB's performance for facet calculations. A number of other products can do the same, either through a dedicated API or through their query language:
- MarkLogic Facets
- Elasticsearch Aggregations
- Solr Faceting etc.
We understand that ArangoDB has no dedicated API for calculating facets explicitly. In practice none is needed: thanks to the comprehensive AQL, a facet can be computed with a simple query, for example:
    FOR a IN Asset
      COLLECT attr = a.attribute1 INTO g
      RETURN { value: attr, count: LENGTH(g) }
This query computes a facet on attribute1 and returns the frequency of each value in the form:
    [
      { "value": "test-attr1-1", "count": 2000000 },
      { "value": "test-attr1-2", "count": 2000000 },
      { "value": "test-attr1-3", "count": 3000000 }
    ]
This says that, across the entire collection, attribute1 took three distinct values (test-attr1-1, test-attr1-2 and test-attr1-3), with the corresponding counts. In effect, we run a DISTINCT query and aggregate the counts.
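To make the intended semantics concrete outside AQL, here is a minimal Python sketch of the same group-and-count step. The `facet_counts` helper and the sample documents are hypothetical and for illustration only; this is not how ArangoDB evaluates the query internally.

```python
from collections import Counter

def facet_counts(docs, attribute):
    """Count how often each distinct value of `attribute` occurs in `docs`."""
    # Documents missing the attribute are grouped under None
    counts = Counter(doc.get(attribute) for doc in docs)
    # Sort by value for a stable output; None (if present) sorts first
    ordered = sorted(counts.items(), key=lambda kv: (kv[0] is not None, str(kv[0])))
    return [{"value": value, "count": count} for value, count in ordered]

# Hypothetical sample documents standing in for the Asset collection
assets = [
    {"name": "test-asset-1", "attribute1": "test-attr1-1"},
    {"name": "test-asset-2", "attribute1": "test-attr1-2"},
    {"name": "test-asset-3", "attribute1": "test-attr1-1"},
]

print(facet_counts(assets, "attribute1"))
```

The result has one entry per distinct value, which is the shape we expect from the AQL query.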
It looks simple and clean, with only one, but really big, problem: performance.
The query above runs for 31 seconds (!) on a test collection with only 8M documents. We have experimented with different index types and storage engines (with and without RocksDB), and investigated the explain plans, all to no avail. The test documents are very small, with only three short attributes.
We would appreciate any input at this point. Either we are doing something wrong, or ArangoDB is simply not designed to perform well in this particular area.
By the way, the ultimate goal would be to run something like the following in under a second:
    LET docs = (
      FOR a IN Asset
        FILTER a.name LIKE 'test-asset-%'
        SORT a.name
        RETURN a
    )
    LET attribute1 = (
      FOR a IN docs
        COLLECT attr = a.attribute1 INTO g
        RETURN { value: attr, count: LENGTH(g[*]) }
    )
    LET attribute2 = (
      FOR a IN docs
        COLLECT attr = a.attribute2 INTO g
        RETURN { value: attr, count: LENGTH(g[*]) }
    )
    LET attribute3 = (
      FOR a IN docs
        COLLECT attr = a.attribute3 INTO g
        RETURN { value: attr, count: LENGTH(g[*]) }
    )
    LET attribute4 = (
      FOR a IN docs
        COLLECT attr = a.attribute4 INTO g
        RETURN { value: attr, count: LENGTH(g[*]) }
    )
    RETURN {
      counts: (
        RETURN {
          total: LENGTH(docs),
          offset: 2,
          to: 4,
          facets: {
            attribute1: { from: 0, to: 5, total: LENGTH(attribute1) },
            attribute2: { from: 5, to: 10, total: LENGTH(attribute2) },
            attribute3: { from: 0, to: 1000, total: LENGTH(attribute3) },
            attribute4: { from: 0, to: 1000, total: LENGTH(attribute4) }
          }
        }
      ),
      items: (FOR a IN docs LIMIT 2, 4 RETURN { id: a._id, name: a.name }),
      facets: {
        attribute1: (FOR a IN attribute1 SORT a.count LIMIT 0, 5 RETURN a),
        attribute2: (FOR a IN attribute2 SORT a.value LIMIT 5, 10 RETURN a),
        attribute3: (FOR a IN attribute3 LIMIT 0, 1000 RETURN a),
        attribute4: (FOR a IN attribute4 SORT a.count, a.value LIMIT 0, 1000 RETURN a)
      }
    }
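For readers less familiar with AQL, the overall shape of what this query is meant to return can be sketched in Python. This is a simplified, hypothetical illustration (the `faceted_search` helper and the sample documents are not part of our setup): it filters by name prefix, pages the sorted matches, and counts facet values over the matches, but it leaves out the per-facet paging and sorting.

```python
from collections import Counter

def faceted_search(docs, prefix, facet_attrs, offset, limit):
    """Filter by name prefix, page the matches, and count facet values."""
    # Equivalent of: FILTER a.name LIKE '<prefix>%' SORT a.name
    matches = sorted(
        (d for d in docs if d["name"].startswith(prefix)),
        key=lambda d: d["name"],
    )
    facets = {}
    for attr in facet_attrs:
        # Simplification: documents missing the attribute are skipped here
        counts = Counter(d[attr] for d in matches if attr in d)
        facets[attr] = [{"value": v, "count": c} for v, c in sorted(counts.items())]
    return {
        "total": len(matches),
        # Equivalent of: LIMIT offset, limit
        "items": [
            {"id": d["_id"], "name": d["name"]}
            for d in matches[offset:offset + limit]
        ],
        "facets": facets,
    }

# Hypothetical sample documents standing in for the Asset collection
docs = [
    {"_id": "Asset/1", "name": "test-asset-1", "attribute1": "a", "attribute2": "x"},
    {"_id": "Asset/2", "name": "test-asset-2", "attribute1": "b", "attribute2": "x"},
    {"_id": "Asset/3", "name": "other", "attribute1": "a", "attribute2": "y"},
    {"_id": "Asset/4", "name": "test-asset-3", "attribute1": "a", "attribute2": "y"},
]

result = faceted_search(docs, "test-asset-", ["attribute1", "attribute2"], offset=1, limit=2)
print(result)
```

The point is that the facets are computed over the filtered result set, not over the whole collection, while the items list is only one page of that set.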
Thanks!