Neo4j Cypher query to search for nodes that are not connected too slowly

Question

Neo4j Cypher query to search for nodes that are not connected too slowly

Given that we have the following Neo4j scheme (simplified, but it shows an important point). There are two types of NODE and VERSION nodes. VERSION are associated with NODE using the VERSION_OF relationship. VERSION nodes have two properties from and until , which denote action time intervals - either both, or both can be NULL (non-existent in Neo4j terms) to mean unlimited. NODE can be connected through the HAS_CHILD connection. Again, these relationships have two from and until properties that denote action time intervals - either both, or both can be NULL (non-existent in Neo4j terms) to mean unlimited.

EDIT . The validity dates on the VERSION nodes and the HAS_CHILD relationship are independent (although the coincidence example shows their alignment).

The example shows two NODE A and B. A has two VERSION AV1s up to 6/30/17 and AV2 starting on 7/1/17, and B has only one version of BV1 , which is not limited. B connects to A using the HAS_CHILD relationship until 6/30/17.

The challenge now is to request a graph for all nodes that are not children (which are root nodes) at a particular point in time. In the above example, the query should return only B if the request date is, for example. 6/1/17, but it should return B and A if the request date is, for example. 8/1/17 (since A is no longer a child of B since 7/1/17).

The current query today looks something like this:

 MATCH (n1:NODE) OPTIONAL MATCH (n1)<-[c]-(n2:NODE), (n2)<-[:VERSION_OF]-(nv2:ITEM_VERSION) WHERE (c.from <= {date} <= c.until) AND (nv2.from <= {date} <= nv2.until) WITH n1 WHERE c IS NULL MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION) WHERE nv1.from <= {date} <= nv1.until RETURN n1, nv1 ORDER BY toLower(nv1.title) ASC SKIP 0 LIMIT 15

This query works relatively well overall, but it starts to slow down when used on large data sets (comparable to real data sets). With 20-30k NODE (and about twice as much VERSION s) (real), the request takes about 500-700ms on a small docker container running on Mac OS X), which is acceptable. But with 1.5M NODE (and about twice the number of VERSION s) (real) the request takes a little more than 1 minute on a white metal server (works only with Neo4j). This is not very acceptable.

Do we have the ability to customize this request? Are there more efficient ways to control version of NODE (which I doubt the performance problem is here) or is the relationship right? I know that the properties of relationships cannot be indexed, so a scheme for processing the validity of these relationships may be better.

Any help or even the slightest hint is appreciated.

EDIT after reply from Michael Hunger :

The percentage of root nodes:
In the current sample dataset (1.5M nodes), the result set contains about 2 thousand rows. This is less than 1%.

ITEM_VERSION node at the beginning of MATCH :

We use ITEM_VERSION nv2 to filter the result set into ITEM nodes that do not have communication with other ITEM nodes on a given date. This means that either no relationship must exist that is valid for a given date, or the related item must not have ITEM_VERSION that is valid for a given date. I am trying to illustrate this:

 // date 6/1/17 // n1 returned because relationship not valid (nv1 ...)->(n1)-[X_HAS_CHILD ...6/30/17]->(n2)<-(nv2 ...) // n1 not returned because relationship and connected item n2 valid (nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...) // n1 returned because connected item n2 not valid even though relationship is valid (nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...6/30/17)

Using relationship types:
The problem is that the software has a user schema, and the ITEM nodes are associated with custom relationship types. Since we cannot have multiple types / labels in relation to a relationship, the only common characteristic for these relationships is that they all start with X_ . This has been excluded from the simplified example here. Will there be a search with the predicate type(r) STARTS WITH 'X_' here?

+6

graph neo4j cypher

Stefan gehrig Dec 23 '16 at 11:23

source share

2 answers

Inversefalcon · Answer 1 · 2017-01-02T11:17:16+0000

I think a good start to improve would be to match on nodes using an index so that you can quickly get a smaller corresponding subset of nodes to search for. Your approach right now should every time check all your NODEs and all their relationships and patterns, which, as you discovered, will not scale with your data.

Currently, the only nodes in your graph with date / time parameters are your nodes: ITEM_VERSION, so let's start with them. You will need an index: ITEM_VERSION from and to properties for a quick search.

Zero values will be problematic for your queries, since any inequality against a null value returns null, and most workarounds for working with zeros (using COALESCE () or several AND / OR for null cases) seem to prevent the use of index search which is the point of my particular proposal.

I would recommend that you replace the null values with and until the min and max values are specified, which allows you to use the node search by index:

 MATCH (version:ITEM_VERSION) WHERE version.from <= {date} <= version.until MATCH (version)<-[:VERSION_OF]-(node:NODE) ...

This should at least provide quick access to a smaller subset of nodes at the beginning to continue your request.

Michael hunger · Answer 2 · 2017-01-17T21:57:14+0000

What version of Neo4j are you using.

What percentage of your 1.5M nodes will be found as roots in your approximate date, and if you have no limit, how much data is returned? Perhaps the problem is not in the match as much as in the sorting at the end?

I'm not sure why you had VERSION nodes in your first part, at least you don't describe them as appropriate for determining the root of a node.

You have not used relationship types.

 MATCH (n1:NODE) // matches 1.5M nodes // has to do 1.5M * degree optional matches OPTIONAL MATCH (n1)<-[c:HAS_CHILD]-(n2) WHERE (c.from <= {date} <= c.until) WITH n1 WHERE c IS NULL // how many root nodes are left? // # root nodes * version degree (1..2) MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION) WHERE nv1.from <= {date} <= nv1.until // has to sort all those WITH n1, nv1, toLower(nv1.title) as title RETURN n1, nv1 ORDER BY title ASC SKIP 0 LIMIT 15

Neo4j Cypher query to search for nodes that are not connected too slowly

More articles: