Assume the following Neo4j schema (simplified, but it shows the important point). There are two types of nodes: NODE and VERSION. VERSIONs are connected to NODEs via the VERSION_OF relationship. VERSION nodes have two properties, from and until, that denote the validity interval; either or both can be NULL (non-existent in Neo4j terms) to mean unlimited. NODEs can be connected to each other via the HAS_CHILD relationship. Again, these relationships have two properties, from and until, that denote the validity interval; either or both can be NULL (non-existent in Neo4j terms) to mean unlimited.
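To make the "missing bound means unlimited" rule concrete, here is a minimal sketch of how it could be written as a predicate. It is an illustration of the semantics described above, not the query we actually run, and it assumes that from and until hold values that compare correctly with the {date} parameter (for example epoch milliseconds or ISO-8601 strings).

```cypher
// Sketch: every VERSION that is valid on {date}.
// A missing from/until property is treated as an open bound.
MATCH (v:VERSION)-[:VERSION_OF]->(n:NODE)
WHERE (v.from IS NULL OR v.from <= {date})
  AND (v.until IS NULL OR {date} <= v.until)
RETURN n, v
```

The same pair of conditions would apply to the from and until properties of a HAS_CHILD relationship.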
EDIT: The validity dates on the VERSION nodes and on the HAS_CHILD relationships are independent (even though the example happens to show them aligned).
The example shows two NODEs, A and B. A has two VERSIONs: AV1, valid until 6/30/17, and AV2, valid from 7/1/17. B has only one version, BV1, which is unlimited. B is connected to A via a HAS_CHILD relationship until 6/30/17.
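For anyone who wants to reproduce the example, it could be set up roughly like this. This is a sketch under assumptions: the name property is made up for illustration, and the dates are written as ISO-8601 strings here, which is not necessarily how the real data stores them.

```cypher
// Hypothetical sample data for the A/B example.
// Assumption: dates are ISO-8601 strings; a missing property means "unlimited".
CREATE (a:NODE {name: 'A'}),
       (b:NODE {name: 'B'}),
       (av1:VERSION {title: 'AV1', until: '2017-06-30'}),
       (av2:VERSION {title: 'AV2', from: '2017-07-01'}),
       (bv1:VERSION {title: 'BV1'}),
       (av1)-[:VERSION_OF]->(a),
       (av2)-[:VERSION_OF]->(a),
       (bv1)-[:VERSION_OF]->(b),
       (b)-[:HAS_CHILD {until: '2017-06-30'}]->(a)
```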
The challenge now is to query the graph for all nodes that are not children (i.e., root nodes) at a specific point in time. In the example above, the query should return only B if the query date is, e.g., 6/1/17, but it should return both B and A if the query date is, e.g., 8/1/17 (because A is no longer a child of B as of 7/1/17).
The current query looks roughly like this:
    MATCH (n1:NODE)
    OPTIONAL MATCH (n1)<-[c]-(n2:NODE), (n2)<-[:VERSION_OF]-(nv2:ITEM_VERSION)
    WHERE (c.from <= {date} <= c.until)
      AND (nv2.from <= {date} <= nv2.until)
    WITH n1 WHERE c IS NULL
    MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
    WHERE nv1.from <= {date} <= nv1.until
    RETURN n1, nv1
    ORDER BY toLower(nv1.title) ASC
    SKIP 0 LIMIT 15
This query works reasonably well in general, but it starts to slow down on large data sets (comparable to production data). With 20-30k NODEs (and about twice as many VERSIONs), the (real) query takes about 500-700 ms on a small Docker container running on Mac OS X, which is acceptable. But with 1.5M NODEs (and about twice as many VERSIONs), the (real) query takes a little more than 1 minute on a bare-metal server (running nothing but Neo4j). This is not really acceptable.
Is there any way to optimize this query? Are there better ways to handle the versioning of NODEs (which I doubt is the performance problem here) or the validity of relationships? I know that relationship properties cannot be indexed, so there may be a better schema for handling the validity of these relationships.
Any help or even the slightest hint is appreciated.
EDIT after the reply from Michael Hunger:
The percentage of root nodes:
In the current sample dataset (1.5M nodes), the result set contains about 2 thousand rows. This is less than 1%.
ITEM_VERSION node in the first MATCH:
We use the ITEM_VERSION nv2 to restrict the result set to ITEM nodes that have no connection to other ITEM nodes at the given date. This means that either no relationship that is valid at the given date may exist, or the connected item must not have an ITEM_VERSION that is valid at the given date. I will try to illustrate this:
    // date 6/1/17

    // n1 returned because relationship not valid
    (nv1 ...)->(n1)-[X_HAS_CHILD ...6/30/17]->(n2)<-(nv2 ...)

    // n1 not returned because relationship and connected item n2 valid
    (nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...)

    // n1 returned because connected item n2 not valid even though relationship is valid
    (nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...6/30/17)
Using relationship types:
The problem is that the software has a user-defined schema, and ITEM nodes are connected by custom relationship types. Since a relationship cannot have multiple types/labels, the only common characteristic of these relationships is that their types all start with X_. This was left out of the simplified example here. Would a lookup with the predicate type(r) STARTS WITH 'X_' work here?
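To be concrete, this is the kind of predicate I have in mind. It is only a sketch of the syntax on the real labels; the date checks are left out for brevity, and whether it performs well is exactly the open question.

```cypher
// Sketch: restrict the untyped relationship to custom types prefixed with X_.
// ITEM is the label used in the real schema; c may be NULL for root candidates.
MATCH (n1:ITEM)
OPTIONAL MATCH (n1)<-[c]-(n2:ITEM)
WHERE type(c) STARTS WITH 'X_'
RETURN n1, count(c) AS incomingCustomLinks
```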