Find the most commonly used sets of tags

Imagine a graph database consisting of URLs and the tags used to describe them. From this, we want to find which sets of tags are most often used together and determine which URLs belong to each identified set.

I created a small dataset in Cypher that simplifies the problem:

CREATE
  (tech:Tag { name: "tech" }),
  (comp:Tag { name: "computers" }),
  (programming:Tag { name: "programming" }),
  (cat:Tag { name: "cats" }),
  (mice:Tag { name: "mice" }),
  (u1:Url { name: "http://u1.com" })-[:IS_ABOUT]->(tech),
  (u1)-[:IS_ABOUT]->(comp),
  (u1)-[:IS_ABOUT]->(mice),
  (u2:Url { name: "http://u2.com" })-[:IS_ABOUT]->(mice),
  (u2)-[:IS_ABOUT]->(cat),
  (u3:Url { name: "http://u3.com" })-[:IS_ABOUT]->(tech),
  (u3)-[:IS_ABOUT]->(programming),
  (u4:Url { name: "http://u4.com" })-[:IS_ABOUT]->(tech),
  (u4)-[:IS_ABOUT]->(mice),
  (u4)-[:IS_ABOUT]->(acc:Tag { name: "accessories" })

Using this as a reference (an example in the neo4j console is here), we can look at the graph and see that the most used tags are tech and mice, each referring to 3 URLs (the query for this is trivial). The most frequently used pair of tags is [tech, mice], since in this example it is the only pair shared by two URLs (u4 and u1). Note that this pair is a subset of each matching URL's tags, not the complete tag set of either one. No combination of 3 tags is shared by two or more URLs.
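
For reference, the single-tag count mentioned above could be sketched roughly as follows, assuming only the Url and Tag labels and the IS_ABOUT relationship from the dataset:

// Count how many URLs each tag describes, most used first
MATCH (u:Url)-[:IS_ABOUT]->(t:Tag)
RETURN t.name AS tag, count(DISTINCT u) AS urlCount
ORDER BY urlCount DESC

On the sample data this should put tech and mice at the top with 3 URLs each.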

How can I write a Cypher query to determine which tag combinations are most often used together (either in pairs or in groups of size N)? Is there a better way to structure this data that would make the analysis easier? Or is this problem not well suited to a graph database? I have spent some time trying to figure this out, so any help or thoughts would be appreciated!

2 answers

Sounds like a combinatorics problem.

// The tags for each URL, sorted by ID
MATCH (U:Url)-[:IS_ABOUT]->(T:Tag)
WITH U, T ORDER BY id(T)
WITH U, 
     collect(distinct T) as TAGS 

// Calculate the number of tag combinations for a URL,
// independent of the order of the tags.
// 2^n is computed here via logarithm and exponent;
// round() guards against floating-point rounding error
//
WITH U, TAGS, 
     toInteger(round(exp(log(2) * size(TAGS)))) as numberOfCombinations

// Iterate through all non-empty combinations (bitmasks 1 .. 2^n - 1)
UNWIND range(1, numberOfCombinations - 1) as combinationIndex
WITH U, TAGS, combinationIndex

// Check for each tag whether it is present in the combination.
// Cypher has no bitwise operators, so we use APOC:
// https://neo4j-contrib.imtqy.com/neo4j-apoc-procedures/#_bitwise_operations
// (newer APOC releases provide apoc.bitwise.op as a function,
// so the CALL ... YIELD form may need adapting)
//
UNWIND range(0, size(TAGS)-1) as tagIndex
WITH U, TAGS, combinationIndex, tagIndex, 
     toInteger(round(exp(log(2) * tagIndex))) as pw2
CALL apoc.bitwise.op(combinationIndex, "&", pw2) YIELD value
WITH U, TAGS, combinationIndex, tagIndex, value
WHERE value > 0

// Get all combinations of tags for URL
WITH U, TAGS, combinationIndex, 
     collect(TAGS[tagIndex]) as combination

// Return all the possible combinations of tags, sorted by frequency of use
RETURN combination, count(combination) as freq, collect(U) as urls 
       ORDER BY freq DESC
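
For the special case of pairs (N = 2), a plain self-join may be enough and needs no APOC. This is a sketch of a different technique from the bitmask enumeration above, assuming tag names are unique so that comparing them lists each pair only once:

// Every pair of tags attached to the same URL, each pair counted once
MATCH (t1:Tag)<-[:IS_ABOUT]-(u:Url)-[:IS_ABOUT]->(t2:Tag)
WHERE t1.name < t2.name
RETURN [t1.name, t2.name] AS pair,
       count(DISTINCT u) AS freq,
       collect(DISTINCT u.name) AS urls
ORDER BY freq DESC

On the sample data this should rank [mice, tech] first, shared by http://u1.com and http://u4.com. The same idea extends to triples by adding a third tag to the pattern, but the pattern grows with N, which is where the general enumeration becomes useful.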

I think it is best to compute and store the tag combinations with this algorithm at tagging time. The query would then look something like this:

MATCH (Comb:TagsCombination)<-[:IS_ABOUT]-(U:Url)
WITH Comb, collect(U) as urls, count(U) as freq
MATCH (Comb)-[:CONTAIN]->(T:Tag)
RETURN Comb, collect(T) as Tags, urls, freq ORDER BY freq DESC
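
Storing one such combination at tagging time might look roughly like this. It is only a sketch: the TagsCombination label and CONTAIN relationship come from the query above, while the key property and the $url, $key and $tagNames parameters are hypothetical names chosen for illustration (for example, $key could be the sorted tag names joined with commas):

// Hypothetical parameters: $url, $key, $tagNames (list of tag names)
MATCH (u:Url { name: $url })
// "key" is an assumed property that identifies a combination uniquely
MERGE (c:TagsCombination { key: $key })
MERGE (u)-[:IS_ABOUT]->(c)
WITH c
UNWIND $tagNames AS tagName
MATCH (t:Tag { name: tagName })
MERGE (c)-[:CONTAIN]->(t)

Using MERGE rather than CREATE keeps the statement safe to repeat for every URL that shares the same combination.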

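For each URL, collect its tags ordered by tag.name (so that identical sets compare equal). Then, for each distinct tag set, count the URLs that carry all of its tags.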

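// Collect the tag names of each URL in a stable (alphabetical) order
MATCH (u:Url)-[:IS_ABOUT]->(t:Tag)
WITH u, t
ORDER BY t.name
WITH u, collect(t.name) AS tags
// Keep each distinct tag set once
WITH DISTINCT tags
// Count the URLs that carry every tag in the set
MATCH (u:Url)
WHERE all(tagName IN tags WHERE (u)-[:IS_ABOUT]->(:Tag { name: tagName }))
RETURN tags, count(u) AS freq
ORDER BY freq DESC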

Source: https://habr.com/ru/post/1654740/

