Search Jackrabbit for federated nodes

Question

I have tagged objects in a Jackrabbit repository (actually Adobe / Day CQ's CRX, but I think this part is Jackrabbit code):

asset: tags = A, B
- child asset data 1: tags = A, C, E
- child asset data 2: tags = D, E

I want to query against the union of the parent's tags and the tags of one child, i.e. "BC" would match the asset because those tags are present across the parent and child 1, but "CD" would not match because no combination of the parent and a single child has both: C and D are split across separate child data nodes.
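To pin down the intended semantics, here is a minimal in-memory sketch in plain Java. The class and method names are hypothetical and nothing here touches the JCR API; it also assumes that a parent whose own tags cover the query should match, which is what the LEFT OUTER JOIN further down implies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ParentChildTagMatch {

    /** Convenience helper for building a tag set. */
    static Set<String> tags(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /**
     * True if the parent's tags alone, or the parent's tags combined with the
     * tags of ONE single child, contain every queried tag.
     */
    static boolean matches(Set<String> parentTags,
                           List<Set<String>> childTagSets,
                           Set<String> query) {
        if (parentTags.containsAll(query)) {
            return true; // assumption: a parent-only match counts
        }
        for (Set<String> childTags : childTagSets) {
            Set<String> union = new HashSet<>(parentTags);
            union.addAll(childTags);
            if (union.containsAll(query)) {
                return true; // this one child completes the parent's tags
            }
        }
        return false; // tags are only covered by mixing several children
    }

    public static void main(String[] args) {
        Set<String> parent = tags("A", "B");
        List<Set<String>> children = Arrays.asList(tags("A", "C", "E"), tags("D", "E"));
        System.out.println(matches(parent, children, tags("B", "C"))); // true
        System.out.println(matches(parent, children, tags("C", "D"))); // false
    }
}
```

With the example data, "BC" matches via parent + child 1, while "CD" fails because C and D never occur together in the parent plus any single child.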

Is there a way to do this in Jackrabbit? We can write an XPath query

//element(*, dam:Asset)[(@tags = 'C' or */@tags = 'C') and (@tags = 'D' or */@tags = 'D')]

but this won't work because XPath doesn't seem to guarantee that the two * child steps refer to the same child, i.e. it means "any child has C" and "any child has D", so it will match my asset because at least one child has C and at least one child has D. Instead, I could use JCR-SQL2

 SELECT * FROM [dam:Asset] AS asset LEFT OUTER JOIN [nt:unstructured] AS child ON ISCHILDNODE(child, asset) WHERE (asset.tags = 'C' OR child.tags = 'C') AND (asset.tags = 'D' OR child.tags = 'D')

but there is no SELECT DISTINCT in JCR-SQL2: if instead I search for "BE", I will get this asset twice because it matches both asset + child1 and asset + child2.

I could post-process either query's results in Java, i.e. filter out the false-positive matches in the first case or the duplicate results in the second, but I'm nervous about how this would affect paging performance: I'd need to scan more nodes than necessary to weed out the bad ones, and I'd need to scan the whole lot to work out the correct result size for paging. This should be cheaper in the second (SQL2) case because, if my query is ordered, I can identify duplicates from the node path alone and all duplicates will be consecutive, so I can find a page's worth of data with a cheap scan, without reading the whole node for each result. But I don't know how much it costs to scan all of the results to count pages, even in the simple path-only case.
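The ordered-scan idea can be sketched as follows. This is only an illustration under assumptions: the input stands in for the asset paths of the joined rows, already ordered by path, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class OrderedDedupPager {

    /**
     * Collects one page of unique paths from rows ordered by path.
     * Duplicates are consecutive, so comparing with the previous path is enough.
     * offset and limit count unique assets, not raw rows.
     */
    static List<String> page(Iterator<String> orderedPaths, int offset, int limit) {
        List<String> page = new ArrayList<>();
        String previous = null;
        int uniqueSeen = 0;
        // Stop as soon as the page is full: later rows are never read.
        while (page.size() < limit && orderedPaths.hasNext()) {
            String path = orderedPaths.next();
            if (path.equals(previous)) {
                continue; // same asset matched again through another child
            }
            previous = path;
            if (uniqueSeen >= offset) {
                page.add(path);
            }
            uniqueSeen++;
        }
        return page;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("/a", "/a", "/b", "/c", "/c", "/d");
        System.out.println(page(rows.iterator(), 1, 2)); // [/b, /c]
    }
}
```

The scan stops once the page is full, but it still has to skip over every row before the offset, and an exact total count of unique assets would still need a pass over all rows.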

Another option we've considered is denormalizing the tags into a single node. To keep the search accurate, this would mean creating a new comb_tags property on each child node and running all search queries against the child nodes only. However, this still suffers from the distinct issue if we match two child nodes under the same asset.

Thanks for any suggestions. This is already a large instance, and it will need to scale. I've seen other questions saying that ModeShape is a JCR implementation that does have SELECT DISTINCT, but I think switching to ModeShape just for this would have to be a last resort, if it's even possible to host CQ on ModeShape.

One idea we came up with is to compute each union of the asset's tags and one child's tags, combine the tags into a single string, and then write each value into a multi-valued property on the asset, i.e. asset + child1 = "ABCE" and asset + child2 = "ABDE", so we get

asset: tags = A, B; tagUnions = "ABCE", "ABDE"

As long as we define a fixed order for combining tags into a string (e.g. alphabetical), we can search for any combination using tagUnions LIKE '%B%C%' (except that I'd use proper delimiters between tags in the real case). Although this would work as far as we can see, I don't really like it: there are potentially a large number of tags for each asset + child, all with longer names than single letters, so we would end up with long strings and LIKE queries run against all of them, which probably can't be indexed efficiently.
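A minimal sketch of how the tagUnions values and the matching LIKE pattern could be built with delimiters. The "|" delimiter, the class and the method names are assumptions, and the sketch assumes no tag contains "|", "%" or "_". Each tag is wrapped in its own pair of delimiters: with a single shared delimiter, a pattern such as %|B|%|C|% would not match adjacent tags because "|B|" consumes the delimiter that "|C|" needs. The like method is only an in-memory stand-in for the repository's LIKE, handling "%" alone.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.TreeSet;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TagUnions {
    // Hypothetical delimiter; assumes no tag contains '|', '%' or '_'.
    static final String SEP = "|";

    /** Sorted union of parent and child tags, each tag wrapped as |tag|. */
    static String union(Collection<String> parentTags, Collection<String> childTags) {
        TreeSet<String> sorted = new TreeSet<>(parentTags);
        sorted.addAll(childTags);
        StringBuilder sb = new StringBuilder();
        for (String tag : sorted) {
            sb.append(SEP).append(tag).append(SEP);
        }
        return sb.toString();
    }

    /** LIKE pattern for the queried tags, in the same sort order as union(). */
    static String likePattern(Collection<String> queryTags) {
        StringBuilder sb = new StringBuilder("%");
        for (String tag : new TreeSet<>(queryTags)) {
            sb.append(SEP).append(tag).append(SEP).append('%');
        }
        return sb.toString();
    }

    /** In-memory stand-in for LIKE that supports only the '%' wildcard. */
    static boolean like(String value, String pattern) {
        String regex = Arrays.stream(pattern.split("%", -1))
                .map(Pattern::quote)
                .collect(Collectors.joining(".*"));
        return value.matches(regex);
    }

    public static void main(String[] args) {
        String u1 = union(Arrays.asList("A", "B"), Arrays.asList("A", "C", "E"));
        String p = likePattern(Arrays.asList("C", "B"));
        System.out.println(u1 + " LIKE " + p + " -> " + like(u1, p));
    }
}
```

Because both sides are sorted the same way, the query tags may be given in any order. The concern about long strings and unindexed LIKE scans still applies.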

Another variation on this is to create a bitmask: define A = 1, B = 2, etc., store a multi-valued integer array, and then do bitwise comparisons. However, this is probably limited to 64 different tags, and since we have 1000+, I don't think we can do it - even if JCR supports bitwise operations, which I expect it doesn't.
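For illustration, the bitmask comparison would look like this in application code. The names are hypothetical and the tag-to-bit mapping is an assumption; the sketch makes no claim about what JCR queries can evaluate.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class TagBitmask {

    /** Builds a 64-bit mask from tags, given an assumed tag-to-bit-index mapping. */
    static long mask(Map<String, Integer> bitIndex, Collection<String> tags) {
        long m = 0L;
        for (String tag : tags) {
            Integer idx = bitIndex.get(tag);
            if (idx == null || idx < 0 || idx >= Long.SIZE) {
                // A single long holds at most 64 distinct tags.
                throw new IllegalArgumentException("No usable bit for tag: " + tag);
            }
            m |= 1L << idx;
        }
        return m;
    }

    /** True if every bit set in queryMask is also set in unionMask. */
    static boolean containsAll(long unionMask, long queryMask) {
        return (unionMask & queryMask) == queryMask;
    }

    public static void main(String[] args) {
        Map<String, Integer> bits = new HashMap<>();
        String[] names = {"A", "B", "C", "D", "E"};
        for (int i = 0; i < names.length; i++) {
            bits.put(names[i], i); // A = 1, B = 2, C = 4, ...
        }
        long assetPlusChild1 = mask(bits, Arrays.asList("A", "B", "C", "E"));
        System.out.println(containsAll(assetPlusChild1, mask(bits, Arrays.asList("B", "C")))); // true
        System.out.println(containsAll(assetPlusChild1, mask(bits, Arrays.asList("C", "D")))); // false
    }
}
```

The exception on bit index 64 and above is the limit mentioned above: one long cannot represent 1000+ tags.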

So, I'm still looking for a clean database solution for this. You've missed the bounty I put up, but there are still ticks, votes and thanks on offer for any help.

+6

jcr jackrabbit crx cq5

Rup Mar 26 '12 at 11:49

1 answer

MrGomez · Answer 1 · 2012-03-28T21:29:46+0000

From the Apache Jackrabbit mailing list:

Yes, unfortunately, federated queries are not supported. Any work on this area would be greatly appreciated.
Meanwhile, the best solution is probably to execute two separate queries and explicitly perform the union in the application code by combining the two sets of results.

So this is an option. Looking at the SQL that you provided:

but there is no SELECT DISTINCT in JCR-SQL2: if instead I search for "BE" I will get this asset twice because it matches asset + child1 and asset + child2.

I looked at the possible solutions supported by Jackrabbit and came up empty-handed. However, I agree with the solution presented here:

What I did was to make a simple SELECT with appropriate ORDER BYs ... then every time I used a row, I checked that it wasn't the same as the previous one :-)

(Sic preserved.)

While the ORDER BY is potentially questionable if you don't need database-side sorting, is there anything preventing you from building a hash set in your controller, using the JCR API, to restrict your results to unique values only?
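A minimal sketch of the hash-set approach, assuming the asset path of each result row has already been read (in JCR 2.0 that would typically come from iterating QueryResult.getRows() and calling Row.getPath with the selector name). The class and method names are hypothetical and the code itself is plain Java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class UniqueAssetPaths {

    /**
     * Returns each asset path once, in first-seen order.
     * Unlike comparing with the previous row, this does not need an ORDER BY,
     * but it keeps every unique path in memory.
     */
    static List<String> unique(Iterable<String> assetPaths) {
        Set<String> seen = new LinkedHashSet<>();
        for (String path : assetPaths) {
            seen.add(path); // add() ignores paths that are already present
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Unordered rows: the same asset appears once per matching child.
        List<String> rows = Arrays.asList("/content/dam/x", "/content/dam/y", "/content/dam/x");
        System.out.println(unique(rows)); // [/content/dam/x, /content/dam/y]
    }
}
```

The same set can also merge the results of two separate queries, as the mailing list reply suggests, by adding the paths from both result sets.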
