I need a way to store arbitrary size sets for quick query later. I will need to query the resulting data structure for subsets or sets that are already stored.
=== Later edit: To clarify, the accepted answer to this question would be a reference to a study that proposes a solution to this problem. I do not expect people to develop the algorithm themselves. I was looking at the tuple clustering algorithm found here , but this is not exactly what I want, because from what I understand, it is the "clusters" of tuples in simpler, discrete / approximate forms and loses the original tuples.
Now, an even simpler example:
[alpha, beta, gamma, delta] [alpha, epsilon, delta] [gamma, niu, omega] [omega, beta]
Query:
[alpha, delta]
Result:
[alpha, beta, gama, delta] [alpha, epsilon, delta]
Thus, set items are simply unique, unrelated items. Forget about types and values. Elements can be checked among them for equality and what they are. I am looking for an established algorithm (which probably has a title and a scientific article on it) more than just creating it now, in place.
== Initial examples:
For example, let's say that the database contains these sets
[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1]
If I use [A1, C1] as a query, these two sets should be returned as a result:
[A1, B1, C1, D1], [A1, D3, C1]
Example 2:
Database:
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red] [Distance to Berlin: 240km, car paint: blue, number of car seats: 2] [number of car seats: 2, Gasoline amount: 2L]
Query:
[Distance to berlin: 240km]
Result
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red] [Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
There can be an unlimited number of "fields" such as Gasoline amount . The solution is likely to be related to the grouping of the database and sets of links that have common states (for example, Gasoline amount: 240 ) so that the query is as efficient as possible.
What algorithms exist for such needs?
I hope that there is already an established solution to this problem, instead of just trying to find your own in place, which may not be as effective as checking and improving other people over time.
Explanations:
- If this helps answer the question, I intend to use them to store the conditions: A simple example: [Has milk, does not contain eggs, has sugar]
- I think that such a requirement may require graphics or multidimensional arrays, but I'm not sure.
Conclusion I implemented the two algorithms proposed in the answers, that is, the Set-Trie and Inverted Index, and did some rudimentary profiling on them. The duration of the request for a given set for each algorithm is illustrated below. Both algorithms worked on the same randomly generated dataset consisting of sets of integers. Algorithms seem equivalent (or almost) useful:
