An algorithm for fast storage and lookup of sets and subsets

I need a way to store sets of arbitrary size for fast lookup later. I will need to query the resulting data structure for the stored sets that contain a given set, i.e., the supersets of the query.

=== Later edit: To clarify, the accepted answer to this question would be a reference to a study that proposes a solution to this problem. I do not expect people to develop the algorithm themselves. I had been looking at the tuple clustering algorithm found here, but it is not exactly what I want, because from what I understand it "clusters" the tuples into simpler, discrete/approximate forms and loses the original tuples.

Now, an even simpler example:

[alpha, beta, gamma, delta] [alpha, epsilon, delta] [gamma, nu, omega] [omega, beta]

Query:

[alpha, delta]

Result:

[alpha, beta, gamma, delta] [alpha, epsilon, delta]

So set elements are simply unique, opaque items. Forget about types and values; elements can only be compared with one another for equality, and that is all. I am looking for an established algorithm (one that probably has a name and a scientific paper written about it) rather than inventing one on the spot.
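To make the desired semantics concrete, here is a brute-force baseline in C++ (the function name findSupersets is mine, purely for illustration); the question is what data structure replaces this linear scan:

    #include <algorithm>
    #include <set>
    #include <string>
    #include <vector>

    // Brute-force baseline: scan every stored set and keep those containing
    // all elements of the query, i.e. the supersets of the query.
    std::vector<std::set<std::string>> findSupersets(
        const std::vector<std::set<std::string>>& database,
        const std::set<std::string>& query) {
        std::vector<std::set<std::string>> result;
        for (const auto& s : database)
            if (std::includes(s.begin(), s.end(), query.begin(), query.end()))
                result.push_back(s);
        return result;
    }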

== Initial examples:

For example, let's say the database contains these sets:

 [A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1] 

If I use [A1, C1] as a query, these two sets should be returned as a result:

 [A1, B1, C1, D1], [A1, D3, C1] 

Example 2:

Database:

 [Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red] [Distance to Berlin: 240km, car paint: blue, number of car seats: 2] [number of car seats: 2, Gasoline amount: 2L] 

Query:

 [Distance to Berlin: 240km] 

Result:

 [Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red] [Distance to Berlin: 240km, car paint: blue, number of car seats: 2] 

There can be an unlimited number of "fields" such as Gasoline amount. The solution probably involves some kind of grouping or indexing of the stored sets by the states they have in common (for example, Distance to Berlin: 240km), so that queries are as efficient as possible.

What algorithms exist for such needs?

I hope there is already an established solution to this problem, rather than rolling my own on the spot, which would not benefit from the scrutiny and refinement others have applied over time.

Clarifications:

  • In case it helps with answering: I intend to use the sets to store conditions. A simple example: [has milk, has no eggs, has sugar]
  • I suspect such a requirement may call for graphs or multidimensional arrays, but I'm not sure.

== Conclusion:

I implemented the two algorithms proposed in the answers, namely the Set-Trie and the inverted index, and did some rudimentary profiling of them. The query duration for a given set under each algorithm is illustrated below. Both algorithms ran on the same randomly generated dataset consisting of sets of integers. The two algorithms appear to be equivalent (or nearly so) in usefulness:

[Figure: query duration per query set for the Set-Trie and the inverted index]

+6
6 answers

I believe I can contribute a solution. One possible, quite efficient approach is the following:

The trie, invented by Franklin Mark Liang

This special kind of tree is used, for example, for spell checking and autocompletion, and it already comes close to your desired behavior, especially in allowing you to conveniently search for subsets.

The difference in your case is that you are not interested in the order of your attributes/features. For your case there is the Set-Trie, invented by Iztok Savnik.

What is a Set-Trie? It is a tree in which every node except the root holds one attribute value (a number) plus a marker (bool) indicating whether a data entry ends at that node. Each subtree contains only attributes whose values are greater than the attribute value of the parent node. The root of the Set-Trie is empty. A search key is a path from the root to a particular node of the tree. A search result is the set of paths from the root to all marker-carrying nodes that you reach while walking down the tree and along the search key at the same time (see below).

But first, a drawing from me:

[Figure: a simple Set-Trie]

The attributes are {1,2,3,4,5}; they could be anything, but we simply enumerate them and therefore get a natural order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}}, which in the picture appears as the set of paths from the root to the circles. The circles are the markers for the data entries in the image.

Note that the right subtree from the root does not contain attribute 1 at all. That is the point.

Search including subsets: Suppose you want to search for the attributes 4 and 1. First you order them, obtaining the search key {1,4}. Now, starting at the root, you walk the search key and the tree simultaneously. That means you take the first attribute of the key (1) and visit all child nodes whose attribute value is less than or equal to 1; there is only one, namely 1. Once that key attribute is matched, you take the next attribute of the key (4) and visit all child nodes whose attribute value is less than or equal to 4, which here is all of them. You continue until there is nothing left to do, and collect all the circles (data entries) whose path contains the last attribute of the key, 4, exactly. These are {1,2,4} and {1,4}, but not {1,3} (no 4) and not {2,4} (no 1).

Insertion is very simple: walk down the tree and store the data entry at the appropriate position. For example, a data entry {2,5} would be stored on the path below {2}.

Adding attributes dynamically: naturally, you can insert {1,4,6} right away. It will end up below {1,4}, of course.
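To illustrate, here is a condensed C++ sketch of the structure as I understand it (my own illustration, not Savnik's reference implementation; it assumes the attributes have already been mapped to integers and that input sets arrive sorted):

    #include <cstddef>
    #include <map>
    #include <memory>
    #include <vector>

    // A Set-Trie node: children are keyed by attribute value (and kept in
    // ascending order by std::map); isEntry marks the end of a stored set.
    struct SetTrieNode {
        bool isEntry = false;
        std::map<int, std::unique_ptr<SetTrieNode>> children;
    };

    // Insert a set, given as an ascending-sorted vector of attribute values.
    void insert(SetTrieNode& root, const std::vector<int>& sorted) {
        SetTrieNode* node = &root;
        for (int attr : sorted) {
            auto& child = node->children[attr];
            if (!child) child = std::make_unique<SetTrieNode>();
            node = child.get();
        }
        node->isEntry = true;
    }

    // Collect every stored set in this subtree into out.
    void collect(const SetTrieNode& node, std::vector<int>& path,
                 std::vector<std::vector<int>>& out) {
        if (node.isEntry) out.push_back(path);
        for (const auto& [attr, child] : node.children) {
            path.push_back(attr);
            collect(*child, path, out);
            path.pop_back();
        }
    }

    // Find all stored supersets of key[i..]: from each node we may descend
    // into children with value < key[i] (extra attributes not in the key)
    // or value == key[i] (matching the next key attribute).
    void findSupersets(const SetTrieNode& node, const std::vector<int>& key,
                       std::size_t i, std::vector<int>& path,
                       std::vector<std::vector<int>>& out) {
        if (i == key.size()) { collect(node, path, out); return; }
        for (const auto& [attr, child] : node.children) {
            if (attr > key[i]) break;  // children sorted: nothing further matches
            path.push_back(attr);
            findSupersets(*child, key, i + (attr == key[i] ? 1 : 0), path, out);
            path.pop_back();
        }
    }

With the data from the drawing inserted, findSupersets on the key {1,4} returns exactly {1,2,4} and {1,4}.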

I hope this gives you an idea of what I mean about Set-Tries. Savnik's paper explains it all in much more detail. They can probably be made very efficient.

I don't know whether you want to keep the data in a database. I think that would complicate things, and I don't know what would be best to do in that case.

+4

How about an inverted index built from hashes?

Suppose you have values of different types: int A, char B, bool C. Using std::hash (or any other hash function) you can compute numeric hash values size_t Ah, Bh, Ch.

Then you define a map that maps each hash value to a vector of pointers to the tuples containing it:

 std::map<size_t,std::vector<TupleStruct*> > mymap; 

or, if you can use global indexes, just

 std::map<size_t,std::vector<size_t> > mymap; 

To answer a query for X and Y you then:

  • compute the hash values Xh and Yh
  • look up the matching "sets" in mymap
  • intersect the sets mymap[Xh] and mymap[Yh] (a sketch follows below)
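A minimal sketch of the global-index variant (my illustration; it assumes the elements are strings, and note that distinct elements could in principle collide under the hash, so exact matches may need verification):

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <iterator>
    #include <map>
    #include <string>
    #include <vector>

    using Set = std::vector<std::string>;

    // Posting lists: element hash -> ascending indices of the sets that
    // contain the element.
    std::map<std::size_t, std::vector<std::size_t>>
    buildIndex(const std::vector<Set>& db) {
        std::map<std::size_t, std::vector<std::size_t>> index;
        for (std::size_t i = 0; i < db.size(); ++i)
            for (const auto& elem : db[i])
                index[std::hash<std::string>{}(elem)].push_back(i);
        return index;  // appended in increasing i, hence each list is sorted
    }

    // Intersect the posting lists of all query elements.
    std::vector<std::size_t> query(
        const std::map<std::size_t, std::vector<std::size_t>>& index,
        const Set& q) {
        std::vector<std::size_t> result;
        for (std::size_t k = 0; k < q.size(); ++k) {
            auto it = index.find(std::hash<std::string>{}(q[k]));
            if (it == index.end()) return {};  // unseen element: no matches
            if (k == 0) { result = it->second; continue; }
            std::vector<std::size_t> tmp;
            std::set_intersection(result.begin(), result.end(),
                                  it->second.begin(), it->second.end(),
                                  std::back_inserter(tmp));
            result = std::move(tmp);
        }
        return result;
    }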
+2

If I understand your needs correctly, you want a data structure that stores multi-state records and retrieves them by combinations of those states.

If the states are binary (as in your examples: has milk / has no milk, has sugar / has no sugar) or can be converted to binary (possibly by adding more states), then you have a lightning-fast tool for your purpose: bitmap indexes.

Bitmap indexes can perform such comparisons in memory, and literally nothing can compare with them in speed (ANDing bits together is what computers do fastest).

http://en.wikipedia.org/wiki/Bitmap_index
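As a toy illustration of the principle with std::bitset (a real bitmap index adds compression and more; the milk/eggs/sugar columns are just the example states from the question):

    #include <bitset>
    #include <cstddef>
    #include <iostream>

    constexpr std::size_t kMaxRecords = 8;     // toy capacity
    using Column = std::bitset<kMaxRecords>;   // one bit-column per binary state

    int main() {
        Column hasMilk, hasEggs, hasSugar;
        // Record 0: {milk, sugar}; record 1: {eggs}; record 2: all three.
        hasMilk.set(0);  hasSugar.set(0);
        hasEggs.set(1);
        hasMilk.set(2);  hasEggs.set(2);  hasSugar.set(2);

        // Query [has milk, has sugar]: a single bitwise AND over the columns.
        Column hit = hasMilk & hasSugar;
        for (std::size_t i = 0; i < kMaxRecords; ++i)
            if (hit.test(i))
                std::cout << "record " << i << " matches\n";  // prints 0 and 2
        return 0;
    }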

Here's a link to the original work on this simple but amazing data structure: http://www.sciencedirect.com/science/article/pii/0306457385901086

Almost all SQL databases support Bitmap Indexing, and there are several possible optimizations for it (compression, etc.):

MS SQL: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx

Oracle: http://www.orafaq.com/wiki/Bitmap_index

Edit: Apparently, the original research work on bitmap indexes is no longer available for free public access.
Links to recent literature on this subject:

+2

This problem is known in the literature as subset query. It is equivalent to the "partial match" problem (for example: find all words in a dictionary matching A?PL?, where ? is the "don't care" symbol).

One of the earliest results in this area is this paper by Ron Rivest from 1976 [1]. [2] is a later paper from 2002. Hopefully these are enough to get a more in-depth literature search going.

  • [1] Rivest, Ronald L. "Partial-match retrieval algorithms." SIAM Journal on Computing 5.1 (1976): 19-50.

  • [2] Charikar, Moses, Piotr Indyk, and Rina Panigrahy. "New algorithms for subset query, partial match, orthogonal range searching, and related problems." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 451-462.

+1

This seems like a job for a graph database. You create a node for each set and a node for each element in the sets, and then connect the nodes with Contains relationships. For instance:

[Figure: set nodes linked to element nodes A, B, C, D, E by Contains relationships]

Now you put all the elements A, B, C, D, E into an index/hash table so that you can find their nodes in the graph in constant time. Typical cost for a query [A, B, C] would be on the order of the degree of the smallest element node times the typical set size. For instance, to find [A, B, C], I see that the degree of A is one, so I only look at the one set containing A, namely S1, and then check that it also contains B and C; since the degree of S1 is 4, that is about 4 comparisons in total.
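Here is a plain in-memory C++ mock of that traversal (my illustration only; a real graph database such as Neo4j would store the Contains edges natively):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Element -> sets containing it: the Contains edges, stored both ways.
    struct SetGraph {
        std::vector<std::vector<std::string>> sets;  // set node -> its elements
        std::unordered_map<std::string, std::vector<std::size_t>> of;  // element -> sets

        void addSet(std::vector<std::string> elems) {
            std::size_t id = sets.size();
            for (const auto& e : elems) of[e].push_back(id);
            sets.push_back(std::move(elems));
        }

        // Start from the query element with the fewest incident sets (the
        // lowest degree), then verify each candidate holds the other elements.
        std::vector<std::size_t> query(const std::vector<std::string>& q) const {
            if (q.empty()) return {};
            const std::vector<std::size_t>* smallest = nullptr;
            for (const auto& e : q) {
                auto it = of.find(e);
                if (it == of.end()) return {};  // element never stored
                if (!smallest || it->second.size() < smallest->size())
                    smallest = &it->second;
            }
            std::vector<std::size_t> result;
            for (std::size_t id : *smallest) {
                const auto& s = sets[id];
                bool all = true;
                for (const auto& e : q)
                    if (std::find(s.begin(), s.end(), e) == s.end()) {
                        all = false;
                        break;
                    }
                if (all) result.push_back(id);
            }
            return result;
        }
    };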

A ready-made graph database such as Neo4j has a query language and will give good performance. I would expect that, as long as the typical node degrees in your database are small, its performance would far exceed that of the algorithms based on set representations.

+1

Hashing is usually an efficient technique for storing and retrieving multidimensional data. The problem here is that the number of attributes is variable and potentially very large, right? I looked around a bit and found Feature Hashing on Wikipedia. The idea is basically this:

  • Build a fixed-length hash vector from each data entry (a.k.a. feature vector)
  • The hash length should be much smaller than the number of available features; the length matters for performance.

The Wikipedia page gives a pseudo-code implementation (hash each feature contained in the entry, then increment the feature-hash vector at that index position, modulo the vector length, by one) and links to other implementations.
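A minimal C++ rendering of that pseudo-code (my sketch; the vector length of 32 is an arbitrary choice to experiment with):

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Hash a variable-length set of features into a fixed-length vector:
    // each feature increments one bucket, chosen by its hash modulo the length.
    std::vector<int> featureHash(const std::vector<std::string>& features,
                                 std::size_t length = 32) {
        std::vector<int> h(length, 0);
        for (const auto& f : features)
            ++h[std::hash<std::string>{}(f) % length];
        return h;
    }

As with any hashing scheme, equal vectors are only candidate matches and have to be verified against the original entries.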

Also here on SO there is a question about feature hashing which, among other things, links to the scientific article Feature Hashing for Large Scale Multitask Learning.

I can't give you a complete solution, but you didn't ask for one. I am fairly convinced this is a good approach. You will have to play with the hash length, as well as with different hash functions (Bloom filter is another keyword), to optimize the speed for your particular case. There may also be even more efficient approaches if, for example, search speed matters more than storage (balanced trees, perhaps?).

0

Source: https://habr.com/ru/post/970305/

