Multi-color filters and FM sketches

What is the difference between flowering filters and hash thumbnails (also FM thumbnails) and what is their use?

+4
source share
2 answers

Thumbnail Sketches / Flyoletin-Martin Sketches

Flajolet, P./Martin, G. (1985): Probabilistic Counting Algorithms for Database Applications, in: Journal of Computer and System Sciences, Vol. 31, No. 2 (September 1985), pp. 182-209.

Durand, M./ Flyajolet, P. (2003): Loglog High Power Counting, in: Springer LNCS 2832, ESA Algorithms 2003, pp. 605-617.

Sketch sketches are used to count the number of individual elements in a set.

Given:

  • bit array B [] of length l
  • a (unique) hash function h (), which maps to [0,1, ... 2 ^ l)
  • function r (), which gives the position of the least significant 1-bit in the binary representation of its input (for example, 000101 returns 1, 001000 returns 4)

insert x element:

  • pn: = h (x) returns a pseudo random number
  • apply r (pn) to set the position of the bitmap to 1 since the output h () is pseudo-random, each bit i is set to 1 ~ n / (2 ^ (i + 1)) times

the number of different elements in the set:

  • find the position p of the rightmost 0 in the bit array
  • p = log2 (n), solve for n to get the number of different elements in the set; the result can be up to 1.83 values.

using:

  • in the field of data mining, P2P / distributed applications, document frequency estimation, etc.

Flower filters

Bloom, H. (1970): Spatial / temporal trade-offs in hash coding with acceptable errors, in: ACM Messages, vol. 13, No. 7 (July 1970), pp. 422-426.

Bloom filters are used to check if an item is a member of a collection .

Given:

  • bit array B [] of length m
  • k different hash functions h_k (), which are mapped to [0, ..., m-1], that is, from one of the positions of the m-bit array

insert x element:

  • apply h_k to x (h_k (x)), for all k, i.e. you get the values โ€‹โ€‹of k
  • set the resulting bits in array B to 1 (if already set to 1, do not change anything)

check if y is in the set:

  • get p_k positions for verification using all hash functions h_k (h_k (y)), i.e. for each h_k function you get the position p_k
  • if one of the positions p_k is set to 0 in the array B, the element y is finally not included in the set
  • if all positions given by p_k are 1, then the element y can (!) be in the set
  • false positive rate is approximately equal to (1 - e ^ (- kn / m)) ^ k, there are no false negatives!
  • by increasing the number of hashing functions, the false positive rate can be reduced; however, at the same time, your flowering filter becomes slower; the optimal value of k is k = (m / n) ln (2)

using:

  • at the beginning it is used as a cheap filter in databases to filter items that do not match the query
  • various applications today, for example. on Google BigTable, but also online for searching the Internet, etc.
+4
source

The Bloom filter is a data structure used to search for membership, while FM Sketch is mainly used to count elements. These two data structures provide appropriate solutions that optimize the space needed to perform the search / calculation, and the tradeoff is the accuracy of the result.

0
source

Source: https://habr.com/ru/post/1444785/


All Articles