Thumbnail Sketches / Flyoletin-Martin Sketches
Flajolet, P./Martin, G. (1985): Probabilistic Counting Algorithms for Database Applications, in: Journal of Computer and System Sciences, Vol. 31, No. 2 (September 1985), pp. 182-209.
Durand, M./ Flyajolet, P. (2003): Loglog High Power Counting, in: Springer LNCS 2832, ESA Algorithms 2003, pp. 605-617.
Sketch sketches are used to count the number of individual elements in a set.
Given:
- bit array B [] of length l
- a (unique) hash function h (), which maps to [0,1, ... 2 ^ l)
- function r (), which gives the position of the least significant 1-bit in the binary representation of its input (for example, 000101 returns 1, 001000 returns 4)
insert x element:
- pn: = h (x) returns a pseudo random number
- apply r (pn) to set the position of the bitmap to 1 since the output h () is pseudo-random, each bit i is set to 1 ~ n / (2 ^ (i + 1)) times
the number of different elements in the set:
- find the position p of the rightmost 0 in the bit array
- p = log2 (n), solve for n to get the number of different elements in the set; the result can be up to 1.83 values.
using:
- in the field of data mining, P2P / distributed applications, document frequency estimation, etc.
Flower filters
Bloom, H. (1970): Spatial / temporal trade-offs in hash coding with acceptable errors, in: ACM Messages, vol. 13, No. 7 (July 1970), pp. 422-426.
Bloom filters are used to check if an item is a member of a collection .
Given:
- bit array B [] of length m
- k different hash functions h_k (), which are mapped to [0, ..., m-1], that is, from one of the positions of the m-bit array
insert x element:
- apply h_k to x (h_k (x)), for all k, i.e. you get the values โโof k
- set the resulting bits in array B to 1 (if already set to 1, do not change anything)
check if y is in the set:
- get p_k positions for verification using all hash functions h_k (h_k (y)), i.e. for each h_k function you get the position p_k
- if one of the positions p_k is set to 0 in the array B, the element y is finally not included in the set
- if all positions given by p_k are 1, then the element y can (!) be in the set
- false positive rate is approximately equal to (1 - e ^ (- kn / m)) ^ k, there are no false negatives!
- by increasing the number of hashing functions, the false positive rate can be reduced; however, at the same time, your flowering filter becomes slower; the optimal value of k is k = (m / n) ln (2)
using:
- at the beginning it is used as a cheap filter in databases to filter items that do not match the query
- various applications today, for example. on Google BigTable, but also online for searching the Internet, etc.
source share