Color filter implementation

Using the Bloom filter, we will receive optimization of space. The cassandra frame also has a Bloom Filter implementation. But in detail, how is this optimization of space achieved?

+4
source share
5 answers

The flowering filter is not a "frame". This is more like an algorithm. The implementation is not very long.

Here in Java I tried (.jar, source code and JavaDoc are all available):

"Standalone Java implementations of hash and cuckoo flower filters" (you may want to google this if the following link no longer works):

http://lmonson.com/blog/?page_id=99

+3
source

You can understand how this saves space using this example: Suppose I work for Google on the Chrome team and want to add a function to the browser that notifies the user if the URL he entered is a malicious URL. So I have a dataset of about 1 million malicious URLs that is about 25 MB in size. Since the size is quite large (large compared to the size of the browser itself), I store this data on a remote server.

Case 1: I use a hash function with a hash table. I solve an efficient hash function and run all 1 million URLs through a hash function to get the hash keys. Then I create a hash table (array) where the hash key will give me an index to place this url. So, now that I have hashed and populated the hash table, I check its size. I saved all 1 million URLs in a hash table along with the keys. Therefore, the size is at least 25 MB. Due to its size, this hash table will be stored on the remote server. When the user comes in and enters the URL in the address bar, I need to check if he is malicious. Thus, I run the URL through a hash function (the browser itself can do this), and I get the hash key for that URL. Now I have to make a request to my remote server using this hash key to check if the specific URL in my hash table with this particular key is the same as the one entered by the user. If so, then it is malicious, and if not, then it is not malicious. Thus, each time a user enters a URL, a request must be sent to the remote server to check if it is a malicious URL. This will take a lot of time and thus make my browser slow.

Case 2: I use a flowering filter. The entire list of 1 million URLs is launched through a flowering filter using several hash functions, and the corresponding positions are marked as 1 in a huge array of 0. Suppose we want a false positive rate of 1% using a flowering filter calculator ( http://hur.st/bloomfilter?n=1000000&p=0.01 ), we get the size of the flowering filter requires only 1.13 MB. Such a small size is expected, although the size of the array is huge, we only save 1 or 0, not the URL, as in the case of a hash table. This array can be considered as a bit array. That is, since we have only two values, 1 and 0, we can set individual bits instead of bytes. This would reduce the space spent 8 times. This 1.13 MB color filter, due to its small size, can be stored in the web browser itself! Thus, when the user arrives and enters the URL, we simply apply the required hash functions (in the browser itself) and check all the positions in the color filter (which is stored in the browser). A value of 0 in any of the positions tells us that this URL is NOT SPECIFIED in the list of malicious URLs, and the user can act freely. Thus, we did not call the server and, therefore, saved time. A value of 1 tells us that the URL MAY be included in the list of malicious URLs. In these cases, we make a call to the remote server, and there we can use some other hash function with some hash table, as in the first case, to get and check if the URL is valid. Since in most cases the URL is unlikely to be malicious, the small bloom filter in the browser displays this and therefore saves time by avoiding calls to the remote server. Only in some cases, if the flower filter tells us that the URL can be malicious, only in those cases we make a call to the server. This "MIGHT" is eligible 99%.

Thus, using a small flower filter in the browser, we saved a lot of time, since we do not need to make server calls for each URL entered.

+8
source

So, I saw this question before, and I used the tip above, and it turned out to be a way to slow me down. So I wrote my own. This is not entirely general, but I'm sure that if someone desperately wants to work like me, they themselves will make it more general :)

I used the hash-murmur implementation, which you can download here: http://d3s.mff.cuni.cz/~holub/sw/javamurmurhash/

Code: package uk.ac.cam.cl.ss958.SpringBoardSimulation;

import ie.ucd.murmur.MurmurHash; import java.util.BitSet; import java.util.Random; public class FastBloomFilter { private final BitSet bs; final int [] hashSeeds; final int capacity; public FastBloomFilter(int slots, int hashFunctions) { bs = new BitSet(slots); Random r = new Random(System.currentTimeMillis()); hashSeeds = new int[hashFunctions]; for (int i=0; i<hashFunctions; ++i) { hashSeeds[i] = r.nextInt(); } capacity = slots; } public void add(int value) { byte [] b = new byte[] { (byte)(value >>> 24), (byte)(value >>> 16), (byte)(value >>> 8), (byte)value}; for (int i=0; i<hashSeeds.length; ++i) { int h = MurmurHash.hash32(b, 4, hashSeeds[i]); bs.set(Math.abs(h)%capacity, true); } } public void clear() { bs.clear(); } public boolean mightContain(int value) { byte [] b = new byte[] { (byte)(value >>> 24), (byte)(value >>> 16), (byte)(value >>> 8), (byte)value}; for (int i=0; i<hashSeeds.length; ++i) { int h = MurmurHash.hash32(b, 4, hashSeeds[i]); if(!bs.get(Math.abs(h)%capacity)) { return false; } return true; } public static void main(String [] args) { FastBloomFilter bf = new FastBloomFilter(1000, 10); System.out.println("Query for 2000: " + bf.mightContain(2000)); System.out.println("Adding 2000"); bf.add(2000); System.out.println("Query for 2000: " + bf.mightContain(2000)); } } 
+2
source

You can use the Bloom filter based on the Redis server with the Redisson lib. Based on 128-bit HighwayHash . Here is an example:

 RBloomFilter<SomeObject> bloomFilter = redisson.getBloomFilter("sample"); // initialize bloom filter once with // expectedInsertions = 55000000 // falseProbability = 0.03 bloomFilter.tryInit(55000000L, 0.03); bloomFilter.add(new SomeObject(someStateHere1)); bloomFilter.add(new SomeObject(someStateHere2)); // does it contain object? bloomFilter.contains(new SomeObject(someStateHere3)); 
+1
source

I wrote a short article on introducing a flowering filter using Java 8 features that I hope are relevant to the issue of saving space. I sent a bit further to discuss how to beat a slice of a flowering filter collection when some search engines do this, which is important for efficiency when you have many color filters.

0
source

Source: https://habr.com/ru/post/1333573/


All Articles