Interview task: how to develop a program that works with big data.

Yesterday I had an interview, and I was asked this question: there is a server and a database that stores large texts (each text can be > 10 MB). Any database query is expensive. You need to develop a program (server) that takes a text and says whether that text is in the database or not.

I proposed the following solution: compute a good hash of each text, then build a set of all the hashes. If the hash of the input text is not in the set, the text is not in the database; otherwise, go to the database, look the text up by its hash, and compare it in full. The disadvantage of this solution is that there can be a lot of texts in the database, so the set can get rather large =) Maybe you know how to solve this properly?
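
Roughly, something like this (a quick Python sketch; the `fetch_text_by_hash` helper is made up, just to show where the database lookup would go):

```python
import hashlib

def text_hash(text: str) -> str:
    # A strong hash keeps the chance of collisions negligible.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Built once, e.g. at server startup, from a single pass over the database.
known_hashes: set[str] = set()

def is_in_database(text: str, db) -> bool:
    h = text_hash(text)
    if h not in known_hashes:
        return False                   # definitely not stored
    # Hash matched: confirm against the database to rule out a collision.
    stored = db.fetch_text_by_hash(h)  # hypothetical DB helper
    return stored is not None and stored == text
```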

+4
1 answer

As noted in the comments, the basic idea is to reduce the number of database queries by filtering out query texts that cannot possibly be in the database.

As you suggest, hashing is a great way to do this: if you use a good hash function (i.e. one with a low chance of collision), you will (almost certainly) never get false positives, so you will essentially never need to compare the actual texts.

An alternative to plain hashing (if memory becomes a problem) is to use a Bloom filter: "a space-efficient probabilistic data structure that is used to test whether an element is a member of a set." This approach uses much less memory than plain hashing, at the cost of some false positives (i.e., sometimes you need to compare the text against the database to make sure you really have a match). Bloom filters let you tune the trade-off between memory usage and the false-positive rate. They are used in NoSQL databases such as BigTable and Apache Cassandra.
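
To make the idea concrete, here is a minimal Bloom filter sketch in Python (the bit-array size and the number of hash functions are arbitrary placeholders, not tuned values):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 10_000_000, num_hashes: int = 7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, text: str):
        # Derive several bit positions from one digest by slicing it.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, text: str) -> None:
        for pos in self._positions(text):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, text: str) -> bool:
        # False -> definitely absent; True -> possibly present (check the DB).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(text))
```

On each request, the server only queries the database when `might_contain` returns True; a False answer means the text is definitely not stored.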

Update: as David Schwartz rightly points out, the choice of hash function matters, especially when a significant share of the queried texts really is contained in the database; see our discussion in the comments below. The same concern applies to Bloom filters, which are ideal if you only need to restrict disk access to a small subset of queries (for example), but not if you need to eliminate disk/database access entirely, because of the false positives. I originally mentioned MurmurHash3 along with a speed test, but since it is not cryptographic, it is not suitable for this question. When I actually implemented approaches like this in the past, I used SHA-1, which is cryptographic.
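
For reference, switching between hash functions is a one-line change with Python's standard hashlib (SHA-1 shown because that is what I used; SHA-256 is the more usual choice today):

```python
import hashlib

payload = "the full text of the document".encode("utf-8")

# Both are cryptographic; unlike MurmurHash3, an attacker cannot
# easily craft a different text that produces the same digest.
print(hashlib.sha1(payload).hexdigest())    # 40 hex characters
print(hashlib.sha256(payload).hexdigest())  # 64 hex characters
```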

+3

Source: https://habr.com/ru/post/1448034/

