How do search engines work?

Consider the following search results:

OK. The pages are indexed, so the engine only needs to look up the result count and the first few entries in the index table; the speed is understandable.

Now consider the following search with the AND operation:

It baffles me :) How on earth do search engines get the results of AND operations so quickly, and on gigantic data sets? I can see the following two ways to accomplish the task, and both of them are terrible:

  • You run the search for "David", take the giant intermediate result table, and search it for "John". However, the intermediate table is not indexed by "John", so a brute-force scan is required. That simply will not finish in 0.25 seconds no matter what hardware you have.
  • You index by all possible word combinations, such as "David John". Then you hit a combinatorial explosion in the number of keys, and not even Google has the hardware to handle that.

And you can AND together as many search phrases as you want, and you still get the answer in under 0.5 seconds! How?

+4
4 answers

What Marcus wrote about Google processing the query on many machines in parallel is correct.

In addition, there is research in information retrieval that makes this job easier. The classic approach is to build an inverted index, which consists of posting lists: for each term, a list of all documents containing that term, in sorted order.
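
For illustration, here is a minimal sketch of building such an index in Python (the documents and the whitespace tokenization are invented; real indexers also do stemming, stop-word removal, compression, and so on):

    # A minimal sketch of an inverted index: term -> sorted posting list.
    from collections import defaultdict

    docs = {                           # hypothetical document collection
        0: "john meets david",
        1: "david writes code",
        2: "john sleeps",
    }

    index = defaultdict(list)
    for doc_id in sorted(docs):        # visiting docs in ID order keeps lists sorted
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)

    print(index["david"])              # -> [0, 1]
    print(index["john"])               # -> [0, 2]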

To answer a query with two terms, conceptually you take the posting lists for each of the two terms ("david" and "john") and walk through them, looking for documents that appear in both lists. If both lists are sorted the same way, this can be done in O(N). Of course, N is still huge, so it is done on hundreds of machines in parallel.
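
That merge step looks roughly like this (a minimal sketch, assuming both posting lists are already sorted by document ID):

    def intersect(a, b):
        """Walk two sorted posting lists in lockstep: O(len(a) + len(b))."""
        i = j = 0
        result = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                result.append(a[i])    # document appears in both lists
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return result

    print(intersect([0, 1], [0, 2]))   # docs containing both terms -> [0]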

On top of that, there may be additional tricks. For example, if the highest-ranked documents were placed first in the lists, the algorithm might decide it has found the 10 best results without traversing the lists in full, and then estimate the remaining number of results from the sizes of the two lists.
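
One hedged sketch of that idea, assuming document IDs are assigned in rank order (so both lists sort the same way) and using one naive extrapolation formula:

    def top_k_and_estimate(a, b, k=10):
        """Intersect rank-ordered posting lists, stopping after k matches,
        then extrapolate a total-match estimate from the hit rate so far."""
        i = j = 0
        hits = []
        while i < len(a) and j < len(b) and len(hits) < k:
            if a[i] == b[j]:
                hits.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        # naive extrapolation: matches seen per element of a, scaled to all of a
        estimate = round(len(hits) * len(a) / max(i, 1))
        return hits, estimate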

+2

I think you are approaching the problem from the wrong angle.

Google does not keep its tables/indexes on a single machine. Instead, it partitions its data heavily across its servers. Reports indicate that as many as 1000 physical machines are involved in every single query!

With that much processing power at hand, it is "simply" (used with extreme irony) a matter of making sure that each machine does its share of the work in a split second.
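
To make the idea concrete, here is a toy scatter-gather sketch: partition the documents across shards, let each shard answer the query against its own small index, then merge the partial results. The shard count, per-shard indexes, and query are all invented; real systems add replication, ranking, timeouts, and far more machines.

    SHARDS = 4

    # Per-shard inverted indexes (documents partitioned by doc_id % SHARDS).
    shard_index = [
        {"david": [0, 4], "john": [4]},
        {"david": [1], "john": [5]},
        {"david": [6], "john": [2, 6]},
        {"john": [3]},
    ]

    def search_shard(shard, terms):
        """Each shard intersects its own small posting lists independently."""
        lists = [set(shard_index[shard].get(t, ())) for t in terms]
        return set.intersection(*lists)

    def search(terms):
        # In production this fan-out hits many machines concurrently;
        # here we just loop over the shards and merge the partial answers.
        partials = (search_shard(s, terms) for s in range(SHARDS))
        return sorted(set().union(*partials))

    print(search(["david", "john"]))   # -> [4, 6]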

Reading about Google’s technology and infrastructure is very inspiring and highly educational. I recommend reading up on BigTable, MapReduce, and the Google File System.

Google has an archive of its publications available, with a lot of juicy information about its technologies. This MetaFilter thread also provides some insight into the sheer amount of hardware needed to run a search engine.

+1

I don’t know how Google does it, but I can tell you how I did it when a client needed something like this:

It starts with an inverted index, as Avi described. This is just a table with a row for each word in each document, containing the document ID, the word, and a score for the word’s relevance within that document. (Another approach is to index each occurrence of each word individually, along with its position, but that was not required in this case.)

From there, it is even simpler than Avi’s description: there is no need to do a separate search for each term. Standard database aggregate operations can handle it in a single pass:

    SELECT document_id, SUM(score) AS total_score, COUNT(score) AS matches
    FROM rev_index
    WHERE word IN ('david', 'john')
    GROUP BY document_id
    HAVING matches = 2
    ORDER BY total_score DESC

This returns the IDs of all documents that have scores for both "david" and "john" (i.e., both words appear), ordered by an approximation of relevance, and it takes roughly the same time regardless of how many or how few terms you search for, since the performance of IN depends very little on the size of the target set and a simple count determines whether all terms matched.
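
As a usage example, here is a self-contained run of essentially that query against SQLite (the table layout matches the description above; the sample rows are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rev_index (document_id INTEGER, word TEXT, score REAL)")
    conn.executemany(
        "INSERT INTO rev_index VALUES (?, ?, ?)",
        [
            (1, "david", 0.9), (1, "john", 0.4),   # doc 1 contains both words
            (2, "david", 0.7),                     # doc 2 contains only "david"
            (3, "john", 0.8), (3, "david", 0.1),   # doc 3 contains both words
        ],
    )

    query = """
        SELECT document_id, SUM(score) AS total_score, COUNT(score) AS matches
        FROM rev_index
        WHERE word IN ('david', 'john')
        GROUP BY document_id
        HAVING matches = 2
        ORDER BY total_score DESC
    """
    for document_id, total_score, matches in conn.execute(query):
        print(document_id, total_score)    # doc 1 first (1.3), then doc 3 (0.9)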

Note that this simplistic method just adds the "david" and "john" scores together to determine overall relevance; it takes no account of the order/proximity/etc. of the names. Then again, I’m sure Google does factor that into its scoring, but my client didn’t need it.

+1

I did something similar to this years ago on a 16-bit machine. The dataset had an upper limit of around 110,000 records (it was a cemetery, so there is a finite limit on the number of burials), so I set up a series of bitmaps, each containing 128K bits.

A search for "david" resulted in setting the appropriate bit in one of the bitmaps to indicate that the record contained the word "david". The same was done for "john" in a second bitmap.

Then all you need to do is a binary AND of the two bitmaps, and the resulting bitmap tells you which record numbers contained both "david" and "john". A quick scan of that bitmap yields the list of records matching both terms.
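
A minimal sketch of the same trick, using Python’s arbitrary-precision integers in place of the original fixed 128K-bit buffers (the record numbers here are invented):

    N = 110_000                                # upper bound on record numbers

    def make_bitmap(matching_records):
        """Set one bit per matching record number."""
        bm = 0
        for rec in matching_records:
            bm |= 1 << rec
        return bm

    david = make_bitmap([3, 17, 42, 99_000])   # hypothetical hits for "david"
    john = make_bitmap([17, 50, 99_000])       # hypothetical hits for "john"

    both = david & john                        # the binary AND of the two bitmaps

    # Scan the result for set bits to recover the matching record numbers.
    print([rec for rec in range(N) if (both >> rec) & 1])   # -> [17, 99000]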

This method wouldn’t work for Google, though, so consider it my $0.02 worth.

0
