Saving Inverted Index

I know that an inverted index is a good way to index words, but what confuses me is how search engines actually store them. For example, if the word "google" appears in documents 2, 4, 6, and 8 with different frequencies, where should that information be stored? Would a database table with a one-to-many relationship be of any use for storing it?

+5
4 answers

It is highly unlikely that full-fledged SQL-like databases are used for this purpose. First, it is called an inverted index because it is just an index: each entry is only a reference. This is the kind of data for which non-relational databases and key-value stores have emerged as a favorite topic in web technology:

  • You have only one way to access the data (by query word). That is why it is called an index.
  • Each entry is a list / array / vector of references to documents, so each element of this list is very small. The only other information you might store besides the document ID is a tf-idf score for each element.

How to use it:

If you have a single query word ("google"), you look up in the inverted index which documents that word appears in (2, 4, 6, 8 in your example). If you have tf-idf scores, you can sort the results to report the best-matching document first. You then look up the documents referenced by document IDs 2, 4, 6, 8 and report their URLs, a snippet, and so on. URLs, snippets, etc. are probably best stored in another table or key-value store.

If you have multiple query words ("google" and "altavista"), you look up the inverted index for both query words and get two lists of document IDs (2, 4, 6, 8 and 3, 7, 8, 11, 19). You take the intersection of both lists, which in this case is (8): the list of documents in which both query words occur.
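The multi-word lookup can be sketched in a few lines of Python. This is only an illustration, assuming the index is an in-memory dict that maps each word to a sorted list of document IDs; the names `intersect` and `search` are made up for this example.

```python
def intersect(a, b):
    """Intersect two sorted lists of document IDs with a linear merge."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index, words):
    """Return the IDs of documents that contain every query word."""
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((index.get(w, []) for w in words), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result


# Hypothetical in-memory index using the numbers from the answer above.
index = {"google": [2, 4, 6, 8], "altavista": [3, 7, 8, 11, 19]}
print(search(index, ["google", "altavista"]))  # [8]
```

Keeping each list sorted by document ID is what makes the single-pass merge possible.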

+4

It is a fair bet that each of the main search engines has its own technology for processing inverted indexes. It is also a moderately good bet that they are not based on standard relational database technology.

In the specific case of Google, it is reasonable to assume that the technology used is based on BigTable, described in 2006 by Fay Chang et al. in "Bigtable: A Distributed Storage System for Structured Data". There is little doubt that the system has evolved since then.

+2

Traditionally, an inverted index is written directly to a file and stored on disk somewhere. If you want to run Boolean retrieval queries (a document either contains all the words in the query or it does not), the postings might look like this, stored contiguously in the file:

Term_ID_1: Frequency_N: Doc_ID_1, Doc_ID_2, Doc_ID_N.Term_ID_2: Frequency_N: Doc_ID_1, Doc_ID_2, Doc_ID_N.Term_ID_N: Frequency_N: Doc_ID_1, Doc_ID_2, Doc_ID_N

Term_ID is the ID of the term, Frequency is the number of documents the term appears in (in other words, how long the postings list is), and each Doc_ID is a document that contains the term.

Along with the index, you need to know where everything is in the file, so mappings also have to be stored somewhere, for example in another file. Given a term ID, the map should return the file position of that term's postings, and you can then seek to that position. Since the frequency is recorded in the postings, you know how many doc IDs to read from the file. In addition, there need to be mappings from the IDs to the actual term and document names.
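As a rough sketch of this layout, the Python below writes each postings list as a term ID, a frequency, and then the doc IDs, and records the byte offset of each list in a separate map. The fixed 4-byte little-endian integers and the function names are assumptions chosen for this example, not something the layout above prescribes; an in-memory buffer stands in for the file on disk.

```python
import io
import struct


def write_index(index, f):
    """Write postings contiguously and return {term_id: byte offset}."""
    offsets = {}
    for term_id, doc_ids in index.items():
        offsets[term_id] = f.tell()
        # Header: term ID and frequency (length of the postings list).
        f.write(struct.pack("<II", term_id, len(doc_ids)))
        # Body: the doc IDs themselves, 4 bytes each.
        f.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    return offsets


def read_postings(f, offsets, term_id):
    """Seek to a term's postings and read exactly `frequency` doc IDs."""
    f.seek(offsets[term_id])
    _stored_term_id, frequency = struct.unpack("<II", f.read(8))
    return list(struct.unpack(f"<{frequency}I", f.read(4 * frequency)))


# Hypothetical mappings; term IDs 1 and 2 are made up for the example.
term_ids = {"google": 1, "altavista": 2}
index = {1: [2, 4, 6, 8], 2: [3, 7, 8, 11, 19]}

f = io.BytesIO()  # stands in for a file opened in binary mode
offsets = write_index(index, f)
print(read_postings(f, offsets, term_ids["google"]))  # [2, 4, 6, 8]
```

In a real setup the `offsets` and `term_ids` maps would themselves be persisted, as the answer notes.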

If you have a small use case, you may be able to pull this off with SQL, using a blob for each postings list and handling the intersection yourself at query time.

Another strategy for a very small use case is to use a term-document matrix.
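For illustration, a term-document matrix can be as simple as one row of 0/1 flags per term, with one column per document. The tiny corpus and the names `matrix` and `and_query` below are invented for this sketch.

```python
# Hypothetical three-document corpus.
docs = {
    1: "google search engine",
    2: "altavista search",
    3: "google and altavista",
}

doc_ids = sorted(docs)
terms = sorted({word for text in docs.values() for word in text.split()})

# matrix[term][k] is 1 if the term occurs in document doc_ids[k], else 0.
matrix = {
    term: [1 if term in docs[d].split() else 0 for d in doc_ids]
    for term in terms
}


def and_query(words):
    """Return IDs of documents whose column is 1 in every queried row."""
    rows = [matrix.get(w, [0] * len(doc_ids)) for w in words]
    return [d for k, d in enumerate(doc_ids) if all(row[k] for row in rows)]


print(and_query(["google", "altavista"]))  # [3]
```

The matrix has one cell for every (term, document) pair, which is why this only suits very small collections.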

+2

Possible Solution

One possible solution would be to use a positional index. This is basically an inverted index, but we augment it by adding more information. You can read more about it at Stanford NLP.

Example

Say the word "hello" appears in documents 1 and 3, at positions (3,5,6,200) and (9,10) respectively.

  • Basic inverted index (note that there is no way to find the word frequencies or their positions)

"hello" => [1,3]

  • Positional index (note that we not only have the frequency for each document, but we also know exactly where the term appeared in the document)

"hello" => [1:<3,5,6,200> , 3:<9,10>]
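A positional index of this shape can be sketched as a dict of dicts. The code below is a minimal illustration, assuming whitespace tokenization and 1-based positions; the sample documents and the helper `phrase_docs` are made up for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]}, positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_docs(index, first, second):
    """Return documents where `second` occurs immediately after `first`."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = []
    for doc_id, positions in first_postings.items():
        following = set(second_postings.get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical documents, numbered 1 and 3 as in the example above.
docs = {1: "hello world hello", 3: "say hello"}
index = build_positional_index(docs)

print(index["hello"])                         # {1: [1, 3], 3: [2]}
print(len(index["hello"][1]))                 # 2
print(phrase_docs(index, "hello", "world"))   # [1]
```

The per-document frequency falls out as the length of each position list, and the positions make phrase queries possible.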

Heads up

Will your index be a lot bigger now? You bet!

That's why it is a good idea to compress the index. There are several options for compressing the postings lists using gap encoding, and even more options for compressing the dictionary using general string compression algorithms.
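To make gap encoding concrete, here is a small sketch that stores the differences between consecutive doc IDs and then packs each gap into a variable number of bytes. The byte format used (7 data bits per byte, low bits first, high bit set on every byte except the last of a number) is one common variant picked for this example; it assumes a sorted list of non-negative doc IDs.

```python
def to_gaps(doc_ids):
    """Replace each doc ID with its distance from the previous one."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Rebuild the original doc IDs with a running sum."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


def vb_encode(numbers):
    """Encode each number in 7-bit groups; high bit means 'more follows'."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)


def vb_decode(data):
    """Inverse of vb_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers


postings = [824, 829, 215406]
encoded = vb_encode(to_gaps(postings))
print(to_gaps(postings))                 # [824, 5, 214577]
print(len(encoded))                      # 6
print(from_gaps(vb_decode(encoded)))     # [824, 829, 215406]
```

With fixed 4-byte integers the same three doc IDs would take 12 bytes; small gaps are what make the variable-length form pay off.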

Related Readings

Index compression

Postings file compression

Dictionary compression

0

Source: https://habr.com/ru/post/1202843/

