Does PostgreSQL use tf-idf?

I would like to know whether PostgreSQL 9.3 full-text search with a GIN / GiST index uses tf-idf (term frequency-inverse document frequency).

In particular, my phrase columns contain some very common words alongside some very unique ones (e.g. names). I want to index these columns so that matches on unique words are weighted above matches on ordinary words.

+4
3 answers

No. The ts_rank function has no built-in way to rank results by their global (corpus) frequency. The ranking algorithm does, however, take frequency within the document into account:

http://www.postgresql.org/docs/9.3/static/textsearch-controls.html

So, if I search for “dog | chihuahua”, the following two documents will have the same rank, despite the relatively low frequency of the word “chihuahua”:

 "I want a dog"
 "I want a chihuahua"

However, the following document will rank higher than the two above, because it contains "dog" twice (with an English configuration, "dogs" is stemmed to "dog"):

 "dog lovers have an average of 1.5 dogs" 

In short: a higher frequency within the document leads to a higher rank, but a lower frequency in the corpus has no effect.
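To make the difference concrete, here is a minimal Python sketch that scores the three example documents two ways: by raw term frequency only, and by a textbook TF-IDF. This is not how ts_rank is implemented; the tokenizer, the smoothed IDF formula, and all function names are assumptions chosen purely for illustration.

```python
import math
import re
from collections import Counter

DOCS = [
    "I want a dog",
    "I want a chihuahua",
    "dog lovers have an average of 1.5 dogs",
]

def tokenize(text):
    # Crude stand-in for a text search parser (an assumption, not PostgreSQL's):
    # keep lowercase alphabetic words and strip a trailing plural "s".
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]

def tf_score(doc_tokens, query_terms):
    # Frequency within the document only; the corpus is ignored.
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms)

def idf(term, corpus_tokens):
    # Smoothed inverse document frequency: rarer terms get a larger weight.
    df = sum(1 for toks in corpus_tokens if term in toks)
    return math.log((1 + len(corpus_tokens)) / (1 + df)) + 1.0

def tfidf_score(doc_tokens, query_terms, corpus_tokens):
    # Term frequency weighted by how rare each term is across the corpus.
    counts = Counter(doc_tokens)
    return sum(counts[t] * idf(t, corpus_tokens) for t in query_terms)

corpus = [tokenize(d) for d in DOCS]
query = ["dog", "chihuahua"]

tf_scores = [tf_score(toks, query) for toks in corpus]
tfidf_scores = [tfidf_score(toks, query, corpus) for toks in corpus]

for doc, tf, tfidf in zip(DOCS, tf_scores, tfidf_scores):
    print(f"{doc!r}: tf={tf} tfidf={tfidf:.3f}")
```

With frequency-only scoring the first two documents tie and the third comes out on top, which mirrors the behaviour described above. With the TF-IDF weighting, the "chihuahua" document moves ahead of the "I want a dog" document because "chihuahua" occurs in fewer documents.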

One caveat: text search ignores stop words, so very high-frequency words such as "the", "a", "of", and "for" will not be matched at all (provided you have configured your language correctly).
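As a rough illustration of what dropping stop words means, here is a small Python sketch. The stop-word list and the index_terms helper are hypothetical; PostgreSQL takes its list from the configured dictionary rather than from anything like this.

```python
# Hypothetical, tiny stop-word list; a real configuration ships a much longer one.
STOP_WORDS = {"the", "a", "of", "for"}

def index_terms(text):
    # Keep only the words that would survive stop-word filtering.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

terms = index_terms("The name of a dog")
print(terms)
```

Only "name" and "dog" are left to be indexed, so a query for "the" or "of" has nothing to match against.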

+3

No, Postgres does not use TF-IDF as a measure of similarity among documents.

ts_rank is higher if the document contains the query terms more often. It does not take into account the global frequency of the term.

ts_rank_cd is higher if the query terms in the document are closer together and occur more often. It does not take into account the global frequency of the term.

There is an extension, smlar, from the creators of the text search that lets you calculate the similarity between arrays using TF-IDF. It also lets you turn tsvectors into arrays and supports fast indexing.
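For a feel of what TF-IDF similarity between arrays means, here is a minimal Python sketch using cosine similarity over TF-IDF weighted term vectors. It does not use or reproduce smlar's API; the sample arrays, the IDF formula, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter

def idf_table(arrays):
    # Document frequency per term, turned into an inverse-frequency weight.
    n = len(arrays)
    df = Counter()
    for arr in arrays:
        df.update(set(arr))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def tfidf_vector(arr, idf):
    # Sparse vector: term -> (count in this array) * (corpus-wide rarity weight).
    counts = Counter(arr)
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical token arrays: "dog" is common, the names are rare.
arrays = [
    ["john", "dog"],
    ["mary", "dog"],
    ["john", "cat"],
    ["anna", "dog"],
    ["paul", "dog"],
]

idf = idf_table(arrays)
vectors = [tfidf_vector(arr, idf) for arr in arrays]

sim_common = cosine(vectors[0], vectors[1])  # the two arrays share only "dog"
sim_rare = cosine(vectors[0], vectors[2])    # the two arrays share only "john"
print(f"shared common word: {sim_common:.3f}, shared rare word: {sim_rare:.3f}")
```

Sharing the rare name "john" yields a higher similarity than sharing the common word "dog", which is the kind of weighting the question asks for. Without the IDF weights both pairs would score the same.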

+2

Sort of. The details are described at http://www.postgresql.org/docs/9.1/static/textsearch-controls.html

The main caveat is that term “frequency” is not really derived from the corpus you are indexing; rather, it is set in the dictionary. So, as long as you choose the right language, it seems to me you should be fine.

-1

Source: https://habr.com/ru/post/1497644/

