Tf-idf using unigram frequency data from Google

I am trying to identify important terms in a set of government documents. Generating the term frequencies is not a problem.

For the document frequencies, I was hoping to use the handy Python scripts and accompanying data that Peter Norvig posted for his chapter in Beautiful Data, which include the frequencies of unigrams in a huge corpus of data from the Web.

My understanding of tf-idf, however, is that "document frequency" refers to the number of documents containing a term, not the number of total words that are that term, which is what we get from the Norvig script. Can I still use this data for a rough tf-idf calculation?

Here are some sample data:

word      tf        global frequency
china     1684      0.000121447
the       352385    0.022573582
economy   6602      0.0000451130774123
and       160794    0.012681757
iran      2779      0.0000231482902018
romney    1159      0.000000678497795593

Just dividing tf into gf gives "the" a higher score than "economy", which can't be right. Is there some basic math I'm missing?
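For example, with the numbers above that division (gf / tf) works out to about 0.0226 / 352385 ≈ 6.4e-8 for "the" but only 0.0000451 / 6602 ≈ 6.8e-9 for "economy", so the stop word wins by nearly a factor of ten.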

1 answer

As I understand it, the global frequency corresponds to the "inverse total term frequency" that Robertson discusses. From Robertson's article:

 One possible way to get away from this problem would be to make a fairly radical replacement for IDF (that is, radical in principle, although it may be not so radical in terms of its practical effects). ... the probability from the event space of documents to the event space of term positions in the concatenated text of all the documents in the collection. Then we have a new measure, called here inverse total term frequency: ... On the whole, experiments with inverse total term frequency weights have tended to show that they are not as effective as IDF weights

According to this text, you can use the inverse of the global frequency as an IDF term, keeping in mind that it is a cruder measure than the standard IDF.
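To make that concrete, here is a minimal sketch (the names and layout are illustrative, not taken from Norvig's scripts) that scores the sample words from the question using log(1/gf) in place of IDF:

import math

# Sample data from the question: per-document term counts and
# Google-derived global unigram frequencies (fraction of all tokens).
term_freq = {"china": 1684, "the": 352385, "economy": 6602,
             "and": 160794, "iran": 2779, "romney": 1159}
global_freq = {"china": 0.000121447, "the": 0.022573582,
               "economy": 0.0000451130774123, "and": 0.012681757,
               "iran": 0.0000231482902018, "romney": 0.000000678497795593}

def score(tf, gf):
    """tf-idf-style weight using log(1 / gf) as a stand-in for IDF."""
    return tf * math.log(1.0 / gf)

ranked = sorted(((w, score(tf, global_freq[w])) for w, tf in term_freq.items()),
                key=lambda pair: pair[1], reverse=True)
for word, s in ranked:
    print(f"{word:10s} {s:14.1f}")

Note that with these particular numbers "the" and "and" still come out on top, which is exactly why the stop-word advice below matters.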

Also, you have not removed stop words. Words such as "the" occur in almost every document, so they carry no information. You should remove such stop words before computing tf-idf.
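A minimal sketch of that step, assuming you tokenize the documents yourself (the stop list here is a tiny illustrative set; in practice use a fuller one such as NLTK's or scikit-learn's):

from collections import Counter

# Tiny illustrative stop list; real lists are much longer.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "for"}

def term_frequencies(tokens):
    """Count term frequencies, ignoring stop words."""
    return Counter(t for t in tokens if t.lower() not in STOP_WORDS)

doc = "the economy of china and the economy of iran".split()
print(term_frequencies(doc))   # Counter({'economy': 2, 'china': 1, 'iran': 1})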


Source: https://habr.com/ru/post/1491970/

