I am trying to identify important terms in a set of government documents. Generating term frequencies is not a problem.
For document frequency, I was hoping to use the handy Python scripts and accompanying data that Peter Norvig posted for his chapter in *Beautiful Data*, which include the frequencies of unigrams in a huge corpus of web data.
My understanding of tf-idf, however, is that "document frequency" refers to the number of documents containing a term, not the total number of occurrences of that term across the corpus, which is what Norvig's data gives. Can I still use this data for a quick-and-dirty tf-idf calculation?
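To make the distinction I mean concrete, here is a toy sketch (the documents are invented by me, purely for illustration) contrasting document frequency with a raw global count:

```python
import math

# Three made-up toy "documents" (not my real government documents)
docs = [
    "the economy of china",
    "the economy and iran",
    "the china iran romney",
]

N = len(docs)
tokenized = [d.split() for d in docs]

def df(term):
    # document frequency: number of documents containing the term
    return sum(term in doc for doc in tokenized)

def tf(term, doc):
    # term frequency: occurrences of the term in one document
    return doc.count(term)

def tfidf(term, doc):
    # classic tf-idf: tf * log(N / df)
    return tf(term, doc) * math.log(N / df(term))

# "the" appears in every document, so its idf is log(3/3) = 0
# and its tf-idf score is 0, no matter how often it occurs.
```

Norvig's data, by contrast, would tell me the total count of "the" across the corpus, not how many documents it appears in.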
Here are some sample data:
word     tf       global frequency
china    1684     0.000121447
the      352385   0.022573582
economy  6602     0.0000451130774123
and      160794   0.012681757
iran     2779     0.0000231482902018
romney   1159     0.000000678497795593
Just dividing tf into gf gives "the" a higher score than "economy", which can't be right. Is there some basic math I'm missing?
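For reference, one adjustment I have seen in tf-idf write-ups is to dampen the global frequency with a log rather than dividing directly; I'm not sure it is the right math for my case, but this is the form I mean (using the sample rows above):

```python
import math

# Sample rows from above: word -> (tf, Norvig global frequency)
data = {
    "china":   (1684,   0.000121447),
    "the":     (352385, 0.022573582),
    "economy": (6602,   0.0000451130774123),
    "and":     (160794, 0.012681757),
    "iran":    (2779,   0.0000231482902018),
    "romney":  (1159,   0.000000678497795593),
}

# Log-damped score: tf * log(1 / gf), treating global frequency
# as a rough stand-in for df / N in the usual idf term.
scores = {w: tf * math.log(1.0 / gf) for w, (tf, gf) in data.items()}
```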