Calculating TF-IDF for words in documents in Solr and Java

I can easily get the TF by counting the occurrences of a term in a document. What I want to know is how to calculate the document frequency, i.e. the number of documents that contain the term.

What I have been doing so far is querying Solr with a very large number of rows and counting the results that come back, but this is very expensive in both time and memory. I only want the count for the term, not the documents themselves.

    SolrQuery q = new SolrQuery();
    q.setQuery("tweet_text:" + kw);
    q.addField("tweet_text");
    q.setRows(40000000);
    SolrDocumentList results = null;
    try {
        QueryResponse rsp = solrServer.query(q);
        results = rsp.getResults();
    } catch (SolrServerException e) {
        e.printStackTrace();
    }
    ArrayList<String> tweets = new ArrayList<String>();
    for (SolrDocument doc : results) {
        tweets.add(doc.getFieldValue("tweet_text").toString());
    }
1 answer

In Solr, you can use a function query to get the document frequency directly, as documented here: http://wiki.apache.org/solr/FunctionQuery#docfreq

 q={!func}docfreq(tweet_text,'kw')

Note: the same page also documents function queries for tf, idf and termfreq, which may be useful to you as well.
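To make the request concrete, here is a minimal sketch of how the parameters for such a function query could be assembled in plain Java. The class and method names (DocFreqQuery, functionFor, buildParams) are hypothetical and not part of SolrJ. It assumes the term is passed as a single-quoted string literal with backslash escaping, and that asking for rows=1 with fl=score is enough, since a function query scores every matching document with the function's value. Treat both as assumptions to verify against your Solr version.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not part of SolrJ): builds the request parameters for a
// docfreq function query so that only one row has to be returned.
final class DocFreqQuery {

    // Builds docfreq(field,'term'). The term is wrapped in single quotes, with
    // backslashes and single quotes escaped (assumed escaping rules).
    static String functionFor(String field, String term) {
        String escaped = term.replace("\\", "\\\\").replace("'", "\\'");
        return "docfreq(" + field + ",'" + escaped + "')";
    }

    // URL-encodes a parameter value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns the query string: the function query as q, the score as the only
    // returned field, and a single row instead of millions.
    static String buildParams(String field, String term) {
        return "q=" + encode("{!func}" + functionFor(field, term))
                + "&fl=score&rows=1";
    }
}
```

Calling DocFreqQuery.buildParams("tweet_text", kw) gives a string that can be appended to the core's /select URL. With SolrJ, the same three parameters could instead be set on the SolrQuery object used in the question.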


In retrospect, the following is probably less relevant to this question, but I will leave it here in case it is useful to you.

IndexReader.docFreq(Term) can give you what you are looking for.

For example:

 reader.docFreq(new Term("tweet_text", kw));

IndexSearcher.docFreq(Term) does the same thing, by the way.
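Once the document frequency is available from either approach, it can be combined with the term frequency you already have. The sketch below uses the textbook weighting tf * ln(N / df). This is an assumption made for illustration, not the formula that Lucene's similarity classes apply internally. The class name TfIdf is hypothetical, and numDocs stands for the total number of documents in the index (for example, the value of IndexReader.numDocs()).

```java
// Hypothetical helper: combines a term frequency with a document frequency
// using the textbook weighting tf * ln(N / df).
final class TfIdf {

    // Inverse document frequency; returns 0 when the term occurs in no
    // document (or the index is empty) to avoid dividing by zero.
    static double idf(long docFreq, long numDocs) {
        if (docFreq <= 0 || numDocs <= 0) {
            return 0.0;
        }
        return Math.log((double) numDocs / docFreq);
    }

    // TF-IDF weight of a term in one document.
    static double tfIdf(long termFreq, long docFreq, long numDocs) {
        return termFreq * idf(docFreq, numDocs);
    }
}
```

For a term that occurs 3 times in a tweet and in 10 of 1000 tweets overall, TfIdf.tfIdf(3, 10, 1000) evaluates to 3 * ln(100).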


Source: https://habr.com/ru/post/1447268/

