Dear Stack Overflow community:
Given some text, I want to get the top 50 most frequently occurring words in it and build a tag cloud from them, and thus show the essence of what the text is about in graphical form.
The text is a collection of 100 or so comments, and each element (a document) contains about 120 terms. I also want the cloud to stay current: comments keep getting indexed, and the cloud-generation code runs each time a new web request comes in.
I decided to use Solr to index the text, and now I am wondering how to get the top 50 words out of Solr's TermVectorComponent. The following is an example of the results returned by the term vector component after you turn on term frequencies with tv.tf=true:
<lst name="doc-5">
  <str name="uniqueKey">MA147LL/A</str>
  <lst name="includes">
    <lst name="cabl"><tf>5</tf></lst>
    <lst name="earbud"><tf>3</tf></lst>
    <lst name="headphon"><tf>10</tf></lst>
    <lst name="usb"><tf>11</tf></lst>
  </lst>
</lst>
<lst name="doc-9">
  <str name="uniqueKey">3007WFP</str>
  <lst name="includes">
    <lst name="cabl"><tf>5</tf></lst>
    <lst name="usb"><tf>4</tf></lst>
  </lst>
</lst>
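For reference, output of this shape comes from a request along these lines (the /tvrh handler path and the includes field are from my test setup; yours may differ, and the handler must have the TermVectorComponent configured):

```shell
curl "http://localhost:8983/solr/tvrh?q=*:*&tv=true&tv.tf=true&tv.fl=includes"
```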
As you can see, I have 2 problems:
- I get all the terms in the document for this field, not just the top 50
- They are not sorted by frequency, so I would have to fetch all the terms and sort them in memory to do what I'm trying to do.
Is there a better way? Can I tell the Solr term vector component to sort the terms and return only the top 50 for me? Or is there another framework I could use? I need new comments to be indexed as they appear, so that the tag cloud is always up to date. As for the cloud generator itself, it just takes a dictionary of weighted words and makes it look nice.
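In case it helps clarify what I mean by the in-memory fallback: here is a sketch of it, parsing the sample term vector output above and sorting by tf. The field name "includes" comes from the example; a real response wraps the term vectors in more surrounding elements, so this is only an illustration, not production code.

```python
import xml.etree.ElementTree as ET
from operator import itemgetter

# The TermVectorComponent sample from above, wrapped in a single root
# element so it parses as one XML document.
RESPONSE = """
<lst name="termVectors">
  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="earbud"><tf>3</tf></lst>
      <lst name="headphon"><tf>10</tf></lst>
      <lst name="usb"><tf>11</tf></lst>
    </lst>
  </lst>
</lst>
"""

def top_terms(xml_text, field, n=50):
    """Collect (term, tf) pairs for one field and return the n most frequent."""
    root = ET.fromstring(xml_text)
    pairs = []
    for field_lst in root.iter("lst"):
        if field_lst.get("name") != field:
            continue
        # Each child <lst name="term"><tf>N</tf></lst> is one term.
        for term_lst in field_lst.findall("lst"):
            tf = term_lst.find("tf")
            if tf is not None:
                pairs.append((term_lst.get("name"), int(tf.text)))
    pairs.sort(key=itemgetter(1), reverse=True)
    return pairs[:n]

print(top_terms(RESPONSE, "includes", n=3))
# -> [('usb', 11), ('headphon', 10), ('cabl', 5)]
```

This works, but it pulls every term over the wire and sorts client-side on every request, which is exactly the overhead I was hoping Solr could avoid.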
This answer does not help.
EDIT - tried out jpountz's and Paige Cook's answers
Here is the result I got for this query:
select?q=Id:d4439543-afd4-42fb-978a-b72eab0c07f9&facet=true&facet.field=Post_Content&facet.mincount=1&facet.limit=50

<int name="also">1</int>
<int name="ani">1</int>
<int name="anoth">1</int>
<int name="atleast">1</int>
<int name="base">1</int>
<int name="bcd">1</int>
<int name="becaus">1</int>
<int name="better">1</int>
<int name="bigger">1</int>
<int name="bio">1</int>
<int name="boot">1</int>
<int name="bootabl">1</int>
<int name="bootload">1</int>
<int name="bootscreen">1</int>
I got 50 such elements; thanks @jpountz for helping to limit the results. BUT why do ALL FIFTY of the individual <int> elements contain a value of 1? My theory: the 1 is the number of documents matching my query (which can only be one, since I queried by Id:Guid), and it does not represent the word frequency inside Post_Content.
To prove this, I removed the Id:GUID clause from the request, and the result was:
<int name="content">33</int>
<int name="can">17</int>
<int name="on">16</int>
<int name="so">16</int>
<int name="some">16</int>
<int name="all">15</int>
<int name="i">15</int>
<int name="do">14</int>
<int name="have">14</int>
<int name="my">14</int>
My problem is how to get the frequency of a term within a document, not the number of documents a term appears in. For example, I know that "boot" is a word I used 6 times in Post_Content, so I want sorted pairs like (6, "boot"), (5, "disk") for a set of documents.
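To make the distinction concrete, here is a toy illustration in plain Python (the corpus data is invented) of the two different counts: the facet numbers I got back are document frequencies, while what the tag cloud needs are summed term frequencies:

```python
from collections import Counter

# Toy corpus: each inner list is the tokenized Post_Content of one comment.
docs = [
    ["boot", "boot", "boot", "disk", "disk"],
    ["boot", "disk", "disk", "disk"],
    ["boot", "boot", "usb"],
]

# Document frequency: in how many documents a term appears at all.
# This is what faceting on Post_Content counts.
df = Counter()
for doc in docs:
    df.update(set(doc))

# Term frequency: how many times a term occurs in total across the corpus.
# This is the weight a tag cloud should use.
tf = Counter()
for doc in docs:
    tf.update(doc)

print(df.most_common())  # "boot" appears in 3 docs, "disk" in 2, "usb" in 1
print(tf.most_common())  # but "boot" occurs 6 times, "disk" 5 times, "usb" once
```

Faceting with q=Id:<guid> gives me the df numbers restricted to a one-document result set (hence all the 1s); what I am after is the tf ordering, like (6, "boot"), (5, "disk").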