Creating a tag cloud with Solr

Dear Stack Overflow community:

Given some text, I want to get the top 50 most frequently occurring words in it and create a tag cloud from them, showing the essence of what the text is about in graphical form.

The text is a collection of 100 or so comments. Each item (image) contains about 120 elements, and I also want the cloud to stay current: the comments stay indexed, and the cloud generation code runs each time a new web request arrives.

I decided to use Solr to index the text, and now I am wondering how to get the top 50 words from the Solr TermVectorComponent. The following is an example of the results the term vector component returns once you turn on term frequency with tv.tf=true:

  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="earbud"><tf>3</tf></lst>
      <lst name="headphon"><tf>10</tf></lst>
      <lst name="usb"><tf>11</tf></lst>
    </lst>
  </lst>
  <lst name="doc-9">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="usb"><tf>4</tf></lst>
    </lst>
  </lst>

As you can see, I have 2 problems:

  • I get all the terms in the document for this field, not just the top 100.
  • They are not sorted by frequency, so I have to fetch all the terms and sort them in memory to do what I'm trying to do.

Is there a better way? (or) Can I tell the Solr TermVectorComponent to somehow sort the terms and pick only the top 100 for me? (or) Is there any other framework I can use? I need new comments to be indexed as they arrive, so that the tag cloud is always up to date. As for the cloud generator, it takes a dictionary of weighted words and makes it look pretty.
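For reference, the in-memory approach described in the question could look like the following sketch. It assumes the response has the shape of the excerpt above (a `<tf>` element per term) wrapped in a single root element; the `top_terms` helper and the `SAMPLE` string are hypothetical names for illustration, not part of Solr.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: the term vector excerpt from the question, wrapped in one root element.
SAMPLE = """
<response>
  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="earbud"><tf>3</tf></lst>
      <lst name="headphon"><tf>10</tf></lst>
      <lst name="usb"><tf>11</tf></lst>
    </lst>
  </lst>
  <lst name="doc-9">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="usb"><tf>4</tf></lst>
    </lst>
  </lst>
</response>
"""


def top_terms(xml_text, field, n):
    """Sum per-document term frequencies for `field` and return the n largest."""
    root = ET.fromstring(xml_text)
    totals = Counter()
    for doc in root.findall("lst"):
        field_node = doc.find(f"lst[@name='{field}']")
        if field_node is None:
            continue  # this document has no term vector for the field
        for term in field_node.findall("lst"):
            tf = term.find("tf")
            if tf is not None and tf.text:
                totals[term.get("name")] += int(tf.text)
    return totals.most_common(n)


# Weighted dictionary of the kind a cloud generator could consume.
weights = dict(top_terms(SAMPLE, "includes", 50))
```

This still pulls every term over the wire, which is exactly the cost the question is trying to avoid; it only shows what the sorting step amounts to.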

This answer does not help.

EDIT - trying out jpountz's and Paige Cook's answers

Here is the result I got for this query:

  select?q=Id:d4439543-afd4-42fb-978a-b72eab0c07f9&facet=true&facet.field=Post_Content&facet.minCount=1&facet.limit=50

  <int name="also">1</int>
  <int name="ani">1</int>
  <int name="anoth">1</int>
  <int name="atleast">1</int>
  <int name="base">1</int>
  <int name="bcd">1</int>
  <int name="becaus">1</int>
  <int name="better">1</int>
  <int name="bigger">1</int>
  <int name="bio">1</int>
  <int name="boot">1</int>
  <int name="bootabl">1</int>
  <int name="bootload">1</int>
  <int name="bootscreen">1</int>

I have 50 such elements; thanks @jpountz for helping to limit the results. BUT why do ALL FIFTY of the individual <int> elements contain a value of 1? My guess: the number 1 is the number of documents matching my query (which can only be one, since I queried by Id:GUID), not the word frequency in Post_Content.

To check this, I removed the Id:GUID clause from the query, and the result was:

  <int name="content">33</int>
  <int name="can">17</int>
  <int name="on">16</int>
  <int name="so">16</int>
  <int name="some">16</int>
  <int name="all">15</int>
  <int name="i">15</int>
  <int name="do">14</int>
  <int name="have">14</int>
  <int name="my">14</int>

My problem is how to get the frequency of a term within a document, not the number of documents that contain the term. For example, I know that "boot" is a word I used 6 times in Post_Content. So I want sorted pairs such as (6, "boot"), (5, "disk") for a set of documents.
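The distinction between the two counts can be seen on a toy example. This is only an illustration in plain Python with made-up documents (`docs`, `term_frequency` and `document_frequency` are hypothetical names); it does not call Solr.

```python
from collections import Counter

# Hypothetical toy documents, not taken from the real index.
docs = {
    "post-1": "boot disk boot loader boot",
    "post-2": "disk boot screen",
}


def term_frequency(documents):
    """Total occurrences of each word across all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(text.split())
    return counts


def document_frequency(documents):
    """Number of documents that contain each word at least once."""
    counts = Counter()
    for text in documents.values():
        counts.update(set(text.split()))
    return counts


tf = term_frequency(docs)
df = document_frequency(docs)

# Sorted (count, word) pairs, largest count first, as asked for in the question.
pairs = sorted(((count, word) for word, count in tf.items()), reverse=True)
```

Here "boot" has a term frequency of 4 but a document frequency of 2; facet counts correspond to the second number, while the tag cloud wants the first.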

+6
3 answers

I came up with a STOPGAP solution (I call each Solr document a "post" as an example):

Solr has a TermsComponent whose purpose is to expose all indexed terms of any given field. It is mainly used to implement features such as autocomplete and other features that work at the term level. By default it sorts by frequency: the most frequent terms in the field come first.

I created a dynamic field with the prefix content_ and indexed each post-set in its own field based on the category. This means that there will be hundreds of instances of the dynamic field, each containing one post-set, and I can run the TermsComponent on one of those fields to get the TOP TERMS for that post-set.

To illustrate:

  content_postSetOne   : contains indexed version of a set of posts
  content_postSetTwo   : contains indexed version of another set of posts
  content_postSetThree : contains indexed version of a third set of posts
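A request against one of these fields could be built roughly as follows. This is a sketch, not the exact query used in this answer: the `/terms` handler path and the base URL are assumptions that depend on your solrconfig.xml, and `terms_url` is a hypothetical helper. The `terms`, `terms.fl`, `terms.limit` and `terms.sort` parameters are standard TermsComponent parameters.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, limit=50):
    """Build a TermsComponent request for the most frequent terms of one field."""
    params = {
        "terms": "true",
        "terms.fl": field,        # field to read terms from
        "terms.limit": limit,     # how many terms to return
        "terms.sort": "count",    # most frequent first (the default)
    }
    # Assumption: a request handler with the TermsComponent is mounted at /terms.
    return f"{base_url}/terms?{urlencode(params)}"


url = terms_url("http://localhost:8983/solr", "content_postSetOne")
print(url)
```

As far as I know, the counts this component returns are per-field document frequencies, so it is worth checking on your own index that they measure what the tag cloud needs.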

This solution works for me, and you can easily create one field per post if necessary. I am also interested in the implications of using dynamic fields like this: will it become a problem?

How this differs from Paige's and jpountz's answers:

  • The term frequency here is the number of occurrences of the word in a document or set of documents, not the number of documents containing the term.
  • I can get the top occurring terms from a single document and, if necessary, from a set of documents as well.
  • I did not use faceting, because it gives frequency as the number of documents, not as the number of times the word occurred regardless of document.
0

Here's an article that describes setting up a tag cloud: Creating a tag cloud with Solr and PHP. Although the PHP part may not apply to you, the actual generation of the tag cloud, in my opinion, is ...

The article describes a method of creating a text field that uses the whitespace tokenizer to produce individual words and then running a facet search on that field. I know that you can set facet limits, so in your case you could fetch only the top 100 results.

+5

If each Lucene document is a comment, you can use faceting to do this. For example, the following request http://solr:port/solr/select?q={!lucene}uniqueKey:(MA147LL/A OR 3007WFP)&facet=true&facet.field=includes&facet.minCount=1&facet.limit=50 will help you build a tag cloud for the comments MA147LL/A and 3007WFP.

However, this approach:

  • makes Solr create an instance of UnInvertedField for the includes field, which requires memory,
  • counts the number of documents that match the term, not the total number of occurrences of that term.
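Whichever count you settle on, the facet entries can be turned into a weighted dictionary for a cloud generator with a few lines of code. The sketch below assumes the `<int>` entries sit inside a single wrapper element, as in the `SAMPLE` string; `facet_weights` is a hypothetical helper, and the values are the first three entries from the second response in the question.

```python
import xml.etree.ElementTree as ET

# Assumption: facet entries for one field, wrapped in a single <lst> element.
SAMPLE = """
<lst name="Post_Content">
  <int name="content">33</int>
  <int name="can">17</int>
  <int name="on">16</int>
</lst>
"""


def facet_weights(xml_text):
    """Map each facet term to its count (a document count, per the caveat above)."""
    root = ET.fromstring(xml_text)
    return {node.get("name"): int(node.text) for node in root.iter("int")}


weights = facet_weights(SAMPLE)
```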
+1

Source: https://habr.com/ru/post/896699/

