How to include dates and other priority information for clustering?

I want to cluster text. I understand the concept of clustering text content from Mahout in Action:

  • map (int → term) all terms in the input files and store them in a dictionary
  • convert all input documents to normalized sparse vectors
  • do clustering
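The first two steps above can be sketched roughly as follows. This is plain Python for illustration only, not Mahout's API (Mahout itself is Java); the function names `build_dictionary` and `to_sparse_vector`, the whitespace tokenization and the L2 normalization are all assumptions made for the example.

```python
import math
from collections import Counter

def build_dictionary(docs):
    """Assign an integer index to every distinct term (hypothetical helper)."""
    terms = sorted({t for doc in docs for t in doc.lower().split()})
    return {term: i for i, term in enumerate(terms)}

def to_sparse_vector(doc, dictionary):
    """Return {term_index: weight}, with raw term counts scaled to unit L2 length."""
    counts = Counter(dictionary[t] for t in doc.lower().split() if t in dictionary)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {i: c / norm for i, c in counts.items()} if norm else {}

docs = ["hiking in the alps", "alps hiking trip", "office meeting notes"]
dictionary = build_dictionary(docs)
vectors = [to_sparse_vector(d, dictionary) for d in docs]
```

The resulting sparse vectors are what a clustering algorithm would then consume in the third step.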

I want to cluster on the text as well as other information such as date-time, location, and the people I was with. For example, I want documents written during a 10-day visit to a remote location to be placed in a separate cluster.

I know that I have to write my own tool for creating vectors from date, place, tags and (natural) text. How do I approach this? Should I use the built-in tools to vectorize the text, and then integrate this output into my own vectors? How should the dimensions be weighted?

1 answer

I cannot give you details of the implementation, as I am not sure, but I can help you with a piece of the puzzle. You will probably need some contextual analysis to extract entities (such as location, time / date, people’s names).

To do this, take a look at OpenNLP.

http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html

In particular, look at the POS tagger and the name finder.

Once you have extracted the relevant entities, you can "do something" with them using Mahout classification (after you have extracted enough entities to train your model), but I'm not sure.
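For the clustering case in the question, one possible way to use the extracted entities is to append them as extra, weighted dimensions next to the text vector. The sketch below is plain Python rather than Mahout code, and everything in it is an assumption for illustration: the `WEIGHTS` values, the `combine` and `distance` helpers, the scaling of the date to the range 0..1, and the one-hot encoding of place and people.

```python
import math
from datetime import datetime

# Hypothetical per-block weights; they would need tuning on real data.
WEIGHTS = {"text": 1.0, "time": 2.0, "place": 2.0, "people": 1.0}

def unit(vec):
    """Scale a sparse vector to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}

def combine(text_vec, when, place, people, t_min, t_max):
    """Concatenate feature blocks into one sparse vector keyed by (block, feature)."""
    combined = {("text", k): WEIGHTS["text"] * v for k, v in unit(text_vec).items()}
    # Scale the timestamp to 0..1 over the date range of the whole collection.
    span = (t_max - t_min).total_seconds() or 1.0
    combined[("time", "t")] = WEIGHTS["time"] * (when - t_min).total_seconds() / span
    # One-hot dimension for the place, unit-length block for the people.
    combined[("place", place)] = WEIGHTS["place"]
    for name, v in unit({p: 1.0 for p in people}).items():
        combined[("people", name)] = WEIGHTS["people"] * v
    return combined

def distance(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

t_min, t_max = datetime(2013, 1, 1), datetime(2013, 12, 31)
trip_a = combine({"hike": 1.0, "lake": 1.0}, datetime(2013, 7, 2), "cabin", ["ann"], t_min, t_max)
trip_b = combine({"dinner": 1.0}, datetime(2013, 7, 9), "cabin", ["ann", "bob"], t_min, t_max)
home = combine({"dinner": 1.0}, datetime(2013, 11, 20), "home", ["bob"], t_min, t_max)
```

With these made-up weights, `trip_b` ends up closer to `trip_a` than to `home` even though its text matches `home` exactly, because the time and place blocks outweigh the text block. Raising or lowering a block's weight is how its influence on the clustering would be controlled.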

Good luck.


Source: https://habr.com/ru/post/1482310/

