Lucene Term Vector Multivariate Bayes Model Expectation Maximization

I am trying to implement an Expectation Maximization (EM) algorithm for clustering documents. I plan to use Lucene term vectors to find the similarity between two documents. There are two kinds of EM algorithms based on naive Bayes: the multivariate model and the multinomial model. Simply put, the multinomial model uses the frequencies of the different words in the documents, while the multivariate model only uses the information of whether a word is present in the document or not (a boolean vector).

I know that term vectors in Lucene store the terms that are present in the current document, along with their frequencies. This is exactly what is needed for the multinomial model.

But the multivariate model requires the following: a vector that records the presence or absence of each specific term. Thus, all terms from all documents must be covered by this vector.
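For reference, the difference between the two models can be written out as below. This is only a sketch under the usual naive Bayes independence assumption, and the symbols are introduced here for illustration: $V$ is the vocabulary over all documents, $n_{t,d}$ is the count of term $t$ in document $d$, $b_{t,d} \in \{0, 1\}$ marks whether $t$ occurs in $d$, and $\theta_{t \mid c}$ is the per-cluster term parameter (estimated separately for each model).

```latex
% Multinomial model: only term counts enter the likelihood
% (the length term and the multinomial coefficient are dropped).
P(d \mid c) \;\propto\; \prod_{t \in V} \theta_{t \mid c}^{\,n_{t,d}}

% Multivariate (Bernoulli) model: every vocabulary term contributes,
% whether it is present (b = 1) or absent (b = 0).
P(d \mid c) \;=\; \prod_{t \in V} \theta_{t \mid c}^{\,b_{t,d}}
                  \,\bigl(1 - \theta_{t \mid c}\bigr)^{1 - b_{t,d}}
```

The second product runs over absent terms as well, which is why the vector has to span the whole vocabulary and not just the terms stored for one document.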

As an example:

doc1: the CONTENT field has the following terms: this is a world of pleasure.

doc2: the CONTENT field has the following terms: this amazing world is full of sarcastic people.

Now the vector I need should be

<this is a world of pleasure amazing full sarcastic people> (it contains every distinct word from all documents)

for doc1 the value of this vector is <1 1 1 1 1 1 0 0 0 0>

for doc2 the value of this vector is <1 1 0 1 1 0 1 1 1 1>
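To make the target concrete, here is a minimal sketch of that vector in plain Java, with no Lucene calls at all. The class and method names (`BooleanTermVectors`, `vocabulary`, `presenceVector`) are hypothetical, and whitespace splitting is only a stand-in: in Lucene the per-document term sets would presumably come from each document's term vector and the vocabulary from the terms of the field across the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class BooleanTermVectors {

    // Stand-in tokenizer: lowercases and splits on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().toLowerCase().split("\\s+"));
    }

    // Union of all terms over all documents, kept in first-seen order
    // so that every document's vector uses the same term positions.
    static List<String> vocabulary(List<String> docs) {
        Set<String> vocab = new LinkedHashSet<>();
        for (String doc : docs) {
            vocab.addAll(tokenize(doc));
        }
        return new ArrayList<>(vocab);
    }

    // 1 where the vocabulary term occurs in the document, 0 otherwise.
    static int[] presenceVector(List<String> vocab, String doc) {
        Set<String> terms = new HashSet<>(tokenize(doc));
        int[] vector = new int[vocab.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = terms.contains(vocab.get(i)) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "this is a world of pleasure",
                "this amazing world is full of sarcastic people");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (String doc : docs) {
            System.out.println(Arrays.toString(presenceVector(vocab, doc)));
        }
    }
}
```

Running `main` prints the shared vocabulary followed by one 0/1 vector per document. The vocabulary has to be built in a first pass over all documents before any single vector can be filled in.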

Is there a way to generate such a boolean vector in Lucene?


Source: https://habr.com/ru/post/1793682/

