I am trying to implement an Expectation Maximization (EM) algorithm for clustering documents. I plan to use Lucene term vectors to measure the similarity between two documents. There are two kinds of EM algorithms based on naive Bayes: the multivariate model and the multinomial model. Simply put, the multinomial model uses the frequencies of the different words in a document, while the multivariate model only uses whether a word is present in the document (a boolean vector).
I know that the term vectors in Lucene store the terms that are present in the current document, along with their frequencies. This is exactly what is needed for the multinomial model.
But the multivariate model requires the following: a vector that records the presence or absence of each specific term. Thus, this vector must cover all the terms in all the documents.
As an example:
doc1: the CONTENT field has the following terms: this is a world of pleasure.
doc2: the CONTENT field has the following terms: this wonderful world is full of sarcastic people.
Now the vector I need should be
<this is a world of pleasure wonderful full sarcastic people> (it contains all the words in all the documents)
for doc1 the value of this vector is <1 1 1 1 1 1 0 0 0 0>
for doc2 the value of this vector is <1 1 0 1 1 0 1 1 1 1>
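To make the target concrete, here is a minimal plain-Java sketch of the vector I am after. It does not use Lucene at all: it assumes the terms of each document have already been read out (for example from the stored term vectors) into simple lists. The class and method names (`BooleanTermVectors`, `buildVocabulary`, `toBooleanVector`) are made up for illustration and are not part of any Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class BooleanTermVectors {

    // Union of all terms across all documents, kept in first-seen order
    // so that every document's vector uses the same positions.
    static List<String> buildVocabulary(List<List<String>> docs) {
        Set<String> vocabulary = new LinkedHashSet<>();
        for (List<String> doc : docs) {
            vocabulary.addAll(doc);
        }
        return new ArrayList<>(vocabulary);
    }

    // 1 where the document contains the vocabulary term, 0 otherwise.
    // Term frequencies are deliberately ignored (presence/absence only).
    static int[] toBooleanVector(List<String> vocabulary, List<String> docTerms) {
        Set<String> present = new HashSet<>(docTerms);
        int[] vector = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            vector[i] = present.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        // Hypothetical input: terms per document, assumed to be already extracted.
        List<String> doc1 = Arrays.asList("this", "is", "a", "world", "of", "pleasure");
        List<String> doc2 = Arrays.asList(
                "this", "wonderful", "world", "is", "full", "of", "sarcastic", "people");

        List<String> vocabulary = buildVocabulary(Arrays.asList(doc1, doc2));
        System.out.println(vocabulary);
        System.out.println(Arrays.toString(toBooleanVector(vocabulary, doc1)));
        System.out.println(Arrays.toString(toBooleanVector(vocabulary, doc2)));
    }
}
```

The part I do not know how to do is getting the global vocabulary and the per-document term sets out of the index efficiently, rather than building them by hand like this.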
Is there a way to generate such a boolean vector in Lucene?