Python - predicting new document similarity with tf-idf

Question

Inspired by this answer, I am trying to find the cosine similarity between a trained tf-idf vectorizer and a new document, and return the similar documents.

The code below finds the cosine similarity for the first vector, not for a new query:

>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
    0.04457106,  0.03293218])

Since my training data is huge, looping through the entire trained vectorizer sounds like a bad idea. How can I infer the vector of a new document and find the related documents, as in the code below?

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([    0,   958, 10576,  3277])
>>> cosine_similarities[related_docs_indices]
array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])
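
To make the "infer the vector of a new document" step concrete, here is a minimal pure-Python sketch rather than the scikit-learn code from the question. It assumes whitespace tokenization, a smoothed idf of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization; the helper names (`fit_idf`, `to_tfidf`, `most_similar`) and the toy corpus are hypothetical. The key point is that the new document is weighted with the idf values learned from the training corpus, and nothing is refitted. In scikit-learn the corresponding step would be calling `transform` on the already fitted vectorizer (not shown in the question) and passing the result to `linear_kernel` in place of `tfidf[0:1]`.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization
    return text.lower().split()


def fit_idf(docs):
    """Learn smoothed idf weights from the training corpus."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}


def to_tfidf(text, idf):
    """Map text to an L2-normalized sparse tf-idf vector (term -> weight).

    Terms that were not seen during training are dropped.
    """
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def most_similar(query, doc_vectors, idf, top_n=3):
    """Return (document_index, similarity) pairs, most similar first."""
    q = to_tfidf(query, idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(doc_vectors)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scored[:top_n]]


# Hypothetical toy corpus standing in for the real training data
train_docs = [
    "human machine interface for lab computer applications",
    "a survey of user opinion of computer system response time",
    "graph minors a survey",
]
idf = fit_idf(train_docs)
doc_vectors = [to_tfidf(d, idf) for d in train_docs]

# The new document reuses the trained idf weights
results = most_similar("human computer interaction", doc_vectors, idf, top_n=2)
print(results)
```

This is still a linear scan over all training vectors; it only shows how a query vector is obtained, not how to avoid the scan.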

python scikit-learn machine-learning document-classification tf-idf

Shlomi schwartz, Sep 25 '16 at 16:02

3 answers

DJanssens · Answer 1 · 2016-09-26T10:49:39+0000

( tf-idf ) . , ( ) ElasticSearch - .

: . , , (), .

tf-idf , ( ) . , , , , .

elyase · Answer 2 · 2016-09-26T08:48:00+0000

Take a look at gensim. Some starter code:

from gensim import corpora, models, similarities

# Tokenize the corpus once: one document per line
with open('corpus.txt') as f:
    texts = [line.lower().split() for line in f]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corpus)
# num_features must match the vocabulary size
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

At prediction time, first get the vector for the new document:

doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_tfidf = tfidf[vec_bow]

Then get the similarities against the indexed corpus:

sims = index[vec_tfidf] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

If that is not fast enough, you can look into approximate similarity search libraries (Annoy, Falconn, NMSLIB).
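
The similarity query above yields scores in document order, so the most similar documents still have to be picked out. A minimal sketch of that step, assuming `sims` is a plain sequence of scores indexed by document number (as the `enumerate` call suggests); the values and the `top_k` helper below are hypothetical:

```python
import heapq

# Hypothetical similarity scores, indexed by document number
sims = [0.12, 0.81, 0.0, 0.47, 0.33]


def top_k(scores, k=3):
    """Return the k (document_number, similarity) pairs with the highest similarity."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


best = top_k(sims, k=3)
print(best)
```

`heapq.nlargest` avoids sorting the full score list when only a handful of results is needed.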

Masoud · Answer 3 · 2016-09-25T16:29:57+0000

. Technic,

(, 50 ), ( , )
, New Document , (50 ) (, 2 )

With this technique, you can reduce the number of loop iterations and improve performance. You can read more about it in some chapters of this book: http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
