Inspired by this answer, I am trying to find the cosine similarity between a trained tf-idf vectorizer and a new document, and return the most similar documents.
The code below finds the cosine similarities for the first vector of the training set, not for a new query:
>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1. , 0.04405952, 0.11016969, ..., 0.04433602,
0.04457106, 0.03293218])
Since my training data is huge, looping through the entire trained vectorizer sounds like a bad idea. How can I infer the vector of a new document and find the related documents, the same way the code below does?
>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([ 0, 958, 10576, 3277])
>>> cosine_similarities[related_docs_indices]
array([ 1. , 0.54967926, 0.32902194, 0.2825788 ])
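To make the goal concrete, here is a minimal sketch of what I am after. It assumes the fitted `TfidfVectorizer` is still available (called `vectorizer` here) and that calling its `transform` method on the new text is the right way to reuse the learned vocabulary and idf weights. The toy corpus and the helper `most_similar` are hypothetical stand-ins for illustration, not my real data or code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy corpus standing in for the real (huge) training data.
train_docs = [
    "the cat sat on the mat",
    "dogs chase cats in the garden",
    "stock markets fell sharply today",
    "investors worry about falling stock prices",
]

# Fit once on the training data; tfidf is a sparse document-term matrix.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(train_docs)


def most_similar(query, top_n=2):
    """Return indices and scores of the top_n training docs closest to query."""
    # transform (not fit_transform) reuses the fitted vocabulary and idf weights.
    query_vec = vectorizer.transform([query])
    # Rows are L2-normalized by default, so the dot product is the cosine similarity.
    sims = linear_kernel(query_vec, tfidf).flatten()
    # Sort descending and keep the top_n indices.
    idx = sims.argsort()[::-1][:top_n]
    return idx, sims[idx]


indices, scores = most_similar("stock prices fell")
print(indices, scores)
```

This is a single sparse matrix product against `tfidf`, with no Python-level loop over the training documents. Unlike the `tfidf[0:1]` example, the query is not part of the training set, so the top result is not a trivial self-match with similarity 1.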