I am new to scikit-learn and am confused because TfidfVectorizer sometimes returns a different vector for the same document.
My corpus contains more than 100 documents.
I run:
vectorizer = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
X = vectorizer.fit_transform(corpus)
to initialize the TfidfVectorizer and fit it on the documents in the corpus. corpus is a list of text strings.
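For reference, corpus is just a plain Python list, roughly like this (these strings are only placeholders, not my actual documents):

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "never jump over the lazy dog quickly",
    # ... about 100 more documents
]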
Subsequently, if I do this:
test = list(vectorizer.transform([corpus[0]]).toarray()[0])
test == list(X.toarray()[0])
The result is False.
If I print the first 20 elements of list(X.toarray()[0]) and test, respectively, you can see that they differ by a tiny fraction, when I would expect them to be identical:
[0.16971458376720741, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
vs.
[0.16971458376720716, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
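The values only disagree in the last couple of decimal digits (a difference of about 2.5e-16 for the first element). If I compare with a tolerance instead of exact equality, for example with numpy:

import numpy as np

# compare the row from fit_transform with the row from transform,
# allowing for floating-point rounding differences
np.allclose(X.toarray()[0], vectorizer.transform([corpus[0]]).toarray()[0])

I assume that would return True, since a difference this small is well within allclose's default tolerances, so the mismatch looks like floating-point rounding rather than genuinely different tf-idf weights. Still, I would like to understand why the exact values differ at all.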
But if I do this:
test_1 = list(vectorizer.transform([corpus[0]]).toarray()[0])
test_2 = list(vectorizer.transform([corpus[0]]).toarray()[0])
test_1 == test_2
The result is True. Here I essentially compute the same vector twice, which is what I thought was happening in the first example as well (since X contains the vectors returned by fit_transform).
Why are the vectors different in my first example? Am I doing something wrong here?