Scikit-learn TfidfVectorizer.transform() returns variable results for a single document

I am new to scikit-learn and am confused because TfidfVectorizer sometimes seems to return a different vector for the same document.

My corpus contains > 100 documents.

I run:

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
    X = vectorizer.fit_transform(corpus)

to initialize the TfidfVectorizer and fit it on the documents in the corpus. corpus is a list of text strings.

Subsequently, if I do this:

    test = list(vectorizer.transform([corpus[0]]).toarray()[0])
    test == list(X.toarray()[0])

the result is False.

If I print the first 20 elements of list(X.toarray()[0]) and test, respectively, you can see that they differ by a tiny fraction in the last decimal places, when I expect them to be identical.

    [0.16971458376720741, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

vs.

    [0.16971458376720716, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
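Measuring the gap directly confirms it is minuscule (this snippet assumes the same X and vectorizer as above):

    import numpy as np

    # Largest element-wise difference between the two vectors
    diff = np.abs(X.toarray()[0] - vectorizer.transform([corpus[0]]).toarray()[0])
    print(diff.max())  # something on the order of 1e-16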

But if I do this:

    test_1 = list(vectorizer.transform([corpus[0]]).toarray()[0])
    test_2 = list(vectorizer.transform([corpus[0]]).toarray()[0])
    test_1 == test_2

the result is True. Here I essentially compute the vector twice, which is what I thought was happening in the first example as well (since X contains the vectors returned during fit_transform).

Why are the vectors different in my first example? Am I doing something wrong here?

1 answer

As mentioned in the comments, this is most likely a rounding error, and it is probably not worth worrying about.

However, I think it's worth trying to understand the phenomenon.

These errors happen because the numbers on your computer do not have infinite precision: a typical numpy float is stored in 64 bits.

The fact that they have finite precision means that addition is no longer associative: a + (b + c) is not always exactly equal to (a + b) + c.
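A minimal illustration of this with plain Python floats:

    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c)                 # 0.6000000000000001
    print(a + (b + c))                 # 0.6
    print((a + b) + c == a + (b + c))  # False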

Here is a script that shows the same behavior at a larger scale:

    import numpy as np

    a = np.random.random(size=1000000)
    print(a.dtype)
    print("%.15f" % a.sum())

    # Summing the same numbers in a different order gives a slightly different result.
    b = np.random.permutation(a)
    print("%.15f" % b.sum())

Output:

    float64
    500399.674621732032392
    500399.674621731741354

Now, if we extend the above script to also try 32-bit floats:

    # Repeat the experiment with 32-bit floats.
    a = a.astype(np.float32)
    print(a.dtype)
    print("%.15f" % a.sum())

    b = np.random.permutation(a)
    print("%.15f" % b.sum())

We get:

    float64
    500214.871674167399760
    500214.871674167283345
    float32
    500214.937500000000000
    500215.000000000000000

You can see that the error is much larger, because 32-bit floats are less precise than 64-bit floats.
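This is also why exact equality checks on floating-point vectors are fragile. If you just want to confirm that the two vectors agree up to rounding, a tolerance-based comparison is the usual tool. A sketch, reusing the X and vectorizer from your question:

    import numpy as np

    row_from_fit = X.toarray()[0]
    row_from_transform = vectorizer.transform([corpus[0]]).toarray()[0]

    # Exact comparison is sensitive to the last bits of each float...
    print(np.array_equal(row_from_fit, row_from_transform))  # False in your case

    # ...while a tolerance-based comparison ignores rounding noise.
    print(np.allclose(row_from_fit, row_from_transform))     # True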

Now, if you find this interesting and want to know more, numpy gives you detailed information about how floats are stored via the np.finfo function:

    In [10]: np.finfo(np.float32)
    Out[10]: finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
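In particular, the eps attribute tells you the gap between 1.0 and the next representable value, which is a handy rule of thumb for the relative precision of each type:

    import numpy as np

    print(np.finfo(np.float32).eps)  # ~1.19e-07
    print(np.finfo(np.float64).eps)  # ~2.22e-16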

Sorry if this does not directly answer your question ;). The cause of the discrepancy in your case may not be exactly what I explained; I am writing this because I think that if you were familiar with these points, you would not have asked the question in the first place.

Hope this helps!
