Python scikit-learn cosine similarity ValueError: could not convert integer scalar

I am trying to create a cosine similarity matrix from textual descriptions of applications. The script below first reads in a CSV data file (I can provide a data file if necessary) that contains two columns: one with two categories of applications, and the other with tokenized descriptions for several applications in each of these two categories. For each category, the script then builds the tf-idf matrix and tries to compute the cosine similarity matrix.

Yesterday I updated 64-bit Anaconda for Windows to make sure I have the latest versions of Python, NumPy, SciPy and scikit-learn.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os

print ('reading file into pandas')
data = pd.read_csv(os.path.join('inputfile.csv'))
cats = np.unique(data['category'])

for i in cats:
    print ()
    print ('prepping', i)
    d2 = data[data.category == i]
    descStem = d2.descStem.tolist()

    print ('vectorizing', i)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2, stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
    print (tfidf_matrix.shape)

    print ('calculating cosine sim', i)
    cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)

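When I run the script, the first category has tfidf_matrix.shape = (3119, 8217). The second category has tfidf_matrix.shape = (90327, 62863), so its similarity matrix would have 90327 x 90327 elements, which is more than 2^32.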
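As a rough check on the 2^32 guess, the arithmetic can be sketched as below. The helper name dense_similarity_size is made up for illustration, and the 8 bytes per value assumes the float64 dtype that TfidfVectorizer uses by default. For 90327 rows this comes to about 8.16 billion elements, or roughly 61 GiB if stored densely.

```python
# Size check for a dense n x n cosine similarity matrix.
# dense_similarity_size is a hypothetical helper, not part of any library.
def dense_similarity_size(n_rows, bytes_per_value=8):
    """Return (element_count, size_in_GiB) for an n_rows x n_rows matrix.

    bytes_per_value=8 assumes float64 values.
    """
    n_elements = n_rows * n_rows
    return n_elements, n_elements * bytes_per_value / 2**30


for n_rows in (3119, 90327):
    n_elements, gib = dense_similarity_size(n_rows)
    # Print row count, element count, whether it exceeds 2^32, and size in GiB.
    print(n_rows, n_elements, n_elements > 2**32, round(gib, 2))
```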

Traceback (most recent call last):

File "<ipython-input-1-4b2586ddeca4>", line 1, in <module>

runfile('Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py', wdir='Z:/rangus/gplay/marcello/data/similarity/error')

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py", line 23, in <module>
cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py", line 918, in cosine_similarity
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\utils\extmath.py", line 186, in safe_sparse_dot
ret = ret.toarray()

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 920, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\coo.py", line 258, in toarray
B.ravel('A'), fortran)

ValueError: could not convert integer scalar

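I also tried the variant below, which converts the tf-idf matrix to a dense array before calling cosine_similarity; with it, the script [...] 40+ [...].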

print ('vectorizing', i)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
tfidf_matrixD = tfidf_matrix.toarray()

print ('calculating cosine sim', i)
cosOrig = cosine_similarity(tfidf_matrixD, tfidf_matrixD)
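One possible way to avoid building the whole similarity matrix in a single call is to compute it a block of rows at a time. This is only a sketch, and it assumes each block can be processed or written out before the next one is computed. The generator name cosine_similarity_blocks, the block_size parameter and the toy corpus are all made up for illustration. Only TfidfVectorizer and cosine_similarity come from scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def cosine_similarity_blocks(matrix, block_size=1000):
    """Yield (start, stop, block), where block holds the similarities of
    rows start:stop against all rows of matrix.

    Hypothetical helper: only one (block_size x n_rows) dense array
    exists at a time instead of the full n_rows x n_rows result.
    """
    n_rows = matrix.shape[0]
    for start in range(0, n_rows, block_size):
        stop = min(start + block_size, n_rows)
        yield start, stop, cosine_similarity(matrix[start:stop], matrix)


# Toy corpus (invented), standing in for descStem from the script.
toy_docs = [
    "photo editor with filters",
    "photo camera filters and effects",
    "puzzle game with levels",
    "word puzzle game",
    "camera app with effects",
]
toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

for start, stop, block in cosine_similarity_blocks(toy_tfidf, block_size=2):
    # Each block could be reduced (e.g. top matches per row) or saved here.
    print(start, stop, block.shape)
```

Stacking the blocks gives the same values as a single cosine_similarity call, so this does not reduce the total amount of output, only how much of it is held in memory at once.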

I have searched StackOverflow [...]


Source: https://habr.com/ru/post/1673442/

