CountVectorizer().fit() in scikit-learn (Python) gives a MemoryError

I am working on a classification problem with 8 classes; the training set contains about 400,000 labeled objects. I use CountVectorizer.fit() to vectorize the data, but I get a MemoryError. I tried HashingVectorizer instead, but in vain.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

path = 'data/products.tsv'
products = pd.read_table(path, header=None, names=['label', 'entry'])
X = products.entry
y = products.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Vectorizing the dataset
vect = CountVectorizer()
vect.fit(X_train.values.astype('U'))  # cast to Unicode so NaN/non-string entries don't break fit
X_train_dtm = vect.transform(X_train.values.astype('U'))
X_test_dtm = vect.transform(X_test.values.astype('U'))
1 answer

Use the max_features parameter to cap the vocabulary size. On a corpus of ~400,000 documents, fitting CountVectorizer() with an unbounded vocabulary can easily exhaust memory; limiting it to roughly 10,000 of the most frequent terms keeps the document-term matrix manageable. HashingVectorizer is another option, since it avoids storing a vocabulary at all, but you still need to bound its number of features.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

path = 'data/products.tsv'
products = pd.read_table(path, header=None, names=['label', 'entry'])
X = products.entry
y = products.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Vectorizing the dataset, keeping only the 10,000 most frequent terms
vect = CountVectorizer(max_features=10000)
vect.fit(X_train.values.astype('U'))
X_train_dtm = vect.transform(X_train.values.astype('U'))
X_test_dtm = vect.transform(X_test.values.astype('U'))
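If capping max_features still loses too much vocabulary, HashingVectorizer with an explicit n_features bound keeps memory constant regardless of corpus size, because it never builds a vocabulary. A minimal sketch (the sample documents below are made up for illustration, not the asker's products.tsv):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical sample documents standing in for the 'entry' column
docs = ["red shirt cotton", "blue jeans denim", "red dress silk"]

# n_features fixes the matrix width up front; alternate_sign=False keeps
# the counts non-negative, which many classifiers (e.g. MultinomialNB) expect
vect = HashingVectorizer(n_features=2**10, alternate_sign=False)
X = vect.transform(docs)  # no fit() needed, so no vocabulary is held in memory
print(X.shape)  # (3, 1024)
```

The trade-off is that hashing is one-way: you cannot map columns back to words, and distinct terms may collide in the same bucket.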

Source: https://habr.com/ru/post/1656253/
