CountVectorizer().fit() in scikit-learn (Python) gives a MemoryError

I am working on a classification problem with 8 classes; the training set contains about 400,000 labeled objects. I use CountVectorizer.fit() to vectorize the data, but I get a MemoryError. I tried HashingVectorizer instead, but in vain.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

path = 'data/products.tsv'
products = pd.read_table(path, header=None, names=['label', 'entry'])
X = products.entry
y = products.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Vectorizing the dataset
vect = CountVectorizer()
vect.fit(X_train.values.astype('U'))  # cast to Unicode so NaN/non-string entries don't break fit
X_train_dtm = vect.transform(X_train.values.astype('U'))
X_test_dtm = vect.transform(X_test.values.astype('U'))
1 answer

Use the max_features parameter to cap the vocabulary size. On a corpus of ~400,000 documents, fitting CountVectorizer() with an unbounded vocabulary can easily exhaust memory; limiting it to roughly 10,000 of the most frequent terms keeps the document-term matrix manageable. HashingVectorizer is another option, since it avoids storing a vocabulary at all, but you still need to bound its number of features.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

path = 'data/products.tsv'
products = pd.read_table(path, header=None, names=['label', 'entry'])
X = products.entry
y = products.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Vectorizing the dataset, keeping only the 10,000 most frequent terms
vect = CountVectorizer(max_features=10000)
vect.fit(X_train.values.astype('U'))
X_train_dtm = vect.transform(X_train.values.astype('U'))
X_test_dtm = vect.transform(X_test.values.astype('U'))
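If capping max_features still loses too much vocabulary, HashingVectorizer with an explicit n_features bound keeps memory constant regardless of corpus size, because it never builds a vocabulary. A minimal sketch (the sample documents below are made up for illustration, not the asker's products.tsv):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical sample documents standing in for the 'entry' column
docs = ["red shirt cotton", "blue jeans denim", "red dress silk"]

# n_features fixes the matrix width up front; alternate_sign=False keeps
# the counts non-negative, which many classifiers (e.g. MultinomialNB) expect
vect = HashingVectorizer(n_features=2**10, alternate_sign=False)
X = vect.transform(docs)  # no fit() needed, so no vocabulary is held in memory
print(X.shape)  # (3, 1024)
```

The trade-off is that hashing is one-way: you cannot map columns back to words, and distinct terms may collide in the same bucket.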

Source: https://habr.com/ru/post/1656253/
