Using multiple features with scikit-learn

I am working on text classification using scikit-learn. Everything works well with a single feature, but introducing multiple features gives me errors. I think the problem is that I am not formatting the data the way the classifier expects.

For example, this works great:

data = np.array(df['feature1'])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)

classifier = Pipeline(...)

classifier.fit(X_train, Y_train)

But this:

data = np.array(df[['feature1', 'feature2']])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)

classifier = Pipeline(...)

classifier.fit(X_train, Y_train)

dies with

Traceback (most recent call last):
  File "/Users/jed/Dropbox/LegalMetric/LegalMetricML/motion_classifier.py", line 157, in <module>
    classifier.fit(X_train, Y_train)
  File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)
  File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 120, in _pre_transform
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

at the preprocessing stage after calling classifier.fit(). I think the problem is with how I am formatting the data, but I cannot figure out how to do it correctly.

feature1 and feature2 are English text strings, as is the target. I am using LabelEncoder() to encode the target, which seems to be working fine.
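(For reference, a minimal sketch of how label_encoder is presumably set up — the question does not show this part, so the fit call below is an assumption:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(df['target'])  # learns the set of string class labels
# label_encoder.transform(...) then maps each label to an integer id
)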

Here is an example of what print(data) returns, to give you an idea of how it is formatted right now.

[['some short english text'
  'a paragraph of english text']
 ['some more short english text'
  'a second paragraph of english text']
 ['some more short english text'
  'a third paragraph of english text']]

The vectorizer expects a str (something it can call .lower() on), but here it is being handed an ndarray, even though that ndarray contains strs.

Incidentally, why not just use

data = df[['feature1', 'feature2']].values

and

df['target'].values

A DataFrame column is already backed by an np.ndarray.

With two columns selected, each "document" that reaches the vectorizer is not a plain string but an ndarray wrapping the strings, hence the AttributeError.
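To make the failure mode concrete, here is a small sketch (with made-up strings) of what the vectorizer receives in each case:

import numpy as np

docs_ok = np.array(['some short english text', 'some more short english text'])
docs_2col = np.array([['some short english text', 'a paragraph of english text'],
                      ['some more short english text', 'a second paragraph of english text']])

print(type(docs_ok[0]))    # numpy.str_ (a str subclass) -- .lower() works
print(type(docs_2col[0]))  # numpy.ndarray -- no .lower(), hence the AttributeError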


Alternatively, you can combine the two columns into a single array:

data = np.append(df.feature1, df.feature2)
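As a usage note (this is a sketch of an alternative, not part of the original answer): np.append stacks the two columns end to end, which doubles the number of documents. If you instead want one document per row that contains both fields, you can join the columns per row and fit the pipeline on the resulting 1-D array of strings:

data = (df['feature1'] + ' ' + df['feature2']).values  # one combined string per row
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier.fit(X_train, Y_train)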

Source: https://habr.com/ru/post/1525514/

