I am trying to convert a DataFrame Pandas to a NumPy array to create a model with Sklearn. I will simplify this problem.
>>> mydf.head(10) IdVisita 445 latam 446 NaN 447 grados 448 grados 449 eventos 450 eventos 451 Reescribe-medios-clases-online 454 postgrados 455 postgrados 456 postgrados Name: cat1, dtype: object >>> from sklearn import preprocessing >>> enc = preprocessing.OneHotEncoder() >>> enc.fit(mydf)
Traceback:
ValueError Traceback (most recent call last) <ipython-input-74-f581ab15cbed> in <module>() 2 mydf.head(10) 3 enc = preprocessing.OneHotEncoder() ----> 4 enc.fit(mydf) /home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit(self, X, y) 996 self 997 """ --> 998 self.fit_transform(X) 999 return self 1000 /home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit_transform(self, X, y) 1052 """ 1053 return _transform_selected(X, self._fit_transform, -> 1054 self.categorical_features, copy=True) 1055 1056 def _transform(self, X): /home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _transform_selected(X, transform, selected, copy) 870 """ 871 if selected == "all": --> 872 return transform(X) 873 874 X = atleast2d_or_csc(X, copy=copy) /home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _fit_transform(self, X) 1001 def _fit_transform(self, X): 1002 """Assumes X contains only categorical features.""" -> 1003 X = check_arrays(X, sparse_format='dense', dtype=np.int)[0] 1004 if np.any(X < 0): 1005 raise ValueError("X needs to contain only non-negative integers.") /home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_arrays(*arrays, **options) 279 array = np.ascontiguousarray(array, dtype=dtype) 280 else: --> 281 array = np.asarray(array, dtype=dtype) 282 if not allow_nans: 283 _assert_all_finite(array) /home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order) 460 461 """ --> 462 return array(a, dtype, copy=False, order=order) 463 464 def asanyarray(a, dtype=None, order=None): ValueError: invalid literal for long() with base 10: 'postgrados'
Note IdVisita is an index here, and numbers may not be all in a row.
Any clues?
source share