This was due to the way I built the list of samples. Printing it (using a different sample) shows the problem:
    >>> import pandas as pd
    >>> train = pd.DataFrame({'a' : ['a', 'b', 'a'], 'd' : ['e', 'e', 'f'],
    ...                       'b' : [0, 1, 1], 'c' : ['b', 'c', 'b']})
    >>> samples = [dict(enumerate(sample)) for sample in train]
    >>> samples
    [{0: 'a'}, {0: 'b'}, {0: 'c'}, {0: 'd'}]
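This is a list of dicts, but iterating over a DataFrame yields its column names rather than its rows, so each dict only enumerates the characters of one column name. We should build one dict per row instead: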
    >>> train_as_dicts = [dict(r.iteritems()) for _, r in train.iterrows()]
    >>> train_as_dicts
    [{'a': 'a', 'c': 'b', 'b': 0, 'd': 'e'},
     {'a': 'b', 'c': 'c', 'b': 1, 'd': 'e'},
     {'a': 'a', 'c': 'b', 'b': 1, 'd': 'f'}]

Now we need to vectorize the dicts:

    >>> from sklearn.feature_extraction import DictVectorizer
    >>> vectorizer = DictVectorizer()
    >>> vectorized_sparse = vectorizer.fit_transform(train_as_dicts)
    >>> vectorized_sparse
    <3x7 sparse matrix of type '<type 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>
    >>> vectorized_array = vectorized_sparse.toarray()
    >>> vectorized_array
    array([[ 1.,  0.,  0.,  1.,  0.,  1.,  0.],
           [ 0.,  1.,  1.,  0.,  1.,  1.,  0.],
           [ 1.,  0.,  1.,  1.,  0.,  0.,  1.]])

To get the meaning of each column, ask the vectorizer:

    >>> vectorizer.get_feature_names()
    ['a=a', 'a=b', 'b', 'c=b', 'c=c', 'd=e', 'd=f']
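The session above was run on Python 2 with older library versions. Newer releases of pandas and scikit-learn no longer provide `Series.iteritems()` or `DictVectorizer.get_feature_names()`, so here is a minimal sketch of the same steps as a script that should work on recent versions. It swaps in `DataFrame.to_dict(orient='records')` for the row loop and passes `sparse=False` to get a dense array directly. Both are my substitutions rather than part of the original session, and the `hasattr` fallback is only there so the script also runs where `get_feature_names_out()` is not available.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

train = pd.DataFrame({'a': ['a', 'b', 'a'], 'd': ['e', 'e', 'f'],
                      'b': [0, 1, 1], 'c': ['b', 'c', 'b']})

# One dict per row, keyed by column name (replaces the iterrows() loop).
train_as_dicts = train.to_dict(orient='records')

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
vectorizer = DictVectorizer(sparse=False)
vectorized_array = vectorizer.fit_transform(train_as_dicts)

# Newer scikit-learn exposes get_feature_names_out(); fall back to the
# older get_feature_names() when it is missing.
if hasattr(vectorizer, 'get_feature_names_out'):
    feature_names = [str(name) for name in vectorizer.get_feature_names_out()]
else:
    feature_names = vectorizer.get_feature_names()

print(feature_names)
print(vectorized_array)
```

String-valued columns are one-hot encoded into `column=value` features, while the numeric column `b` is passed through as a single feature.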