How should I vectorize the following list of lists with scikit?

I would like to use scikit to vectorize a list of lists. I walk a directory of training texts, read them in, and end up with something like this:

corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(analyzer='word')
vect_representation= vect.fit_transform(corpus)
print(vect_representation.toarray())

And I get the following:

return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'

There is also the problem of the labels at the end of each document: how do I handle them so the classification comes out right?

+4
2 answers

For everyone in the future, this is what solved my problem:

corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]

from sklearn.feature_extraction.text import CountVectorizer
# each inner list is treated as an already-tokenized document,
# so the identity tokenizer passes it through unchanged
bag_of_words = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(corpus)

And this is the result when I use the function .toarray():

[[0 0 1]
 [1 0 0]
 [0 1 0]]
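The trick above one-hot encodes each whole document as a single token. To also handle the labels raised in the question, here is a minimal sketch; the splitting logic and the `texts`/`labels` names are my own assumption, based on the `, 'LABEL'` suffix visible in the corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]

texts, labels = [], []
for [doc] in corpus:
    # "this is spam, 'SPAM'" -> text "this is spam", label "SPAM"
    text, label = doc.rsplit(", ", 1)
    texts.append(text)
    labels.append(label.strip("'"))

# vectorize only the label-free texts; keep the labels as the target vector
vect = CountVectorizer(analyzer='word')
X = vect.fit_transform(texts)  # shape: (3 documents, vocabulary size)
y = labels                     # ['SPAM', 'HAM', 'NOTHING']
```

`X` and `y` can then go straight into any scikit classifier, e.g. `MultinomialNB().fit(X, y)`.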

Thanks, guys!

+6

You should split the labels from the texts before passing them to CountVectorizer, something like:

corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]
from sklearn.feature_extraction.text import CountVectorizer
# ... split the labels from the texts, leaving one plain string per document
vect = CountVectorizer(analyzer='word')
vect_representation = vect.fit_transform(texts)  # texts: the label-free strings
...

You might also want to look at TfidfVectorizer.
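TfidfVectorizer is a drop-in swap for CountVectorizer. A minimal sketch, assuming the labels have already been stripped out (the `texts` list here is my own example data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["this is spam", "this is ham", "this is nothing"]

# same interface as CountVectorizer, but the raw counts are re-weighted
# by tf-idf, so terms that occur in every document (like "this", "is")
# contribute less than the distinguishing ones
tfidf = TfidfVectorizer(analyzer='word')
X = tfidf.fit_transform(texts)
```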

+2

Source: https://habr.com/ru/post/1569177/

