Keep text data ordering when vectorized

I am trying to write a machine learning algorithm with scikit-learn that parses text and classifies it based on training data.

An example of working with text data, taken directly from the scikit-learn documentation, uses CountVectorizer to create a sparse array of how many times each word appears:

>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> twenty_train = fetch_20newsgroups(subset='train')
>>> count_vect = CountVectorizer()
>>> X_train_counts = count_vect.fit_transform(twenty_train.data)

Unfortunately, this does not capture any ordering of phrases. You can use larger n-grams (CountVectorizer(ngram_range=(min, max))) to pick up specific phrases, but this quickly inflates the number of features and still isn't that great.
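To make both points concrete, here is a small sketch on a toy corpus of my own (not from the scikit-learn docs): two reordered sentences get identical count vectors, and widening ngram_range rapidly grows the feature count.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Word order is discarded: two opposite sentences vectorize identically.
v = CountVectorizer()
X = v.fit_transform(["dog bites man", "man bites dog"])
print(np.array_equal(X[0].toarray(), X[1].toarray()))  # True

# Widening ngram_range captures phrases but inflates the feature count.
docs = ["the quick brown fox", "the lazy brown dog"]
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    n_features = CountVectorizer(ngram_range=ngram_range).fit_transform(docs).shape[1]
    print(ngram_range, "->", n_features, "features")
# (1, 1) -> 6 features
# (1, 2) -> 12 features
# (1, 3) -> 16 features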

Is there a good way to handle ordered text differently? I am definitely open to using a natural language parser (nltk, textblob, etc.) alongside scikit-learn.

1 answer

How about word2vec? Word2vec embeds words as neural-network vectors in a context-sensitive way, which can provide a more sophisticated feature set for your classifier.

The standard Python implementation of word2vec is gensim. To install or upgrade it, run one of:

easy_install -U gensim

pip install --upgrade gensim

A simple word2vec example:

import gensim

# Each document is a pre-tokenized list of words.
documents = [['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

# min_count=1 keeps even words that occur only once in this tiny corpus.
model = gensim.models.Word2Vec(documents, min_count=1)
print(model.wv["survey"])

, "", .

Gensim is well documented, and its documentation covers the training options in more detail.


Source: https://habr.com/ru/post/1649604/

