How should I preprocess text for word embeddings?

In the traditional one-hot representation of words as vectors, you have a vector whose size equals the cardinality of your vocabulary. To reduce dimensionality, you usually remove stop words and apply stemming, lemmatization, etc. to normalize the features for the NLP task you want to perform.
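For context, here is a minimal sketch of that classical normalization pipeline in Python (assuming NLTK is installed with its stopwords, punkt and wordnet resources; the particular stop word list and lemmatizer are just illustrative choices):

```python
# Classical normalization before building one-hot / bag-of-words features:
# lowercase, tokenize, drop punctuation and stop words, lemmatize.
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def normalize(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in string.punctuation]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMATIZER.lemmatize(t) for t in tokens]

print(normalize("The cats were sitting on the mats."))
# -> ['cat', 'sitting', 'mat']
```

This kind of aggressive normalization makes sense for sparse one-hot features; the question is whether it still makes sense before training word embeddings.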

I am having trouble understanding whether/how to preprocess text for training embeddings (e.g. word2vec). My goal is to use these word embeddings as features for a neural network that classifies texts as topic A vs. not topic A, and then to do event extraction on the topic-A documents (using a second neural network).

My first instinct is to preprocess by removing stop words, lemmatizing, etc. But as I learn a bit more about neural networks, I realize that, applied to natural language, the CBOW and skip-gram models actually need the whole set of words to be present: to predict a word from its context, you need the actual context, not a reduced context after normalization... right? The actual sequence of POS tags seems to be key to predicting words the way a human would.
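To make that concrete, here is a small gensim sketch (a sketch only, assuming gensim 4.x; the two toy sentences are made up). The sentences fed to CBOW/skip-gram are kept essentially intact, with stop words and inflections left in place:

```python
# Train word2vec on lightly processed sentences: lowercasing and tokenization
# only, no stop word removal and no stemming/lemmatization, so the model
# sees the real context windows.
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

raw_sentences = [
    "The explosion occurred near the main station on Friday evening.",
    "Officials said the station was evacuated after the explosion.",
]

# simple_preprocess lowercases and tokenizes (dropping punctuation and very
# short tokens); it does not remove stop words or lemmatize.
sentences = [simple_preprocess(s) for s in raw_sentences]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window used by CBOW / skip-gram
    min_count=1,      # keep every token in this toy corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
    epochs=50,
)

print(model.wv.most_similar("explosion", topn=3))
```

The trained vectors in `model.wv` could then be used as features for the topic classifier.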

I found a few recommendations on the Internet, but I’m still interested to know what this community thinks:

  • Are there any recent, generally accepted best practices regarding punctuation, stemming, lemmatization, stop words, numbers, lowercasing, etc.?
  • If so, what are they? Is it generally better to preprocess as little as possible, or to go heavier and normalize the text more? Is there a trade-off?

My thoughts:

It’s better to remove punctuation (but, for example, in Spanish don’t remove the accents, because they convey contextual information), convert spelled-out numbers to numerals, not lowercase everything (useful for entity extraction), not stem, and not lemmatize.

Does this sound right?
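Concretely, that recipe might look roughly like this (a sketch only; the tiny spelled-out-number map is an illustrative stand-in for a real number parser):

```python
import re

# Toy mapping from spelled-out numbers to digits; a real pipeline would use
# a proper number-parsing step instead of this illustrative dict.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "ten": "10"}

def light_preprocess(text):
    # Strip punctuation but keep letters (including accented ones), digits
    # and whitespace; do NOT lowercase, stem or lemmatize.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    return [NUMBER_WORDS.get(t.lower(), t) for t in tokens]

print(light_preprocess("Así pasó: two explosions hit the Estación Central."))
# -> ['Así', 'pasó', '2', 'explosions', 'hit', 'the', 'Estación', 'Central']
```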

+4
3 answers


This paper studies exactly this question, the impact of common preprocessing choices on neural network models for text classification:

Paper: https://arxiv.org/pdf/1707.01780.pdf

Its rough takeaway: simple tokenization of the input text is generally adequate, and heavier normalization such as lemmatization does not bring consistent gains, although results vary enough across preprocessing choices that the step is worth paying attention to.

+2
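One way to sanity-check that finding on your own data is to train one model on raw tokens and one on lemmatized tokens and compare the neighborhoods of words you care about. A minimal sketch with gensim and NLTK's lemmatizer (the toy corpus is a placeholder for your own sentences):

```python
# Train two word2vec models, one on raw tokens and one on lemmatized tokens,
# and compare nearest neighbours for a few probe words.
from gensim.models import Word2Vec
from nltk.stem import WordNetLemmatizer

# Toy corpus; replace with your own tokenized sentences.
sentences = [
    ["the", "explosions", "damaged", "the", "stations"],
    ["two", "explosions", "were", "reported", "near", "the", "station"],
    ["the", "station", "was", "closed", "after", "the", "explosion"],
]

lemmatizer = WordNetLemmatizer()
lemma_sentences = [[lemmatizer.lemmatize(t) for t in s] for s in sentences]

raw_model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=100)
lemma_model = Word2Vec(lemma_sentences, vector_size=50, window=5, min_count=1, epochs=100)

for probe in ["explosion", "station"]:
    print(probe)
    print("  raw:       ", raw_model.wv.most_similar(probe, topn=3))
    print("  lemmatized:", lemma_model.wv.most_similar(probe, topn=3))
```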


Source: https://habr.com/ru/post/1678284/

