How to handle numbers inside text strings when vectorizing words?

If I have a text string to vectorize, how should I handle the numbers inside it? Or, if I feed a neural network both numbers and words, how can I keep the numbers as numbers?

I plan to build a dictionary of all my words (as suggested here). In that case, every string becomes an array of numbers. How should I handle tokens that are themselves numbers? How do I produce a vector that does not confuse a word index with an actual number?

Does converting the numbers to strings weaken the information that I feed to the network?

+5
3 answers

The link you cite assumes that everything resulting from .split(' ') is indexed: words, but also numbers, possibly emoticons, and so on. (I would take care of punctuation marks, though.) If you have no prior knowledge about your data or your problem, you can start with this.

EDIT

An example, literally using your string and code:

    corpus = {'my car number 3'}
    dictionary = {}
    i = 1  # next free index; indices start at 1
    for tweet in corpus:
        for word in tweet.split(" "):
            # every new token gets an index, whether it is a word or a number
            if word not in dictionary:
                dictionary[word] = i
                i += 1
    print(dictionary)
    # {'my': 1, 'car': 2, 'number': 3, '3': 4}
+2

Extending the discussion with @user1735003, let's look at both ways of representing numbers:

  • Treating the number as a string, that is, as just another word, and assigning it an identifier when building the dictionary. Or
  • Converting numbers to actual words: "1" becomes "one", "2" becomes "two", etc.
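
The two options above can be sketched as follows. This is only a minimal illustration, not code from the answers: `DIGIT_WORDS`, `build_vocab`, `digits_as_tokens` and `digits_as_words` are hypothetical names, and the digit-to-word mapping is assumed to cover single digits only.

```python
# Hypothetical mapping for option 2; covers single digits only.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def build_vocab(tokens):
    """Assign each distinct token an integer id, starting at 1."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1
    return vocab


def digits_as_tokens(text):
    """Option 1: leave number tokens untouched, like any other word."""
    return text.split()


def digits_as_words(text):
    """Option 2: replace single-digit tokens with their spelled-out word."""
    return [DIGIT_WORDS.get(token, token) for token in text.split()]


sentence = "my car number 3"
vocab_option_1 = build_vocab(digits_as_tokens(sentence))
vocab_option_2 = build_vocab(digits_as_words(sentence))
print(vocab_option_1)  # the token '3' gets its own id
print(vocab_option_2)  # the token 'three' gets that id instead
```

Either way the network only sees an index; the difference is which string that index stands for.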

Does the second approach change the context? To test this, we can measure the similarity between the two representations using word2vec. The scores will be high if the two forms are used in similar contexts.

For example, 1 and one have a similarity of 0.17, and 2 and two have a similarity of 0.23. This seems to suggest that the contexts in which they are used are quite different.
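
Similarity scores between word vectors are typically computed as cosine similarity. A minimal sketch of that computation, assuming made-up 3-dimensional vectors rather than real embeddings (the vectors and names below are illustrative only and are not the values behind the scores quoted above):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, for illustration only.
vec_digit = [1.0, 0.0, 0.0]
vec_word = [0.0, 1.0, 0.0]
vec_same = [2.0, 0.0, 0.0]

print(cosine_similarity(vec_digit, vec_word))  # orthogonal vectors: 0.0
print(cosine_similarity(vec_digit, vec_same))  # same direction: 1.0
```

With a trained model loaded in gensim, `KeyedVectors.similarity` returns the same kind of score for two words in the vocabulary.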

If you treat numbers as just another word, you do not change the context; with any other transformation of these numbers, you cannot guarantee that the result is better. It is therefore better to leave them untouched and treat them as just another word.

Note: Both word2vec and GloVe were trained treating numbers as strings (case 1).

+1

The following article may be useful: http://people.csail.mit.edu/mcollins/6864/slides/bikel.pdf

In particular, page 7.

Before falling back to the <unknown> tag, they try to replace alphanumeric character combinations with tags naming common patterns, for example:

 FourDigits (good for years) 

I tried to implement it, and it gave great results.
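
A rough sketch of this kind of pattern replacement, not the answerer's implementation and not the exact scheme from the slides. Only the `FourDigits` tag comes from the text above; the other tags, the regular expressions, and the name `normalize_token` are assumptions made for illustration.

```python
import re

# Ordered (pattern, tag) pairs; the first pattern that matches wins.
# Only "FourDigits" is named in the answer; the other tags are hypothetical.
PATTERNS = [
    (re.compile(r"\d{4}"), "FourDigits"),
    (re.compile(r"\d{2}"), "TwoDigits"),
    (re.compile(r"\d+"), "OtherNumber"),
    (re.compile(r"(?=.*\d)(?=.*[A-Za-z])[A-Za-z\d]+"), "DigitsAndLetters"),
]


def normalize_token(token, vocab):
    """Return the token if known, else a pattern tag, else <unknown>."""
    if token in vocab:
        return token
    for pattern, tag in PATTERNS:
        # fullmatch: the whole token must fit the pattern
        if pattern.fullmatch(token):
            return tag
    return "<unknown>"


vocab = {"my", "car", "number", "from"}
tokens = "my car number 3 from 1998 model a4".split()
print([normalize_token(t, vocab) for t in tokens])
```

Here an unseen year such as 1998 maps to `FourDigits` instead of `<unknown>`, so the network still learns something from it.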

0

Source: https://habr.com/ru/post/1269409/

