I am trying to write a text normalizer, and one of the main cases it needs to handle is turning something like 3.14 into "three point one four" or "three point fourteen".
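To make the goal concrete, this is roughly the output I am after for the decimal case (a throwaway sketch of the digit-by-digit reading; the names here are just placeholders):

DIGIT_WORDS = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
               '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}

def spell_decimal(token):
    # Spell a decimal token digit by digit, e.g. '3.14' -> 'three point one four'
    whole, _, frac = token.partition('.')
    words = [DIGIT_WORDS[d] for d in whole] + ['point'] + [DIGIT_WORDS[d] for d in frac]
    return ' '.join(words)

print(spell_decimal('3.14'))   # three point one four

The hard part is not the spelling itself but getting '3.14' out of the tokenizer as a single token in the first place.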
I am currently using the pattern \$?\d+(\.\d+)?%? with nltk.regexp_tokenize, which I believe should handle numbers as well as currency and percentages. However, at the moment $23.50 is processed fine (it tokenizes as ['$23.50']), but 3.14 tokenizes as ['3', '14'] - the decimal point is discarded.
I tried adding a separate alternative \d+.\d+ to my regex, but that didn't help (and shouldn't my current pattern already match that case anyway?).
Edit 2: I also found that the % part does not work correctly either - 20% returns only ['20']. I feel something is wrong in my regex, but I tested it in Pythex and there it seems fine.
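For reference, the number alternative does look correct when I test it on its own with plain re (this is essentially what I checked in Pythex; NUM is just a throwaway name for that one alternative):

import re

NUM = r'\$?\d+(\.\d+)?%?'
print(re.search(NUM, '3.14').group(0))    # 3.14
print(re.search(NUM, '20%').group(0))     # 20%
print(re.search(NUM, '$23.50').group(0))  # $23.50

So in isolation the alternative matches all three cases; the problem only shows up when it is used inside the full pattern with nltk.regexp_tokenize.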
Edit: Here is my code.
import nltk
import re

pattern = r'''(?x)        # set flag to allow verbose regexps
      ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+([-']\w+)*       # words with optional internal hyphens/apostrophes
    | \$?\d+(\.\d+)?%?    # numbers, incl. currency and percentages
    | [+/\-@&*]           # special characters with their own meaning
    '''

words = nltk.regexp_tokenize(line, pattern)   # 'line' is one raw line of the input text
words = [w.lower() for w in words]
print(words)
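Running it directly on one of the test lines below reproduces the problem (the commented output is what I actually get):

line = '3.14 is pi.'
print(nltk.regexp_tokenize(line, pattern))   # ['3', '14', 'is', 'pi'] - the '.' is lost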
Here are some of my test lines:
32188
2598473
26 letters from A to Z
3.14 is pi. <-- ['3', '14', 'is', 'pi']
My weight is about 68 kg, +/- 10 grams.
Good muffins cost $3.88 in New York <-- ['good', 'muffins', 'cost', '$3.88', 'in', 'new', 'york']