I am trying to write a text normalizer, and one of the main cases it needs to handle is turning something like 3.14 into "three point one four" or "three point fourteen".
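To make the goal concrete, this is roughly the output I am after for the decimal case (a throwaway sketch of the digit-by-digit reading; the names here are just placeholders):

DIGIT_WORDS = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
               '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}

def spell_decimal(token):
    # Spell a decimal token digit by digit, e.g. '3.14' -> 'three point one four'
    whole, _, frac = token.partition('.')
    words = [DIGIT_WORDS[d] for d in whole] + ['point'] + [DIGIT_WORDS[d] for d in frac]
    return ' '.join(words)

print(spell_decimal('3.14'))   # three point one four

The hard part is not the spelling itself but getting '3.14' out of the tokenizer as a single token in the first place.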
I am currently using the pattern \$?\d+(\.\d+)?%? with nltk.regexp_tokenize, which I believe should handle numbers as well as currency and percentages. However, at the moment $23.50 is processed fine (it tokenizes as ['$23.50']), but 3.14 tokenizes as ['3', '14'] - the decimal point is discarded.
I tried adding a separate alternative \d+.\d+ to my regex, but that didn't help (and shouldn't my current pattern already match that case anyway?).
Edit 2: I also found that the % part does not work correctly either - 20% returns only ['20']. I feel something is wrong in my regex, but I tested it in Pythex and there it seems fine.
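For reference, the number alternative does look correct when I test it on its own with plain re (this is essentially what I checked in Pythex; NUM is just a throwaway name for that one alternative):

import re

NUM = r'\$?\d+(\.\d+)?%?'
print(re.search(NUM, '3.14').group(0))    # 3.14
print(re.search(NUM, '20%').group(0))     # 20%
print(re.search(NUM, '$23.50').group(0))  # $23.50

So in isolation the alternative matches all three cases; the problem only shows up when it is used inside the full pattern with nltk.regexp_tokenize.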
Edit: Here is my code.
import nltk
import re

pattern = r'''(?x)        # set flag to allow verbose regexps
      ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+([-']\w+)*       # words with optional internal hyphens/apostrophes
    | \$?\d+(\.\d+)?%?    # numbers, incl. currency and percentages
    | [+/\-@&*]           # special characters with their own meaning
    '''

words = nltk.regexp_tokenize(line, pattern)   # 'line' is one raw line of the input text
words = [w.lower() for w in words]
print(words)
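Running it directly on one of the test lines below reproduces the problem (the commented output is what I actually get):

line = '3.14 is pi.'
print(nltk.regexp_tokenize(line, pattern))   # ['3', '14', 'is', 'pi'] - the '.' is lost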
Here are some of my test lines:
32188
2598473
26 letters from A to Z
3.14 is pi. <-- ['3', '14', 'is', 'pi']
My weight is about 68 kg, +/- 10 grams.
Good muffins cost $3.88 in New York <-- ['good', 'muffins', 'cost', '$3.88', 'in', 'new', 'york']