Consequences of abusing nltk's word_tokenize(sent)

I am trying to split a paragraph into words. I have the wonderful nltk.tokenize.word_tokenize(sent) at hand, but help(word_tokenize) says: "This tokenizer is designed to work on a sentence at a time."

Does anyone know what can happen if you use it on a whole paragraph, i.e. up to 5 sentences? I tried it on a few short paragraphs myself and it seems to work, but that is hardly conclusive proof.
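For anyone who wants to test this more systematically, here is a minimal sketch of the kind of check I have in mind (the sample paragraph is made up): tokenize the whole paragraph at once, then sentence by sentence via nltk.sent_tokenize, and compare.

    >>> from nltk.tokenize import sent_tokenize, word_tokenize
    >>> para = "I saw Dr. Smith. He waved."  # hypothetical test paragraph
    >>> whole = word_tokenize(para)
    >>> per_sentence = [tok for s in sent_tokenize(para) for tok in word_tokenize(s)]
    >>> whole == per_sentence  # whether the two agree depends on your NLTK version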

+6
2 answers

nltk.tokenize.word_tokenize(text) is just a thin wrapper function that calls the tokenize method of a TreebankWordTokenizer instance, which apparently uses simple regular expressions to parse a sentence.
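You can call that tokenizer directly to see the same behavior; a quick check (assuming the classic Treebank tokenizer described here, exposed as nltk.tokenize.TreebankWordTokenizer):

    >>> from nltk.tokenize import TreebankWordTokenizer
    >>> TreebankWordTokenizer().tokenize("Hello, world.")
    ['Hello', ',', 'world', '.']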

The documentation for this class states:

This tokenizer assumes that the text has already been segmented into sentences. Any periods, apart from those at the end of a string, are assumed to be part of the word they are attached to (e.g. for abbreviations, etc.), and are not separately tokenized.

The underlying tokenize method itself is very simple:

    def tokenize(self, text):
        for regexp in self.CONTRACTIONS2:
            text = regexp.sub(r'\1 \2', text)
        for regexp in self.CONTRACTIONS3:
            text = regexp.sub(r'\1 \2 \3', text)

        # Separate most punctuation
        text = re.sub(r"([^\w\.\'\-\/,&])", r' \1 ', text)

        # Separate commas if they're followed by space.
        # (E.g., don't separate 2,500)
        text = re.sub(r"(,\s)", r' \1', text)

        # Separate single quotes if they're followed by a space.
        text = re.sub(r"('\s)", r' \1', text)

        # Separate periods that come before newline or end of string.
        text = re.sub(r'\. *(\n|$)', ' . ', text)

        return text.split()
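As an aside (my own illustration, not part of the quoted code): the CONTRACTIONS2 list contains patterns for fused forms such as "cannot" and "gonna", which is why those come out as two tokens:

    >>> nltk.tokenize.word_tokenize("I cannot do that.")
    ['I', 'can', 'not', 'do', 'that', '.']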

Basically, what the method normally does is tokenize the period as a separate token if it falls at the end of the string:

    >>> nltk.tokenize.word_tokenize("Hello, world.")
    ['Hello', ',', 'world', '.']

Any periods that fall inside the string are tokenized as part of the word they are attached to, on the assumption that they mark an abbreviation:

    >>> nltk.tokenize.word_tokenize("Hello, world. How are you?")
    ['Hello', ',', 'world.', 'How', 'are', 'you', '?']

As long as this behavior is acceptable, you should be fine.
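If it is not acceptable, a common workaround (a sketch on my part, using the standard nltk.sent_tokenize) is to segment the text into sentences first and tokenize each sentence on its own:

    >>> from nltk.tokenize import sent_tokenize, word_tokenize
    >>> text = "Hello, world. How are you?"
    >>> [tok for sent in sent_tokenize(text) for tok in word_tokenize(sent)]
    ['Hello', ',', 'world', '.', 'How', 'are', 'you', '?']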

+7

Try this type of hack:

    >>> from string import punctuation as punct
    >>> sent = "Mr President, Mr President-in-Office, indeed we know that the MED-TV channel and the newspaper Özgür Politika provide very in-depth information. And we know the subject matter. Does the Council in fact plan also to use these channels to provide information to the Kurds who live in our countries? My second question is this: what means are currently being applied to integrate the Kurds in Europe?"
    >>> # Add spaces before and after punctuation.
    >>> for ch in sent:
    ...     if ch in punct:
    ...         sent = sent.replace(ch, " " + ch + " ")
    >>> # Remove the double spaces that the padding step can introduce.
    >>> sent = " ".join(sent.split())
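A small variation of my own: iterating over the punctuation set instead of over the sentence avoids re-padding characters that occur more than once, and the split/join cleanup at the end stays the same:

    >>> from string import punctuation as punct
    >>> demo = "Hello, world. How are you?"
    >>> for ch in punct:
    ...     demo = demo.replace(ch, " " + ch + " ")
    >>> " ".join(demo.split())
    'Hello , world . How are you ?'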

Then, most probably, the following code is what you need to count the frequencies too =)

    >>> from nltk.tokenize import word_tokenize
    >>> from nltk.probability import FreqDist
    >>> fdist = FreqDist(word.lower() for word in word_tokenize(sent))
    >>> for word in fdist:
    ...     print(word, fdist[word])
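And if you only need the most frequent tokens: in NLTK 3, FreqDist behaves like collections.Counter, so most_common is available (a small self-contained sketch with toy data):

    >>> from nltk.probability import FreqDist
    >>> fdist = FreqDist("the cat sat on the mat".split())
    >>> fdist.most_common(2)  # the two most frequent tokens and their counts
    [('the', 2), ('cat', 1)]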
+1
