Python nltk.sent_tokenize: 'ascii' codec can't decode byte error

Question

I can successfully read the text into a variable, but when I try to tokenize the text, I get this strange error:

 sentences=nltk.sent_tokenize(sample)
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)

I know the error is caused by some special string or character that the tokenizer is not able to read or decode, but how do I get around this? Thanks

+4

user4197202 Nov 30 '14 at 11:53

2 answers

In short, the pos_tag function in NLTK 3 does not work.

The NLTK 2 function works fine.

pip uninstall nltk
pip install http://pypi.python.org/packages/source/n/nltk/nltk-2.0.4.tar.gz

On the other hand, the tagger is pretty bad (apparently "conservatory" is a verb). I wish spaCy worked on Windows.

0

user3297367 Aug 12 '15 at 1:46

shalini · Accepted Answer · 2014-11-30T11:54:13+0000

Try decoding the byte string to Unicode before tokenizing:

 sentences=nltk.sent_tokenize(sample.decode('utf-8'))
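
To illustrate why decoding helps, here is a minimal sketch, assuming Python 3 and UTF-8 input. The helper name `to_text` and the sample bytes are made up for this example and are not part of NLTK. Byte 0xe2 is the first byte of the UTF-8 encoding of characters such as curly quotes, which is why the ASCII codec rejects it.

```python
# Sketch for Python 3; `to_text` is a hypothetical helper, not part of NLTK.

def to_text(raw, encoding="utf-8"):
    """Return text, decoding only when given raw bytes."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw

# Sample bytes invented for illustration: a curly apostrophe (U+2019)
# is stored in UTF-8 as the three bytes e2 80 99.
raw = b"It is Maria\xe2\x80\x99s book. It is new."

# Decoding as ASCII fails at the first non-ASCII byte.
try:
    raw.decode("ascii")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the offending byte (0xe2)

# Decoding as UTF-8 succeeds and yields text that a tokenizer can handle.
text = to_text(raw)
print(error_position, text)
```

With this sample the ASCII decode fails at position 11, the same kind of failure as in the question. The decoded `text` can then be passed to `nltk.sent_tokenize(text)`, provided NLTK and its tokenizer data are installed. If the input is not actually UTF-8, a different `encoding` argument would be needed.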