Python nltk.sent_tokenize: 'ascii' codec can't decode byte error

I can successfully read the text into a variable, but when I try to tokenize it I get this strange error:

 sentences=nltk.sent_tokenize(sample)
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)

I know the error is caused by some special string or character that the tokenizer cannot read or decode, but how do I get around it? Thanks
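For illustration, the failure can be reproduced without NLTK. This is a minimal sketch, assuming the input is a UTF-8 byte string containing a curly quote; the `sample` value here is made up, not the asker's data. It shows that the ASCII codec rejects the first byte above 127 and reports its position.

```python
# Hypothetical input: 11 ASCII bytes followed by a UTF-8 encoded
# left curly quote (U+201C), whose first byte is 0xe2.
sample = b"Some text: \xe2\x80\x9cquoted\xe2\x80\x9d."

try:
    # The ASCII codec only accepts bytes in range(128).
    sample.decode("ascii")
except UnicodeDecodeError as exc:
    position = exc.start                        # index of the offending byte
    failing_byte = sample[exc.start:exc.end]    # the byte that could not be decoded
    reason = exc.reason                         # the codec's explanation
```

In Python 2, an implicit conversion from a byte string to Unicode uses the ASCII codec, which would explain why the error appears inside the tokenizer rather than at read time.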

+4
2 answers

Try decoding the byte string to Unicode before tokenizing:

 sentences=nltk.sent_tokenize(sample.decode('utf-8')) 
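The same idea can be wrapped in a small helper so that input which is already Unicode is not decoded twice. This is only a sketch: `to_text` is a hypothetical name, the sample bytes are made up, and UTF-8 is an assumption about the input encoding.

```python
def to_text(raw, encoding="utf-8"):
    """Return Unicode text, decoding `raw` first if it is a byte string."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw  # already Unicode, leave it untouched


# Hypothetical bytes, e.g. as read from a file opened in binary mode.
sample = b"He said: \xe2\x80\x9cHello there.\xe2\x80\x9d Then he left."
text = to_text(sample)

# With NLTK and its sentence tokenizer data installed, the decoded text
# can then be tokenized:
# sentences = nltk.sent_tokenize(text)
```

If the data is not actually UTF-8, the `decode` call raises its own UnicodeDecodeError, so the real encoding of the source has to be known or detected first.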
+22

In short, the pos_tag function in NLTK 3 does not work, while the NLTK 2 version works fine. To downgrade:

pip uninstall nltk

pip install http://pypi.python.org/packages/source/n/nltk/nltk-2.0.4.tar.gz

On the other hand, the tagger is pretty bad (apparently, “conservatory” is a verb). I want Spike to work on Windows.

0

Source: https://habr.com/ru/post/1207974/

