TypeError: must be unicode, not str in NLTK

Question

I am using Python 2.7, NLTK 3.2.1 and python-crfsuite 0.8.4. I am following this page: http://www.nltk.org/api/nltk.tag.html?highlight=stanford#nltk.tag.stanford.NERTagger for the nltk.tag.crf module.

For starters, I just ran this:

from nltk.tag import CRFTagger
ct = CRFTagger()
train_data = [[('dfd','dfd')]]
ct.train(train_data,"abc")

I also tried this:

f = open("abc","wb")
ct.train(train_data,f)

but I get the following error:

  File "C:\Python27\lib\site-packages\nltk\tag\crf.py", line 129, in <genexpr>
    if all (unicodedata.category(x) in punc_cat for x in token):
TypeError: must be unicode, not str

python nltk crf

Backtrack Jul 15 '16 at 9:20

1 answer

tripleee · Accepted Answer · 2016-07-15T10:14:32+0000

In Python 2, regular quotes '...' or "..." create byte strings. To get Unicode strings, use the prefix u before the string, for example u'dfd'.
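As a rough sketch of why the prefix matters, the snippet below repeats the kind of check shown in the traceback on Unicode tokens. The helper name is_all_punctuation and the PUNC_CAT set are assumptions made for illustration, not NLTK's actual code.

```python
import unicodedata

# Assumed set of Unicode punctuation categories, for illustration only.
PUNC_CAT = {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po"}


def is_all_punctuation(token):
    # Hypothetical helper with the same shape as the check in the traceback.
    # On Python 2, unicodedata.category() rejects byte strings, so the token
    # must be a Unicode string.
    return all(unicodedata.category(ch) in PUNC_CAT for ch in token)


# The u prefix makes these Unicode strings on Python 2
# (it is also accepted, though redundant, on Python 3.3+).
train_data = [[(u'dfd', u'dfd')]]

word, tag = train_data[0][0]
word_is_punct = is_all_punctuation(word)
dots_are_punct = is_all_punctuation(u'...')
```

With training data built this way, the tokens that reach the tagger are already Unicode strings.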

Python 3's open(encoding="utf-8") has been backported to Python 2; use io.open() instead of the built-in open().
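A minimal sketch of reading training data through io.open(), assuming a hypothetical UTF-8 file named tokens.txt with one "word tag" pair per line (the file is created here only so the example is self-contained):

```python
import io
import os
import tempfile

# Hypothetical data file, written first so the example runs on its own.
path = os.path.join(tempfile.mkdtemp(), "tokens.txt")

with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\u00e9 NN\n")

# io.open() decodes while reading, so every line is a Unicode string
# on both Python 2 and Python 3.
with io.open(path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

# Build (word, tag) pairs from each "word tag" line.
pairs = [tuple(line.split()) for line in lines]
```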

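To convert an existing byte string, you can call unicode() on it; decode() with an explicit encoding is usually the better choice.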
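If the data already exists as byte strings, something along these lines could convert it before training. The helper to_text and the UTF-8 encoding are assumptions; use whatever encoding your data actually has.

```python
def to_text(value, encoding="utf-8"):
    # Hypothetical helper: decode byte strings, pass Unicode strings through.
    # On Python 2, bytes is an alias for str, so plain '...' literals match.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Byte-string training data, as produced by plain quotes on Python 2.
raw_data = [[(b"dfd", b"dfd")]]

# Decode every word and tag so each token is a Unicode string.
train_data = [[(to_text(w), to_text(t)) for (w, t) in sent] for sent in raw_data]
```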

For background, see Ned Batchelder's "Pragmatic Unicode": http://nedbatchelder.com/text/unipain.html
