Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I am trying to do the following Kaggle assignment. I am using the gensim package to use word2vec. I can create a model and save it to disk, but when I try to load the file back, I get the error message below.

    -HP-dx2280-MT-GR541AV:~$ python prog_w2v.py
    Traceback (most recent call last):
      File "prog_w2v.py", line 7, in <module>
        models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
      File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
        header = utils.to_unicode(fin.readline())
      File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
        return unicode(text, encoding, errors=errors)
      File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I found a similar question, but I could not solve the problem. My prog_w2v.py is below.

    import gensim
    import time

    start = time.time()
    models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
    end = time.time()
    print end - start, " seconds"

I am creating the model using the code here. It takes about half an hour to create a model, so I cannot run it many times to debug it.

+6
4 answers

You are not loading the file correctly. You should use load() instead of load_word2vec_format(). The latter is for a model that was trained with the original C code and saved in its binary format. You trained your model with Python and did not save it in that binary format, so you can just use the following code, and it should work:

 models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt') 
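For completeness, here is a minimal sketch of the round trip this answer assumes, using gensim's own save()/load() pair. The toy corpus and parameters are only illustrative, and the call style matches the older gensim API used in the question (in gensim 4+ the size argument is called vector_size):

    import gensim

    # Toy corpus, only to make the sketch self-contained.
    sentences = [['first', 'sentence'], ['second', 'sentence']]

    # Train with gensim (in gensim 4+ the 'size' argument is named 'vector_size').
    model = gensim.models.Word2Vec(sentences, size=300, window=10, min_count=1)

    # Save in gensim's own format ...
    model.save('300features_40minwords_10context')

    # ... and load it back with the matching load(), not load_word2vec_format().
    model = gensim.models.Word2Vec.load('300features_40minwords_10context')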
+10

If you saved your model using save(), you should load it with load().

load_word2vec_format() is for a model in the format produced by Google's original word2vec tool, not for a model saved by gensim.
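For example, the pre-trained Google News vectors are distributed in that C binary format and are read with load_word2vec_format(). The file name below is the one Google publishes; adjust the path to wherever you downloaded it:

    from gensim.models import KeyedVectors

    # A file produced by the original word2vec C tool (such as Google's
    # published vectors) is read with load_word2vec_format(), not load().
    google_model = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True)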

+3

If you save your model with:

 model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin') 

then loading it with the load_word2vec_format method will fail. To make it work, you should use:

 wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin') 

The same kind of mismatch occurs in the other direction: if you save a model with:

  model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False) 

and then try to load it with the KeyedVectors.load method, it will also fail. In this situation, use:

 wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False) 
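In other words, the save and load calls come in matched pairs. A self-contained sketch of both round trips, with a toy corpus and illustrative file names:

    from gensim.models import Word2Vec, KeyedVectors

    # Toy model, only to make the sketch runnable.
    model = Word2Vec([['hello', 'world'], ['hello', 'gensim']], min_count=1)

    # Pair 1: gensim's native format -- save() matches load().
    model.wv.save('word2vec.kv')
    wv = KeyedVectors.load('word2vec.kv')

    # Pair 2: the original word2vec format -- save_word2vec_format()
    # matches load_word2vec_format().
    model.wv.save_word2vec_format('word2vec.txt', binary=False)
    wv = KeyedVectors.load_word2vec_format('word2vec.txt', binary=False)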
+3

As the other answers point out, it is important to know how the file was saved, because that determines which method loads it. But you can also simply pass the unicode_errors='ignore' flag to skip over this problem and load the model anyway:

    import gensim

    model = gensim.models.KeyedVectors.load_word2vec_format(file_path, binary=True, unicode_errors='ignore')

By default, this flag is set to 'strict': unicode_errors='strict'.

The documentation explains when such errors occur:

unicode_errors : str, optional – 'strict' by default; a string suitable to be passed as the errors argument to unicode() (Python 2.x) or str() (Python 3.x). If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.

All of the above answers are useful if we can keep track of how each model was saved. But what if we have a bunch of models to load and want one common method for all of them? For that, this flag helps, as in the sketch below.
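For instance, a hypothetical convenience wrapper (the function name and fallback order are my own, not part of gensim) that first tries gensim's native load() and falls back to load_word2vec_format() with unicode_errors='ignore':

    import gensim


    def load_any_word2vec(path, binary=True):
        # Hypothetical helper: try gensim's native format first, then fall
        # back to the word2vec.c format, skipping broken multibyte tokens.
        try:
            return gensim.models.KeyedVectors.load(path)
        except Exception:
            return gensim.models.KeyedVectors.load_word2vec_format(
                path, binary=binary, unicode_errors='ignore')


    # Usage (the path is illustrative):
    # vectors = load_any_word2vec('300features_40minwords_10context.bin')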

I have tested this myself: when I train several models with the original word2vec.c tool and then load them into gensim, some models load successfully and some raise unicode errors. I found this flag useful and convenient.

0
