As other answers point out, how the file was saved matters, because there are several ways to load it. But you can sidestep the problem simply by passing unicode_errors='ignore' when loading the model:
    import gensim
    model = gensim.models.KeyedVectors.load_word2vec_format(file_path, binary=True, unicode_errors='ignore')
By default, this flag is set to 'strict': unicode_errors='strict'.
The documentation explains why such errors occur:
unicode_errors : str, optional, default 'strict'. A string suitable for passing as the errors argument to unicode() (Python 2.x) or str() (Python 3.x). If your source file may contain word tokens truncated in the middle of a multibyte Unicode character (as is common with the original word2vec.c tool), 'ignore' or 'replace' may help.
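To make the documentation's point concrete, here is a minimal sketch in plain Python (no gensim involved) of what happens when a token is cut off in the middle of a multibyte character. The token "café" and the helper try_decode are my own illustrative choices, not anything from gensim:

```python
# "é" is two bytes in UTF-8 (0xC3 0xA9). Dropping the last byte
# simulates a token truncated in the middle of a multibyte character.
token = "café".encode("utf-8")
truncated = token[:-1]  # b'caf\xc3'


def try_decode(data, errors):
    """Hypothetical helper: decode as UTF-8 with the given error handler."""
    try:
        return data.decode("utf-8", errors=errors)
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc.reason}"


strict_result = try_decode(truncated, "strict")    # raises internally, returns the error text
ignore_result = try_decode(truncated, "ignore")    # silently drops the dangling byte
replace_result = try_decode(truncated, "replace")  # substitutes U+FFFD for the dangling byte

print(repr(strict_result))
print(repr(ignore_result))
print(repr(replace_result))
```

Note that 'ignore' quietly shortens the token, so an affected word in the vocabulary may not match its original spelling.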
All of the above answers are useful if we can actually keep track of how each model was saved. But what if we have a batch of models to load and want one common method for all of them? For that, we can use the flag above.
I have tested this myself: I trained several models with the original word2vec.c tool, and when loading them into gensim, some loaded successfully while others raised Unicode errors. The flag proved useful and convenient in that case.
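One possible shape for such a common loading method is sketched below. It is a variation on the approach above rather than anything gensim provides: it tries strict decoding first and only falls back to 'ignore' if a UnicodeDecodeError is raised. The function name load_vectors and its loader parameter are hypothetical; loader exists only so the function can be exercised without a real model file.

```python
def load_vectors(path, binary=True, loader=None):
    """Hypothetical helper: load word2vec-format vectors, falling back to
    unicode_errors='ignore' when strict decoding fails."""
    if loader is None:
        # Imported lazily so the function can be defined without gensim installed.
        from gensim.models import KeyedVectors
        loader = KeyedVectors.load_word2vec_format
    try:
        return loader(path, binary=binary, unicode_errors="strict")
    except UnicodeDecodeError:
        # Some tokens were probably truncated mid-character; skip the bad bytes.
        return loader(path, binary=binary, unicode_errors="ignore")


# Usage (the path is a placeholder):
# model = load_vectors("path/to/vectors.bin")
```

Trying 'strict' first means models that load cleanly are not touched by the lossy handler; passing unicode_errors='ignore' unconditionally, as in the answer above, is simpler and reads the file only once.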