UTF-8 string decoding in Python

I am writing a web crawler in Python, and it involves taking headlines from websites.

One of the headlines should have read: "And the Hip’s coming, too"

But instead it said: "And the Hipâ€™s coming, too"

What is wrong here?

+4
2 answers

You need to decode the source text correctly. Most likely it is encoded as UTF-8, not ASCII.

Since you are not providing any context or code for your question, it is impossible to give a direct answer.

I suggest you read up on how Unicode and character encodings work in Python:

http://docs.python.org/2/howto/unicode.html
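To illustrate the decoding point, here is a minimal sketch in Python 3 (an assumption; the question does not state a version). The `raw` bytes are a hypothetical stand-in for what a crawler would receive, not the asker's actual data.

```python
# Hypothetical bytes standing in for a page body fetched by the crawler.
# \u2019 is the curly apostrophe in the expected headline.
raw = "And the Hip\u2019s coming, too".encode("utf-8")

# Decoding with the right codec recovers the headline.
good = raw.decode("utf-8")

# Decoding the same bytes as windows-1252 produces the mangled text:
# the three UTF-8 bytes of the apostrophe become three separate characters.
bad = raw.decode("windows-1252")

# ASCII cannot decode these bytes at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(good)
print(bad)
print(ascii_ok)
```

If the crawler decodes with the wrong codec, or lets a library guess, the second result is what ends up in the headline.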

+6

This is an encoding error. If text is a Unicode string, this should fix it:

text.encode("windows-1252").decode("utf-8") 

If it is a plain (byte) string, you will need an extra step:

 text.decode("utf-8").encode("windows-1252").decode("utf-8") 

Both of them will give you a Unicode string.
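The snippets above use Python 2 idioms. On Python 3, where `str` is already Unicode, the same round trip could be wrapped roughly like this. `fix_mojibake` is a hypothetical helper name, and the sketch assumes the damage was exactly "UTF-8 bytes decoded as windows-1252".

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as windows-1252."""
    try:
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mangled in this particular way; leave it alone.
        return text


# The mangled headline, written with escapes to survive copy and paste.
mangled = "And the Hip\u00e2\u20ac\u2122s coming, too"
fixed = fix_mojibake(mangled)
print(fixed)

# Text that is already correct passes through unchanged.
untouched = fix_mojibake("And the Hip\u2019s coming, too")
```

Falling back to the input on error keeps the helper safe to run over headlines that were decoded correctly in the first place. The cleaner fix is still to decode the response bytes as UTF-8 at the source.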

By the way, to find out how a piece of text like this was mangled by encoding problems, you can use chardet:

 >>> import chardet
 >>> chardet.detect(u"And the Hipâ€™s coming, too")
 {'confidence': 0.5, 'encoding': 'windows-1252'}
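If installing chardet is not an option, a rough standard-library alternative is to try a shortlist of codecs and see which one round-trips to valid UTF-8. This is not how chardet works internally. The `CANDIDATES` list and the `guess_wrong_codec` name are assumptions made for this sketch (Python 3).

```python
# Assumed shortlist of codecs that commonly cause this kind of damage.
CANDIDATES = ["windows-1252", "latin-1", "cp437"]


def guess_wrong_codec(mangled, candidates=CANDIDATES):
    """Return (codec, repaired) pairs for codecs that plausibly caused the damage."""
    hits = []
    for codec in candidates:
        try:
            repaired = mangled.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this codec cannot explain the mangled text
        if repaired != mangled:
            hits.append((codec, repaired))
    return hits


hits = guess_wrong_codec("And the Hip\u00e2\u20ac\u2122s coming, too")
print(hits)
```

For the headline in the question only windows-1252 survives the round trip, which agrees with the chardet result.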
+12

Source: https://habr.com/ru/post/1442615/

