Python encoding issue: degree sign and others

I use BeautifulSoup to scrape data from a web page. I want to compare the website data with text stored in a .txt document. However, it looks like I'm running into encoding problems.

The website contains the text "heated oven to 400 °". The text looks the same in "view source" (no HTML entities).

The website is read using BeautifulSoup:

source = "my url".read()
....
soup = BeautifulSoup(source)

The text document was created as a new file with the encoding set to "Encode in UTF-8 without BOM". Then I copied "heated oven to 400 °" from the website into the text document and saved it.

The text file is read with:

import codecs
f = codecs.open('myfilename', encoding='utf-8')

When I compare the two strings, they are not equal, but I want them to be.

To find out what is going on, I split both texts in Eclipse and inspected the variables in debug mode. The degree sign from BeautifulSoup is displayed as \xc2\xb0, while the degree sign from the text document is displayed as just \xb0.

Why does this happen, and how do I fix it? I have this problem with many special characters, so I need a general solution. In addition, I will be copying data from several sites into the text document.

1 answer

Beautiful Soup doesn't seem to be detecting the encoding correctly. You can give it a hint by replacing BeautifulSoup(source) with BeautifulSoup(source, fromEncoding='UTF-8'). More options and details are in the "Beautiful Soup Gives You Unicode, Dammit" section of the Beautiful Soup documentation.
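The same idea can be tried without relying on any encoding detection at all: decode the downloaded bytes explicitly and compare the result with the text read from the file. The following is only a minimal Python 3 sketch; `page_bytes` and `file_text` are hypothetical stand-ins for the downloaded page and the contents of the .txt file, and it assumes the site really serves UTF-8. (In Beautiful Soup 4 the keyword is spelled `from_encoding` rather than `fromEncoding`.)

```python
# Hypothetical stand-ins: bytes as they would arrive from the web server,
# and text as it would come back from reading the UTF-8 file.
page_bytes = "heated oven to 400 \u00b0".encode("utf-8")
file_text = "heated oven to 400 \u00b0"

# Decode explicitly instead of letting a library guess the encoding.
page_text = page_bytes.decode("utf-8")

# Both values are now Unicode strings, so the comparison is meaningful.
strings_match = (page_text == file_text)
print(strings_match)
```

Once both sides are Unicode strings decoded with the correct encoding, the comparison no longer depends on how each source happened to store the degree sign.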

The sequence '\xc2\xb0' is what you get when the UTF-8 encoding of the degree sign (U+00B0) is mistakenly decoded as Windows-1252, the encoding Beautiful Soup falls back on as a last resort.
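To see this mix-up in isolation, here is a small Python 3 sketch that encodes the degree sign as UTF-8 and then decodes the bytes both ways. The `repair` helper is hypothetical, not part of Beautiful Soup, and it only undoes a single round of "UTF-8 read as Windows-1252" damage; it is meant as an illustration rather than a general fix.

```python
degree = "\u00b0"                     # DEGREE SIGN, U+00B0
utf8_bytes = degree.encode("utf-8")   # two bytes: 0xC2 0xB0

# Wrong: reading the UTF-8 bytes as Windows-1252 yields two characters.
garbled = utf8_bytes.decode("windows-1252")

# Right: reading them as UTF-8 gives back the single degree sign.
correct = utf8_bytes.decode("utf-8")


def repair(text):
    """Hypothetical helper: undo one round of UTF-8-as-Windows-1252 damage."""
    try:
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in this particular way; leave it alone.
        return text


print(repr(garbled), repr(correct), repr(repair(garbled)))
```

Fixing the decoding at the source, as suggested above, is preferable to repairing strings afterwards, because a repair step cannot always tell damaged text from text that legitimately contains those characters.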


Source: https://habr.com/ru/post/1393691/

