Python encoding issue: degree sign and others

Question

I use BeautifulSoup to scrape data from a web page. I want to compare the website data with text stored in a .txt document. However, it looks like I'm having encoding problems.

The website has the text "heated oven to 400 °". The text also looks like this in "view source" (no HTML entities).

The website is read using BeautifulSoup:

 source = "my url".read()
 ....
 soup = BeautifulSoup(source)

I created the text document as a new file with the encoding option "Encode in UTF-8 without BOM". Then I copied "heated oven to 400 °" from the website into the text document and saved it.

The text file is read with

 f = codecs.open('myfilename', encoding='utf-8')

When I compare two strings, they are not equal, but I want them to be.

To find out what is going on: in Eclipse, I split the two texts and, looking at the variables in debug mode, I see that the degree sign from BeautifulSoup is displayed as \xc2\xb0. The degree sign from the text document is displayed simply as \xb0.

Why does this happen, and how can I fix it? I have this problem with many special characters, so I need a general solution. In addition, I will be copying data from several sites into the text document.

python encoding beautifulsoup

user984003 Jan 30 '12 at 5:34

1 answer

minopret · Accepted Answer · 2012-01-30T06:02:52+0000

Beautiful Soup does not seem to be detecting the encoding correctly. You can give it a hint by replacing BeautifulSoup(source) with BeautifulSoup(source, fromEncoding='UTF-8'). More options and information are available online in the Beautiful Soup documentation, under "Beautiful Soup Gives You Unicode, Dammit".
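
A related approach, sketched here in Python 3 on the assumption that the page really is UTF-8, is to decode the raw bytes yourself so that no encoding detection is involved, and then hand the resulting text to the parser. The sample bytes and variable names below are made up for illustration. (In the newer bs4 package the keyword is spelled from_encoding rather than fromEncoding.)

```python
# Hypothetical sample: the bytes a server would send for a UTF-8 page.
raw_page = "<p>heated oven to 400 \u00b0</p>".encode("utf-8")

# Decode explicitly instead of letting a parser guess the encoding.
page_text = raw_page.decode("utf-8")

# Text as it would come back from a UTF-8 file opened with encoding='utf-8'.
file_text = "heated oven to 400 \u00b0"

# Both sides are now text with a single U+00B0, so the comparison works.
found = file_text in page_text
```

If the declared encoding is wrong for a given site, the decode call either raises UnicodeDecodeError or produces wrong characters, so the encoding would have to be chosen per site.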

The byte sequence '\xc2\xb0' is what you get when the UTF-8 encoding of Unicode's U+00B0 DEGREE SIGN is mistaken for Windows-1252, the encoding Beautiful Soup falls back to as a last resort.
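
The mix-up can be reproduced with the standard library alone. This is a small Python 3 sketch, not what Beautiful Soup does internally; the variable names are illustrative.

```python
degree = "\u00b0"                       # U+00B0 DEGREE SIGN

# UTF-8 stores the degree sign as two bytes.
utf8_bytes = degree.encode("utf-8")     # b'\xc2\xb0'

# Reading those two bytes as Windows-1252 yields two characters,
# U+00C2 followed by U+00B0, which is the '\xc2\xb0' seen in the debugger.
misread = utf8_bytes.decode("cp1252")

# The misreading can be reversed when every character maps back to a byte.
repaired = misread.encode("cp1252").decode("utf-8")
```

The round trip in the last line only helps after the damage is done; passing the correct encoding up front avoids it entirely.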
