Correctly detect the encoding in every case when using Beautiful Soup

I'm working on improving character-encoding support for a Python IRC bot that fetches the titles of pages whose URLs are posted in the channel.

The process I currently use is as follows:

  • Requests:

    r = requests.get(url, headers={ 'User-Agent': '...' })

  • Beautiful Soup:

    soup = bs4.BeautifulSoup(r.text, from_encoding=r.encoding)

  • Title extraction:

    title = soup.title.string.replace('\n', ' ').replace(...) etc.

The from_encoding=r.encoding assignment is a good start, because it lets us honor the charset from the Content-Type header when parsing the page.
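For illustration (example.com is just a stand-in URL here), requests derives r.encoding directly from that header:

    import requests

    r = requests.get('https://example.com/', headers={ 'User-Agent': '...' })
    print(r.headers.get('content-type'))  # e.g. 'text/html; charset=UTF-8'
    print(r.encoding)                     # 'UTF-8', parsed from that header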

Where it falls down is on pages that declare their encoding in <meta http-equiv=… charset=…> or <meta charset="…"> instead of (or on top of) a charset in the Content-Type header.

I can see a couple of possible approaches to fixing this.

TL;DR: Is there a Right Way™ to make Beautiful Soup decode HTML pages from the web correctly, honoring the declared charset the way a browser would?


It sounds like you want to prefer an encoding declared in the page itself over one stated in the HTTP headers. UnicodeDammit (used internally by BeautifulSoup) does it the other way around if you simply hand it the text from requests. You can fix that by extracting the declared encoding from the document yourself and giving it priority over the header encoding. Roughly (untested!):

import bs4
import requests
from bs4.dammit import EncodingDetector

r = requests.get(url, headers={ 'User-Agent': '...' })

# Only look for an HTML-style <meta> declaration if the response is HTML.
content_type = r.headers.get('content-type', '')
is_html = content_type.split(';', 1)[0].lower().startswith('text/html')
declared_encoding = EncodingDetector.find_declared_encoding(r.content, is_html=is_html)

# Prefer the in-document declaration, falling back to the header-derived
# encoding. Note that bs4 ignores from_encoding for str markup, so pass
# the raw bytes (r.content) rather than r.text.
soup = bs4.BeautifulSoup(r.content, 'html.parser',
                         from_encoding=declared_encoding or r.encoding)

title = soup.title...
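For reference, a quick sketch of what find_declared_encoding picks up (the koi8-r markup below is made up for illustration); it recognizes both declaration styles mentioned in the question:

    from bs4.dammit import EncodingDetector

    html4 = b'<meta http-equiv="Content-Type" content="text/html; charset=koi8-r">'
    html5 = b'<meta charset="koi8-r">'

    print(EncodingDetector.find_declared_encoding(html4, is_html=True))  # koi8-r
    print(EncodingDetector.find_declared_encoding(html5, is_html=True))  # koi8-r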

Unlike ftfy, which repairs text after it has already been (mis)decoded, Unicode, Dammit tackles the decoding step itself (see bs4/dammit.py). It honors an encoding declared in a <meta> tag, sniffs byte-order marks, and falls back to charset detection when all else fails.
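A minimal sketch of that behavior (the koi8-r byte string is an assumption made up for this example):

    from bs4.dammit import UnicodeDammit

    # The only encoding clue is the <meta> declaration inside the bytes.
    raw = '<meta charset="koi8-r"><title>Привет</title>'.encode('koi8-r')

    dammit = UnicodeDammit(raw, is_html=True)
    print(dammit.original_encoding)  # 'koi8-r'
    print(dammit.unicode_markup)     # the correctly decoded markup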

The problem with passing r.text is that when the Content-Type header carries no charset, requests falls back to ISO 8859-1 for text/* responses, so by the time Unicode, Dammit gets the markup it is already unicode, decoded with a guess, and there is nothing left for it to detect!
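To make the fallback concrete (the URL is hypothetical; any page served as text/html without a charset behaves this way):

    import requests

    r = requests.get('http://example.invalid/page.html')  # hypothetical page
    print(r.headers.get('content-type'))  # 'text/html' -- no charset given
    print(r.encoding)  # 'ISO-8859-1', the RFC 2616 default requests applies
    # r.text has already been decoded with that guess; the untouched bytes
    # are still available as r.content.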

Give it r.content, the raw bytes, instead:

  • r = requests.get(url, headers={ 'User-Agent': '...' })
  • soup = bs4.BeautifulSoup(r.content, 'html.parser')
  • title = soup.title.string.replace('\n', ' ').replace(...) etc.

The only drawback I can see is that a charset from the Content-Type header is then ignored completely. You could point Unicode, Dammit at it by passing from_encoding=r.encoding to BeautifulSoup, but that encoding is tried first and wins, so Unicode, Dammit's own detection no longer gets a say.
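One possible compromise, sketched here as an illustration (the header_declared check is my own addition, not part of this answer): trust r.encoding only when the server actually named a charset, and otherwise hand Unicode, Dammit the raw bytes with no hint:

    import bs4
    import requests

    r = requests.get(url, headers={ 'User-Agent': '...' })

    # Did the header actually name a charset, or is r.encoding just the
    # ISO 8859-1 default for text/*?
    header_declared = 'charset' in r.headers.get('content-type', '').lower()

    soup = bs4.BeautifulSoup(
        r.content, 'html.parser',
        from_encoding=r.encoding if header_declared else None,
    )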
