Correctly detect the encoding in every case when using Beautiful Soup

I'm working on improving character-encoding support for a Python IRC bot that fetches the titles of pages whose URLs are posted in the channel.

The process I currently use is as follows:

  • Requests:

    r = requests.get(url, headers={ 'User-Agent': '...' })

  • Beautiful Soup:

    soup = bs4.BeautifulSoup(r.text, from_encoding=r.encoding)

  • Title extraction:

    title = soup.title.string.replace('\n', ' ').replace(...) etc.

The from_encoding=r.encoding assignment is a good start, because it lets us honor the charset from the Content-Type header when parsing the page.
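For illustration (example.com is just a stand-in URL here), requests derives r.encoding directly from that header:

    import requests

    r = requests.get('https://example.com/', headers={ 'User-Agent': '...' })
    print(r.headers.get('content-type'))  # e.g. 'text/html; charset=UTF-8'
    print(r.encoding)                     # 'UTF-8', parsed from that header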

Where it falls down is on pages that declare their encoding in <meta http-equiv=… charset=…> or <meta charset="…"> instead of (or on top of) a charset in the Content-Type header.

I can see a couple of possible approaches to fixing this.

TL;DR: Is there a Right Way™ to make Beautiful Soup decode HTML pages from the web correctly, honoring the declared charset the way a browser would?


It sounds like you want to prefer an encoding declared in the page itself over one stated in the HTTP headers. UnicodeDammit (used internally by BeautifulSoup) does it the other way around if you simply hand it the text from requests. You can fix that by extracting the declared encoding from the document yourself and giving it priority over the header encoding. Roughly (untested!):

import bs4
import requests
from bs4.dammit import EncodingDetector

r = requests.get(url, headers={ 'User-Agent': '...' })

# Only look for an HTML-style <meta> declaration if the response is HTML.
content_type = r.headers.get('content-type', '')
is_html = content_type.split(';', 1)[0].lower().startswith('text/html')
declared_encoding = EncodingDetector.find_declared_encoding(r.content, is_html=is_html)

# Prefer the in-document declaration, falling back to the header-derived
# encoding. Note that bs4 ignores from_encoding for str markup, so pass
# the raw bytes (r.content) rather than r.text.
soup = bs4.BeautifulSoup(r.content, 'html.parser',
                         from_encoding=declared_encoding or r.encoding)

title = soup.title...
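For reference, a quick sketch of what find_declared_encoding picks up (the koi8-r markup below is made up for illustration); it recognizes both declaration styles mentioned in the question:

    from bs4.dammit import EncodingDetector

    html4 = b'<meta http-equiv="Content-Type" content="text/html; charset=koi8-r">'
    html5 = b'<meta charset="koi8-r">'

    print(EncodingDetector.find_declared_encoding(html4, is_html=True))  # koi8-r
    print(EncodingDetector.find_declared_encoding(html5, is_html=True))  # koi8-r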

Unlike ftfy, which repairs text after it has already been (mis)decoded, Unicode, Dammit tackles the decoding step itself (see bs4/dammit.py). It honors an encoding declared in a <meta> tag, sniffs byte-order marks, and falls back to charset detection when all else fails.
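A minimal sketch of that behavior (the koi8-r byte string is an assumption made up for this example):

    from bs4.dammit import UnicodeDammit

    # The only encoding clue is the <meta> declaration inside the bytes.
    raw = '<meta charset="koi8-r"><title>Привет</title>'.encode('koi8-r')

    dammit = UnicodeDammit(raw, is_html=True)
    print(dammit.original_encoding)  # 'koi8-r'
    print(dammit.unicode_markup)     # the correctly decoded markup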

The problem with passing r.text is that when the Content-Type header carries no charset, requests falls back to ISO 8859-1 for text/* responses, so by the time Unicode, Dammit gets the markup it is already unicode, decoded with a guess, and there is nothing left for it to detect!
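To make the fallback concrete (the URL is hypothetical; any page served as text/html without a charset behaves this way):

    import requests

    r = requests.get('http://example.invalid/page.html')  # hypothetical page
    print(r.headers.get('content-type'))  # 'text/html' -- no charset given
    print(r.encoding)  # 'ISO-8859-1', the RFC 2616 default requests applies
    # r.text has already been decoded with that guess; the untouched bytes
    # are still available as r.content.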

Give it r.content, the raw bytes, instead:

  • r = requests.get(url, headers={ 'User-Agent': '...' })
  • soup = bs4.BeautifulSoup(r.content, 'html.parser')
  • title = soup.title.string.replace('\n', ' ').replace(...) etc.

The only drawback I can see is that a charset from the Content-Type header is then ignored completely. You could point Unicode, Dammit at it by passing from_encoding=r.encoding to BeautifulSoup, but that encoding is tried first and wins, so Unicode, Dammit's own detection no longer gets a say.
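One possible compromise, sketched here as an illustration (the header_declared check is my own addition, not part of this answer): trust r.encoding only when the server actually named a charset, and otherwise hand Unicode, Dammit the raw bytes with no hint:

    import bs4
    import requests

    r = requests.get(url, headers={ 'User-Agent': '...' })

    # Did the header actually name a charset, or is r.encoding just the
    # ISO 8859-1 default for text/*?
    header_declared = 'charset' in r.headers.get('content-type', '').lower()

    soup = bs4.BeautifulSoup(
        r.content, 'html.parser',
        from_encoding=r.encoding if header_declared else None,
    )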
