Python encoding detection library

This is related to my question here.

I process tons of text (mostly HTML and XML) received over HTTP. I am looking for a Python library that can do smart encoding detection based on different strategies and convert the text to Unicode using the best available guess at the character encoding.

I found that chardet does automatic detection very well. However, detecting everything automatically is a problem, because it is SLOW and very much against all the standards. Per the chardet FAQ, I do not want to ignore the standards.

From the same FAQ, here is a list of places where I want to look for an encoding:

  • charset parameter in the HTTP Content-type header.
  • <meta http-equiv="content-type"> element in the <head> of the web page, for HTML documents.
  • encoding attribute in the XML prolog, for XML documents.
  • Automatically detect character encoding as a last resort.

Basically, I want to be able to look in all of these places and resolve conflicting information automatically.
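If I had to write it myself, the logic would look roughly like the sketch below. This is only illustrative: the function name, the regular expressions, and the 1 KB sniffing window are my own choices, with chardet as the last-resort detector.

```python
import re
import chardet  # statistical detection, used only as a last resort

def guess_encoding(raw_bytes, content_type_header=None):
    """Illustrative sketch: try explicit sources first, chardet last."""
    # 1. charset parameter of the HTTP Content-type header
    if content_type_header:
        m = re.search(r'charset=["\']?([\w.-]+)', content_type_header, re.I)
        if m:
            return m.group(1)
    head = raw_bytes[:1024]  # sniff only the start of the document
    # 2. <meta ... charset=...> in the <head> of an HTML document
    m = re.search(rb'<meta[^>]+charset=["\']?([\w.-]+)', head, re.I)
    if m:
        return m.group(1).decode('ascii')
    # 3. encoding attribute in the XML prolog
    m = re.search(rb'<\?xml[^>]*encoding=["\']([\w.-]+)["\']', head, re.I)
    if m:
        return m.group(1).decode('ascii')
    # 4. automatic detection as a last resort (slow)
    return chardet.detect(raw_bytes)['encoding']

# text = raw_bytes.decode(guess_encoding(raw_bytes, headers.get('content-type')), 'replace')
```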

Is there such a library out there, or do I need to write it myself?

2 answers

BeautifulSoup (the HTML parser) includes the UnicodeDammit class, which does just that. Take a look and see if you like it.
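A minimal sketch of basic usage, assuming BeautifulSoup 4, where the class lives in bs4 and exposes unicode_markup and original_encoding (the old BeautifulSoup 3 API uses different import paths and attribute names):

```python
from bs4 import UnicodeDammit

raw = b"Sacr\xc3\xa9 bleu!"       # example bytes fetched over HTTP
dammit = UnicodeDammit(raw)

print(dammit.unicode_markup)      # the decoded Unicode text
print(dammit.original_encoding)   # the encoding it settled on, e.g. 'utf-8'
```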


BeautifulSoup has UnicodeDammit, which in turn uses chardet.

chardet itself is quite useful for the general case (determining the encoding of a piece of text), but slow, as you say. UnicodeDammit adds extra functionality on top of chardet, in particular the ability to look for an encoding explicitly specified in an XML encoding declaration.

As for the HTTP Content-type header, I think you would have to parse it yourself to extract the charset parameter, and then pass that to UnicodeDammit via the fromEncoding parameter.

As for resolving conflicts, UnicodeDammit gives precedence to an explicitly specified encoding (provided that encoding does not produce errors). See the docs for more details.
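A sketch of feeding the HTTP charset in as a suggestion, assuming BeautifulSoup 4's bs4.UnicodeDammit, whose second argument is a list of suggested (override) encodings; the answer's fromEncoding parameter belongs to the old BeautifulSoup 3 constructor, and the helper name here is made up:

```python
from bs4 import UnicodeDammit

def to_unicode(raw_bytes, http_charset=None):
    # Pass the charset from the HTTP header (if any) as a suggested encoding.
    # UnicodeDammit tries the suggestions first and falls back to sniffing the
    # document / chardet if they produce errors.
    suggestions = [http_charset] if http_charset else []
    dammit = UnicodeDammit(raw_bytes, suggestions)
    return dammit.unicode_markup, dammit.original_encoding
```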


Source: https://habr.com/ru/post/1301980/

