Python encoding detection library

This is related to my question here.

I process tons of text (mostly HTML and XML) received over HTTP. I am looking for a Python library that can do smart encoding detection based on different strategies and convert the text to Unicode using the best available guess at the character encoding.

I found that chardet does automatic detection very well. However, detecting everything automatically is a problem, because it is SLOW and very much against all the standards. Per the chardet FAQ, I do not want to ignore the standards.

From the same FAQ, here is a list of places where I want to look for an encoding:

  • charset parameter in the HTTP Content-type header.
  • <meta http-equiv="content-type"> element in the <head> of the web page, for HTML documents.
  • encoding attribute in the XML prolog, for XML documents.
  • Automatically detect character encoding as a last resort.

Basically, I want to be able to look in all of these places and resolve conflicting information automatically.
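If I had to write it myself, the logic would look roughly like the sketch below. This is only illustrative: the function name, the regular expressions, and the 1 KB sniffing window are my own choices, with chardet as the last-resort detector.

```python
import re
import chardet  # statistical detection, used only as a last resort

def guess_encoding(raw_bytes, content_type_header=None):
    """Illustrative sketch: try explicit sources first, chardet last."""
    # 1. charset parameter of the HTTP Content-type header
    if content_type_header:
        m = re.search(r'charset=["\']?([\w.-]+)', content_type_header, re.I)
        if m:
            return m.group(1)
    head = raw_bytes[:1024]  # sniff only the start of the document
    # 2. <meta ... charset=...> in the <head> of an HTML document
    m = re.search(rb'<meta[^>]+charset=["\']?([\w.-]+)', head, re.I)
    if m:
        return m.group(1).decode('ascii')
    # 3. encoding attribute in the XML prolog
    m = re.search(rb'<\?xml[^>]*encoding=["\']([\w.-]+)["\']', head, re.I)
    if m:
        return m.group(1).decode('ascii')
    # 4. automatic detection as a last resort (slow)
    return chardet.detect(raw_bytes)['encoding']

# text = raw_bytes.decode(guess_encoding(raw_bytes, headers.get('content-type')), 'replace')
```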

Is there such a library out there, or do I need to write it myself?

2 answers

BeautifulSoup (the HTML parser) includes the UnicodeDammit class, which does just that. Take a look and see if you like it.
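A minimal sketch of basic usage, assuming BeautifulSoup 4, where the class lives in bs4 and exposes unicode_markup and original_encoding (the old BeautifulSoup 3 API uses different import paths and attribute names):

```python
from bs4 import UnicodeDammit

raw = b"Sacr\xc3\xa9 bleu!"       # example bytes fetched over HTTP
dammit = UnicodeDammit(raw)

print(dammit.unicode_markup)      # the decoded Unicode text
print(dammit.original_encoding)   # the encoding it settled on, e.g. 'utf-8'
```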


BeautifulSoup has UnicodeDammit, which in turn uses chardet.

chardet itself is quite useful for the general case (determining the encoding of a piece of text), but slow, as you say. UnicodeDammit adds extra functionality on top of chardet, in particular the ability to look for an encoding explicitly specified in an XML encoding declaration.

As for the HTTP Content-type header, I think you would have to parse it yourself to extract the charset parameter, and then pass that to UnicodeDammit via the fromEncoding parameter.

As for resolving conflicts, UnicodeDammit gives precedence to an explicitly specified encoding (provided that encoding does not produce errors). See the docs for more details.
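A sketch of feeding the HTTP charset in as a suggestion, assuming BeautifulSoup 4's bs4.UnicodeDammit, whose second argument is a list of suggested (override) encodings; the answer's fromEncoding parameter belongs to the old BeautifulSoup 3 constructor, and the helper name here is made up:

```python
from bs4 import UnicodeDammit

def to_unicode(raw_bytes, http_charset=None):
    # Pass the charset from the HTTP header (if any) as a suggested encoding.
    # UnicodeDammit tries the suggestions first and falls back to sniffing the
    # document / chardet if they produce errors.
    suggestions = [http_charset] if http_charset else []
    dammit = UnicodeDammit(raw_bytes, suggestions)
    return dammit.unicode_markup, dammit.original_encoding
```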


Source: https://habr.com/ru/post/1301980/

