This is somewhat related to my question here.
I process a lot of text (mostly HTML and XML) fetched over HTTP. I am looking for a Python library that can do smart encoding detection based on different strategies and convert the text to Unicode using the best available guess at the character encoding.
I found that chardet does automatic detection very well. However, auto-detecting everything is a problem, because it is SLOW and goes against all the standards. Per the chardet FAQ, I do not want to ignore the standards.
From the same FAQ, here is a list of places where I want to look for an encoding:
- charset parameter in the HTTP Content-Type header.
- <meta http-equiv="content-type"> element in the <head> of the page, for HTML documents.
- encoding attribute in the XML prolog, for XML documents.
- Auto-detect the character encoding as a last resort.
Basically, I want to be able to look in all of these places and automatically deal with conflicting information, as in the sketch below.
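To make this concrete, here is a rough sketch of the priority order I have in mind. The function name `sniff_encoding` is just my own placeholder, the regexes are deliberately simplistic, and only chardet is a real third-party dependency:

```python
import re
import chardet  # pip install chardet


def sniff_encoding(raw_bytes, content_type_header=None):
    """Guess the encoding of raw_bytes, trying declared encodings first
    and falling back to chardet's statistical detection."""
    # 1. charset parameter in the HTTP Content-Type header
    if content_type_header:
        m = re.search(r'charset=["\']?([\w.-]+)', content_type_header, re.I)
        if m:
            return m.group(1)

    # Declarations are supposed to appear near the start of the document.
    head = raw_bytes[:1024]

    # 2. <meta http-equiv="content-type" content="...; charset=..."> (HTML)
    m = re.search(rb'charset=["\']?([\w.-]+)', head, re.I)
    if m:
        return m.group(1).decode('ascii', 'ignore')

    # 3. encoding attribute in the XML prolog: <?xml version="1.0" encoding="..."?>
    m = re.search(rb'<\?xml[^>]*encoding=["\']([\w.-]+)["\']', head, re.I)
    if m:
        return m.group(1).decode('ascii', 'ignore')

    # 4. Last resort: slow statistical auto-detection (may return None)
    return chardet.detect(raw_bytes)['encoding']
```

In practice I would call this with the raw response body and the Content-Type header of the HTTP response, and fall back to something like UTF-8 with `errors="replace"` if nothing matches. What it does not do is reconcile conflicting declarations, which is exactly the part I would like a library to handle.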
Is there such a library out there, or do I need to write it myself?