How do I convert web content to a consistent character set when crawling the web?

I have done a lot of research on this and a lot of testing.

As I understand it, the HTTP Content-Type header is only set if the web server is configured for it, and it may default to a particular encoding even if the developers never intended that. The meta tag is only set if the developer chose to add it to the markup, although some frameworks add it automatically (which is a problem if the developer did not take that into account).

I found that when they are set at all, they often contradict each other, e.g. the HTTP header says the page is iso-8859-1 while the meta tag claims windows-1252. I could assume that one overrides the other (probably the meta tag), but that seems rather unreliable. Very few developers seem to think about this when working with their data, so dynamically generated sites often mix encodings, or serve content in an encoding they never intended because the data comes out of their database in something else.

My conclusion was this:

  • Detect the encoding of each page with mb_detect_encoding().
  • If that fails, fall back to the meta tag ( http-equiv="Content-Type" ... ).
  • If there is no meta tag, fall back to the HTTP header ( Content-Type ).
  • If there is no HTTP Content-Type header, assume UTF-8.
  • Finally, convert the document with mb_convert_encoding() and then strip it down to the content I want. (I have deliberately left the target encoding out of this question to keep that debate separate.) A sketch of this whole chain follows below.

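To make the steps above concrete, here is a minimal sketch of that chain in PHP. The helper names (fetch_page, charset_from_meta, charset_from_headers, to_utf8), the candidate list passed to mb_detect_encoding(), the regular expressions used to find the meta tag, and UTF-8 as the target encoding are all illustrative choices of mine, not fixed parts of the approach.

 <?php
 // Minimal sketch of the chain described above: detect, then fall back to the
 // meta tag, then to the HTTP header, then assume UTF-8, then convert.

 function fetch_page(string $url): array
 {
     // file_get_contents() populates $http_response_header with the raw
     // response headers when the http:// wrapper is used.
     $body = file_get_contents($url);
     return [$body === false ? '' : $body, $http_response_header ?? []];
 }

 function charset_from_meta(string $html): ?string
 {
     // <meta http-equiv="Content-Type" content="text/html; charset=...">
     if (preg_match('/http-equiv=["\']?content-type["\']?[^>]*charset=([\w-]+)/i', $html, $m)) {
         return $m[1];
     }
     // HTML5 form: <meta charset="...">
     if (preg_match('/<meta\s+charset=["\']?([\w-]+)/i', $html, $m)) {
         return $m[1];
     }
     return null;
 }

 function charset_from_headers(array $headers): ?string
 {
     foreach ($headers as $header) {
         if (preg_match('/^content-type:.*charset=([\w-]+)/i', $header, $m)) {
             return $m[1];
         }
     }
     return null;
 }

 function to_utf8(string $url): string
 {
     [$html, $headers] = fetch_page($url);

     // Step 1: let mbstring guess, restricted to a few candidates, strict mode on.
     $encoding = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true)
         ?: charset_from_meta($html)        // step 2: meta tag
         ?: charset_from_headers($headers)  // step 3: HTTP Content-Type header
         ?: 'UTF-8';                        // step 4: last-resort default

     // Step 5: convert to the crawler's working encoding.
     return mb_convert_encoding($html, 'UTF-8', $encoding);
 }

 echo to_utf8('http://example.com/'); // placeholder URL

Strict detection (the third argument of mb_detect_encoding()) is used here because the default lenient mode is even more prone to wrong guesses, which is also the problem the answer below demonstrates.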
I'm trying to extract content as accurately as possible, rather than simply ignoring pages whose developers did not set their headers properly.

What problems do you see with this approach?

Will I run into problems using mb_detect_encoding() and mb_convert_encoding()?

1 answer

Yes, you will have problems. mb_detect_encoding() is not very reliable; see the following examples:

This outputs bool(false), indicating that detection failed:

 var_dump(mb_detect_encoding(file_get_contents('http://www.pazaruvaj.com/'))); 

This other one prints string(5) "UTF-8", which is obviously wrong. Both the HTTP header and the http-equiv meta tag are set correctly on this site, and the content is not valid UTF-8:

 var_dump(mb_detect_encoding(file_get_contents('http://www.arukereso.hu/'))); 

I suggest you combine all the available methods, including external libraries (for example: http://mikolajj.republika.pl/ ), and pick the most probable encoding.
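As a toy illustration of "pick the most probable encoding", one can collect whatever each method reports and take the value named most often; the candidate values below are placeholders, not real detection results.

 <?php
 // Hypothetical results gathered from mb_detect_encoding(), the meta tag,
 // the HTTP header and any external detector; null means "no opinion".
 $candidates = array_filter(['UTF-8', 'ISO-8859-1', 'UTF-8', null]);

 // Count the votes and take the most frequent value, defaulting to UTF-8.
 $votes = array_count_values($candidates);
 arsort($votes);
 $encoding = array_key_first($votes) ?: 'UTF-8';
 echo $encoding; // prints UTF-8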

Another way to narrow this down is to build a list of plausible character sets per country and pass only those to mb_convert_encoding. In Hungary, for example, ISO-8859-2 or UTF-8 are the most likely; other encodings are not worth considering. The country can be guessed from a combination of the TLD, the Content-Language HTTP header, and IP geolocation. This takes some research and further development, but it can be worth the effort.
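A rough sketch of that per-country idea, guessing the country from the TLD only (the Content-Language header and IP geolocation mentioned above are left out); the helper likely_charsets() is my own, and its charset table is illustrative, not exhaustive:

 <?php
 function likely_charsets(string $url): array
 {
     $host = (string) parse_url($url, PHP_URL_HOST);
     $dot  = strrchr($host, '.');
     $tld  = $dot !== false ? strtolower(substr($dot, 1)) : '';

     // Illustrative table; extend per country as needed.
     $table = [
         'hu' => ['UTF-8', 'ISO-8859-2'],
         'jp' => ['UTF-8', 'SJIS', 'EUC-JP'],
         'ru' => ['UTF-8', 'Windows-1251', 'KOI8-R'],
     ];
     return $table[$tld] ?? ['UTF-8', 'ISO-8859-1', 'Windows-1252'];
 }

 $url  = 'http://www.arukereso.hu/';
 $html = file_get_contents($url);

 // A short, country-specific candidate list plus strict mode makes
 // mb_detect_encoding() much less likely to report UTF-8 for content
 // that is not actually valid UTF-8.
 $encoding = mb_detect_encoding($html, likely_charsets($url), true) ?: 'UTF-8';
 echo mb_convert_encoding($html, 'UTF-8', $encoding);

mb_convert_encoding() also accepts a list as its from_encoding argument and will pick among those encodings itself, which is the variant the answer suggests.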

Some comments in the mb_convert_encoding documentation say that iconv works better for Japanese character sets.
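For completeness, a minimal sketch of the iconv() route for a Japanese page, assuming the source is already known to be Shift_JIS (the URL is a placeholder):

 <?php
 // //IGNORE drops characters that cannot be represented in the target
 // charset instead of aborting the whole conversion.
 $sjis = file_get_contents('http://example.jp/'); // placeholder URL
 $utf8 = iconv('SHIFT_JIS', 'UTF-8//IGNORE', $sjis);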

