Yes, you will run into problems: mb_detect_encoding is not very reliable, as the following examples show.

This outputs bool(false), indicating that detection failed:
var_dump(mb_detect_encoding(file_get_contents('http://www.pazaruvaj.com/')));
This one prints string(5) "UTF-8", which is obviously wrong: the HTTP header and the http-equiv meta tag on this site are set correctly, and the content is not valid UTF-8:
var_dump(mb_detect_encoding(file_get_contents('http://www.arukereso.hu/')));
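To illustrate why non-strict detection can mislead you: in strict mode (the third argument), mb_detect_encoding rejects byte sequences that are not valid in the candidate encoding, and mb_check_encoding can be used as a direct validity test. A minimal sketch, using a hard-coded ISO-8859-2 byte string rather than the live sites above:

```php
<?php
// "szép" encoded in ISO-8859-2/ISO-8859-1: 0xE9 is not valid UTF-8 here,
// because it is followed by the ASCII byte 'p' instead of a continuation byte.
$bytes = "sz\xE9p";

// A plain validity check against UTF-8 fails, as expected:
var_dump(mb_check_encoding($bytes, 'UTF-8'));        // bool(false)

// Strict-mode detection with UTF-8 as the only candidate also fails,
// instead of optimistically reporting "UTF-8":
var_dump(mb_detect_encoding($bytes, 'UTF-8', true)); // bool(false)
```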
I suggest you combine all available methods, including external libraries (for example, this one: http://mikolajj.republika.pl/ ), and pick the most probable encoding.
Another way to make this more reliable is to build a list of plausible character sets for a given country and restrict mb_convert_encoding to those. In Hungary, for example, ISO-8859-2 or UTF-8 are by far the most likely; other encodings are not worth considering. The country can be guessed from a combination of the TLD, the Content-Language HTTP header, and IP-based geolocation. This requires some research and extra development, but it can be worth the effort.
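The country-restricted approach can be sketched as follows. This assumes Hungary as the guessed country, so the candidate list is just ISO-8859-2 and UTF-8; the sample bytes are a hard-coded ISO-8859-2 string standing in for fetched page content:

```php
<?php
// "árvíztűrő" encoded in ISO-8859-2 (a common Hungarian test word).
$raw = "\xE1rv\xEDzt\xFBr\xF5";

// Candidates plausible for a Hungarian site; order matters, and
// strict mode (third argument) rejects invalid byte sequences.
$candidates = ['UTF-8', 'ISO-8859-2'];
$detected = mb_detect_encoding($raw, $candidates, true);

// Convert to UTF-8 only if detection succeeded and conversion is needed.
if ($detected !== false && $detected !== 'UTF-8') {
    $raw = mb_convert_encoding($raw, 'UTF-8', $detected);
}
```

Here the invalid-as-UTF-8 bytes rule out the first candidate, so strict detection falls through to ISO-8859-2 and the string is converted.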
Some comments in the mb_convert_encoding
documentation say that iconv
works better for Japanese character sets.
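For the iconv route, a minimal sketch of converting known-encoding Japanese bytes to UTF-8 (this assumes the source encoding, Shift_JIS here, has already been determined; iconv converts between known encodings, it does not detect them):

```php
<?php
// "日本" encoded in Shift_JIS (0x93FA 0x967B).
$sjis = "\x93\xFA\x96\x7B";

// Convert to UTF-8; iconv returns false on failure, so check the result.
$utf8 = iconv('SJIS', 'UTF-8', $sjis);
if ($utf8 === false) {
    // Handle the conversion error, e.g. fall back to another candidate encoding.
}
```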