Simple HTML DOM character encoding problem

Question

Hey guys, I use Simple HTML DOM to extract content from another site, but the extracted text has a character encoding problem: some characters appear as a small diamond with a question mark inside.

The problem only affects the extracted content; all the other text on my site displays correctly.

If anyone can help, that would be great.

+4

php character-encoding simple-html-dom

Belgin fish Dec 29 '10 at 1:45

3 answers

Open the source site in a browser and check which encoding it uses by looking at the page information. Then convert the extracted text to UTF-8:

 $text = iconv(mb_detect_encoding($text), "UTF-8//TRANSLIT//IGNORE", $text);
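Because mb_detect_encoding is a heuristic, a more deterministic variant is to check whether the text is already valid UTF-8 and convert only when it is not. The following is a minimal sketch of that idea, not part of the original answer; the helper name to_utf8 and the ISO-8859-1 fallback are assumptions for illustration and should be replaced with the encoding the source site actually uses.

```php
<?php
// Hypothetical helper: return $text as UTF-8.
// $fallback is an assumption: the encoding the source site is believed to use.
function to_utf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    // Valid UTF-8 is passed through unchanged, so it is never double-encoded.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise treat the bytes as $fallback and convert them to UTF-8.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

// The input is "café" encoded as ISO-8859-1 (0xE9 is "é" in that encoding).
echo to_utf8("caf\xE9"), "\n";
```

This requires the mbstring extension. It could be applied to each string extracted from the page before it is echoed.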

+2

Dejan marjanovic Dec 29 '10 at 1:54

I also had this problem, but in my case it was not an encoding problem. The page was served with gzip compression, which Simple HTML DOM does not handle. Here is my solution: use file_get_html2 instead of file_get_html.

 // Fetch a URL with cURL, sending browser-like headers, and return the response body.
 function curl($url) {
     $headers = array();
     $headers[] = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";
     $headers[] = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
     $headers[] = "Accept-Language:en-us,en;q=0.5";
     $headers[] = "Accept-Encoding:gzip,deflate";
     $headers[] = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
     $headers[] = "Keep-Alive:115";
     $headers[] = "Connection:keep-alive";
     $headers[] = "Cache-Control:max-age=0";

     $curl = curl_init();
     curl_setopt($curl, CURLOPT_URL, $url);
     curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
     // Let cURL decompress the response body automatically.
     curl_setopt($curl, CURLOPT_ENCODING, "gzip");
     // Return the body as a string instead of printing it.
     curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
     // Follow HTTP redirects.
     curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
     $data = curl_exec($curl);
     curl_close($curl);
     return $data;
 }

 // Drop-in replacement for file_get_html that fetches the page through curl().
 function file_get_html2($url) {
     return str_get_html(curl($url));
 }
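If the page body was fetched some other way and might still be compressed, one way to check is to look for the two gzip magic bytes (0x1f 0x8b) at the start of the data. This is a minimal sketch, not part of the original answer; the helper name maybe_gunzip is hypothetical, and it assumes the zlib extension is available.

```php
<?php
// Hypothetical helper: decompress $body only if it starts with the gzip magic bytes.
function maybe_gunzip(string $body): string
{
    // A gzip stream always begins with the bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // not gzip: return the data unchanged
    }
    // gzdecode() returns false on corrupt input; fall back to the raw data then.
    $decoded = @gzdecode($body);
    return $decoded === false ? $body : $decoded;
}

// Round trip: compress a snippet, then let the helper detect and decompress it.
echo maybe_gunzip(gzencode('<p>hello</p>')), "\n";
```

The result could then be passed to str_get_html in the same way as the output of curl().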

+2

Rollingo Jun 20 '12 at 18:19

karim79 · Accepted Answer · 2010-12-29T01:51:42+0000

Try using iconv to convert the scraped text from its original encoding to the encoding that you use on your page.

Signature:

 string iconv ( string $in_charset , string $out_charset , string $str )

Example:

 echo iconv("ISO-8859-1", "UTF-8", $text);
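The example hard-codes ISO-8859-1 as the source encoding. When the source encoding is not known in advance, one option is to read the charset declared in the page's meta tag and pass that to iconv. The sketch below is an illustration under that assumption, not part of the original answer; the helper name html_to_utf8, the regular expression, and the ISO-8859-1 default are all hypothetical choices.

```php
<?php
// Hypothetical helper: convert an HTML string to UTF-8 using its declared charset.
// $default is an assumption, used when the page declares no charset.
function html_to_utf8(string $html, string $default = 'ISO-8859-1'): string
{
    $charset = $default;
    // Matches <meta charset="..."> and the older
    // <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        $charset = strtoupper($m[1]);
    }
    if ($charset === 'UTF-8') {
        return $html; // already in the target encoding
    }
    // iconv() returns false for an unknown charset; keep the original text then.
    $converted = @iconv($charset, 'UTF-8', $html);
    return $converted === false ? $html : $converted;
}

// "café" encoded as ISO-8859-1 inside a page that declares that charset.
echo html_to_utf8("<meta charset=\"iso-8859-1\"><p>caf\xE9</p>"), "\n";
```

A charset sent in the HTTP Content-Type header takes precedence over the meta tag, so this sketch can guess wrong when the two disagree.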
