Simple HTML DOM character encoding problem

Hey guys, I use Simple HTML DOM to extract content from another site, but the extracted text has a character encoding problem: some characters appear as a small diamond with a question mark inside.

The problem only affects the extracted content; all the other text on my site displays correctly.

If someone can help, that would be great.

3 answers

Try using iconv to convert the scraped text from the source site's encoding to the encoding that you use on your page.

Signature:

 string iconv ( string $in_charset , string $out_charset , string $str ) 

Example:

 echo iconv("ISO-8859-1", "UTF-8", $text); 
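
A minimal sketch of how that call could be wrapped, assuming the source site serves ISO-8859-1 and your own page is UTF-8. The helper name `to_utf8` is hypothetical and not part of Simple HTML DOM. Since iconv returns false on failure, the sketch falls back to the unconverted text:

```php
<?php
// Hypothetical helper: convert scraped text to UTF-8.
// The default source charset is an assumption; pass the charset
// that the scraped site actually declares.
function to_utf8(string $text, string $sourceCharset = 'ISO-8859-1'): string
{
    // iconv() returns false if the conversion fails; @ silences the
    // warning it raises in that case.
    $converted = @iconv($sourceCharset, 'UTF-8', $text);

    // Fall back to the original text rather than returning false.
    return $converted === false ? $text : $converted;
}

$latin1 = "caf\xE9";          // "café" encoded as ISO-8859-1
echo to_utf8($latin1), "\n";  // same word, now encoded as UTF-8
```

The converted string can then be echoed on a UTF-8 page without producing the replacement character.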

Go to the source site and check its encoding by looking at the page information. You can also let mb_detect_encoding guess the source encoding and convert from that:

 $text = iconv(mb_detect_encoding($text), "UTF-8//TRANSLIT//IGNORE", $text); 
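
Detection without a candidate list can guess wrong, so here is a slightly more defensive sketch. It assumes the mbstring extension is loaded and that the source is either UTF-8 or ISO-8859-1; the function name `guess_and_convert` and the candidate list are illustrative assumptions, so adjust them to the sites you scrape:

```php
<?php
// Hypothetical helper: guess the encoding from a short candidate list
// and convert to UTF-8 only when the text is not UTF-8 already.
function guess_and_convert(string $text, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Strict mode (third argument) rejects candidates for which the
    // bytes are not valid.
    $detected = mb_detect_encoding($text, $candidates, true);

    // Nothing matched, or the text is already UTF-8: leave it alone.
    if ($detected === false || $detected === 'UTF-8') {
        return $text;
    }

    $converted = mb_convert_encoding($text, 'UTF-8', $detected);
    return $converted === false ? $text : $converted;
}

echo guess_and_convert("caf\xE9"), "\n"; // ISO-8859-1 input, UTF-8 output
```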

I also had this problem, but in my case it was not an encoding problem. The response was gzip-compressed, and Simple HTML DOM does not handle that. Here is my solution: use file_get_html2 instead of file_get_html.

 // Fetch a URL with cURL, sending browser-like headers.
 function curl($url){
     $headers = array();
     $headers[] = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";
     $headers[] = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
     $headers[] = "Accept-Language:en-us,en;q=0.5";
     $headers[] = "Accept-Encoding:gzip,deflate";
     $headers[] = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
     $headers[] = "Keep-Alive:115";
     $headers[] = "Connection:keep-alive";
     $headers[] = "Cache-Control:max-age=0";

     $curl = curl_init();
     curl_setopt($curl, CURLOPT_URL, $url);
     curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
     curl_setopt($curl, CURLOPT_ENCODING, "gzip");    // let cURL decompress gzip responses
     curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);   // return the body instead of printing it
     curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);   // follow redirects
     $data = curl_exec($curl);
     curl_close($curl);
     return $data;
 }

 // Replacement for file_get_html() that downloads the page through curl().
 function file_get_html2($url){
     return str_get_html(curl($url));
 }
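
To check whether gzip is really the cause before switching to cURL, one option is to look at the first two bytes of the downloaded body: a gzip stream starts with 0x1f 0x8b. The sketch below assumes the zlib extension is available; `maybe_gunzip` is a hypothetical helper name, not part of Simple HTML DOM:

```php
<?php
// Hypothetical helper: decompress the body if it looks like gzip,
// otherwise return it unchanged.
function maybe_gunzip(string $body): string
{
    // Every gzip stream begins with the magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body;
    }

    // gzdecode() returns false on corrupt data; @ silences its warning.
    $decoded = @gzdecode($body);
    return $decoded === false ? $body : $decoded;
}

$compressed = gzencode("<p>hi</p>");     // simulate a gzip-compressed response
echo maybe_gunzip($compressed), "\n";    // the original HTML again
```

The result can be passed to str_get_html in the same way as the output of curl() above.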

Source: https://habr.com/ru/post/1333625/

