I am looking for a cheek on a Chinese site using PHP and CURL . I used to have a problem with compressed results, and SO helped me figure it out. Now I am facing a problem while parsing content using PHP - DOMDocument . The error is as follows:
Warning: DOMDocument::loadHTML(): input conversion failed due to input error, bytes 0xE3 0x80 0x90 0xE8 in /var/www/html/ ..
Even though this warning does not give further results.
My code is below:
$agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL,$url);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('text/html; charset=gb2312'));
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_ENCODING, "");
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
curl_setopt($curl, CURLOPT_TIMEOUT, 1000);
$html = curl_exec($curl) or die("error: ".curl_error($curl));
curl_close($curl);
$htmlParsed = mb_convert_encoding($result,'utf-8','gb2312');
$doc = new DOMDocument();
$doc->loadHTML($htmlParsed);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="test"]//a/@href');
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<br/>[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
I found the content type on my target website as
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
So, I tried to convert the result to utf-8.
DOMDocument:: loadHTML(), -, .
, . Thanx .
( HTML DOM, . SO . PHP DOM Parser)