DOMDocument::loadHTML(): input conversion failed due to input error

Question

I am scraping a Chinese website using PHP and cURL. I previously had a problem with compressed results, and SO helped me sort it out. Now I am facing a problem while parsing the content with PHP's DOMDocument. The error is as follows:

Warning: DOMDocument::loadHTML(): input conversion failed due to input error, bytes 0xE3 0x80 0x90 0xE8 in /var/www/html/ ..

Although this is only a warning, the script produces no further results.

My code is below:

$agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0';
$curl = curl_init(); 
curl_setopt($curl, CURLOPT_URL,$url); 
curl_setopt($curl, CURLOPT_HTTPHEADER, array('text/html; charset=gb2312')); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);  
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_ENCODING, "");  // handling all compressions 
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
curl_setopt($curl, CURLOPT_TIMEOUT, 1000);
$html = curl_exec($curl) or die("error: ".curl_error($curl));
curl_close($curl);
$htmlParsed = mb_convert_encoding($html, 'utf-8', 'gb2312'); // convert the fetched page to UTF-8

$doc = new DOMDocument();
$doc->loadHTML($htmlParsed);

$xpath = new DOMXpath($doc);

$elements = $xpath->query('//div[@class="test"]//a/@href');

if (!is_null($elements)) {
  foreach ($elements as $element) {
    echo "<br/>[". $element->nodeName. "]";

    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
      echo $node->nodeValue. "\n";
    }
  }
}

I found the content type on my target website as

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

So, I tried to convert the result to utf-8.

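I suspect DOMDocument::loadHTML() is the cause, but I am not sure. Any help would be appreciated. Thanks.

(I considered Simple HTML DOM, but following discussions on SO I am using the PHP DOM parser instead.)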

dom php curl parsing web-scraping

Surabhil 29 . '14 9:14

jianyong · Answer 1 · 2015-09-26T13:30:59+0000

Convert the UTF-8 source to HTML entities before loading it:

$html = new DOMDocument();
$html_source = get_html(); // placeholder for however the page is fetched
$html_source = mb_convert_encoding($html_source, "HTML-ENTITIES", "UTF-8");
$html->loadHTML($html_source);
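
On PHP 8.2 and later, passing "HTML-ENTITIES" to mb_convert_encoding is deprecated. A minimal sketch of the same idea using mb_encode_numericentity follows. The helper name to_numeric_entities and the sample markup are assumptions made for illustration; they are not part of the answer above.

```php
<?php
// Hypothetical helper: turn every non-ASCII character of a UTF-8 string into
// a numeric character reference, leaving ASCII markup untouched.
function to_numeric_entities(string $utf8): string
{
    // Map: code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF.
    return mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

// Sample markup (an assumption); "\xE3\x80\x90" is the UTF-8 encoding of 【.
$source = '<div class="test"><a href="/item/1">' . "\xE3\x80\x90" . 'news</a></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML(to_numeric_entities($source));
libxml_clear_errors();

$link = $doc->getElementsByTagName('a')->item(0);
echo $link->getAttribute('href'), ' ', $link->textContent, "\n";
```

Because the string handed to loadHTML() is pure ASCII after this step, the parser should not need to guess or convert any encoding.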

Reid Johnson · Answer 2 · 2015-01-20T05:49:06+0000

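By default, DomDocument appears to treat its input as ISO-8859-1. The character 【 (present in the gb2312 page) produces a byte in the 0x80 range, which is not valid in ISO-8859-1, so the conversion inside DomDocument fails.

If you know the encoding of the HTML, convert it explicitly, for example with mb_convert_encoding or iconv from ISO-8859-1 to UTF-8, before passing it to DomDocument.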
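
The bytes in the warning, 0xE3 0x80 0x90, are the UTF-8 encoding of 【 (U+3010). One possible reading is that the converted string is UTF-8 while its meta tag still declares gb2312, so the parser tries to decode UTF-8 bytes as gb2312. The sketch below works under that assumption and rewrites the declared charset after conversion. The function name load_gb2312_html and the sample page are hypothetical.

```php
<?php
// Hypothetical helper: convert a GB2312 page to UTF-8 and make the declared
// charset agree with the converted bytes before handing it to DOMDocument.
function load_gb2312_html(string $raw): DOMDocument
{
    // 'EUC-CN' is mbstring's name for the usual byte encoding of GB2312.
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'EUC-CN');
    // Rewrite only the charset value; the rest of the meta tag stays as-is.
    $utf8 = preg_replace('/(charset\s*=\s*["\']?)gb2312/i', '${1}utf-8', $utf8);

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $doc->loadHTML($utf8);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Sample page (an assumption), built in UTF-8 and then encoded as GB2312.
$page = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head>'
      . '<body><div class="test"><a href="/item/1">' . "\xE3\x80\x90" . '</a></div></body></html>';
$raw = mb_convert_encoding($page, 'EUC-CN', 'UTF-8');

$doc = load_gb2312_html($raw);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@class="test"]//a') as $a) {
    echo $a->getAttribute('href'), ' => ', bin2hex($a->textContent), "\n";
}
```

DOMDocument hands strings back in UTF-8, so the hex dump of the link text is a quick way to check that the character survived the round trip.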

Rafał · Answer 3 · 2016-03-24T10:07:03+0000

<?php
$contents = file_get_contents('xml.xml');
function convert_utf8( $string ) { 
    if ( strlen(utf8_decode($string)) == strlen($string) ) {   
        // $string is not UTF-8
        return iconv("ISO-8859-1", "UTF-8", $string);
    } else {
        // already UTF-8
        return $string;
    }
}

// mb_convert_encoding takes the target encoding first, then the source encoding
$contents = mb_convert_encoding($contents, "UTF-8", mb_detect_encoding($contents));

$xml = simplexml_load_string(convert_utf8($contents));
print_r($xml);
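
utf8_decode is deprecated as of PHP 8.2. A sketch of the same check written with mb_check_encoding follows. The function name ensure_utf8 and its ISO-8859-1 fallback parameter are assumptions that mirror the code above.

```php
<?php
// Hypothetical replacement for convert_utf8(): keep valid UTF-8 as-is,
// otherwise treat the bytes as the fallback encoding and convert them.
function ensure_utf8(string $string, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($string, 'UTF-8')) {
        // already valid UTF-8 (pure ASCII also passes this check)
        return $string;
    }
    return mb_convert_encoding($string, 'UTF-8', $fallback);
}

// "caf\xE9" is "café" in ISO-8859-1; the result is its UTF-8 form.
echo bin2hex(ensure_utf8("caf\xE9")), "\n";
```

The fallback is only a guess: bytes that are not valid UTF-8 could come from any single-byte or multi-byte encoding, so pass the real source encoding when it is known.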
