Font or Unicode problem in a scraper

Question

I am trying to scrape information from a website.

On that website, the address is displayed as:

127 East Zhongshan No 2 Rd; 中山东二路127号

But when I scrape it and echo it, it shows:

127 East Zhongshan No 2 Rd; ä¸å±±ä¸äºè·¯127å·

I have also tried using UTF-8.

Here is my PHP code. Please help me solve this problem.

function GrabPage($site){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_TIMEOUT, 40);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_URL, $site);
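    // With CURLOPT_RETURNTRANSFER set, curl_exec() returns the body as a string,
    // so no output buffering is needed.
    $result = curl_exec($ch);
    // Close the handle before returning; code after a return never runs.
    curl_close($ch);
    return $result;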
}
$GrabData   = GrabPage($site);

$dom    = new DOMDocument();
@$dom->loadHTML($GrabData);

$xpath  = new DOMXpath($dom);


$mainElements = array();
$mainElements = $xpath->query("//div[@class='col--one-whole mv--col--one-half wv--col--one-whole'][1]/dl/dt");

foreach ($mainElements as $Names2) {
    $Name2  = $Names2->nodeValue;
    echo "$Name2";
}

html php xpath domdocument web-scraping

Feroz ahmed Apr 23 '15 at 5:24

2 answers

First of all, check whether the fetched HTML source is actually encoded correctly (for example, as valid UTF-8). If it is, try:

utf8_decode($Name2)
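A minimal sketch of the check described in this answer, assuming the mbstring extension is available. The helper names isValidUtf8, misreadAsLatin1 and undoMisread are hypothetical and not part of the original post. It also illustrates why decoding can help: if the garbling comes from UTF-8 bytes being interpreted as ISO-8859-1, converting back from UTF-8 to ISO-8859-1 restores the original bytes. Since utf8_decode() is deprecated as of PHP 8.2, the sketch uses mb_convert_encoding for the same conversion.

```php
<?php
// Hypothetical helper: true if the fetched body is valid UTF-8.
function isValidUtf8(string $body): bool
{
    return mb_check_encoding($body, 'UTF-8');
}

// Simulates the damage: treats each UTF-8 byte as an ISO-8859-1 character
// and re-encodes the result as UTF-8 (double encoding).
function misreadAsLatin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// Reverses the double encoding; for this kind of input it has the same
// effect as utf8_decode().
function undoMisread(string $garbled): string
{
    return mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
}

$original = "中山东二路127号";        // assumes this file is saved as UTF-8
$latin1   = "\xE9t\xE9";             // "été" in ISO-8859-1, not valid UTF-8
$garbled  = misreadAsLatin1($original);

var_dump(isValidUtf8($original));             // is the clean string valid UTF-8?
var_dump(isValidUtf8($latin1));               // is the Latin-1 string valid UTF-8?
var_dump(undoMisread($garbled) === $original); // does the round trip restore it?
```

In the scraper, the same check could be applied to $GrabData right after GrabPage() returns, before the markup is handed to DOMDocument.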

Clain Dsilva Apr 23 '15 at 5:43

Ghost · Accepted Answer · 2015-04-23T05:42:16+0000

First, set the output encoding at the very top of the PHP file, before any output is sent:

header('Content-Type: text/html; charset=utf-8');

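Then convert the HTML markup you fetched with mb_convert_encoding before loading it. When the markup carries no charset declaration, DOMDocument::loadHTML typically does not treat it as UTF-8, so multibyte characters come out garbled: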

@$dom->loadHTML(mb_convert_encoding($GrabData, 'HTML-ENTITIES', 'UTF-8'));
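The conversion step can be sketched as a self-contained script, with an inline HTML string standing in for the page returned by GrabPage(). The sample markup, the simplified XPath and the helper name extractTerms are assumptions made for illustration. Because the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2, this sketch swaps in mb_encode_numericentity, which likewise turns every non-ASCII character into a numeric entity before DOMDocument parses the markup.

```php
<?php
// Hypothetical helper: returns the text of every <dt> inside a <dl>
// in the given UTF-8 encoded HTML.
function extractTerms(string $html): array
{
    // Replace each non-ASCII character with a decimal numeric entity
    // (for example &#20013;) so the parser never has to guess the encoding.
    $map   = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    // Collect parser warnings internally instead of silencing them with @.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $terms = [];
    foreach ($xpath->query('//dl/dt') as $node) {
        $terms[] = trim($node->nodeValue);
    }
    return $terms;
}

// Made-up sample markup standing in for the scraped page.
$html = '<html><body><dl>'
      . '<dt>127 East Zhongshan No 2 Rd; 中山东二路127号</dt>'
      . '</dl></body></html>';

foreach (extractTerms($html) as $term) {
    echo $term, "\n";
}
```

When the script is served over HTTP rather than run from the command line, the header('Content-Type: text/html; charset=utf-8') call from this answer is still needed so the browser interprets the echoed bytes as UTF-8.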
