Need a good HTML parser in PHP

I found this: http://simplehtmldom.sourceforge.net/ , but it didn't work.

When extracting the page http://php.net/manual/en/function.curl-setopt.php and parsing it to plain HTML, it failed and returned only a partial HTML page.

This is what I want to do: go to an HTML page and get its components individually (the contents of all div and p elements in the hierarchy). I like the functions of simplehtmldom, so any similar parser will do, as long as it handles whole pages reliably (best case and worst).

3 answers

I often use DOMDocument::loadHTML, which works reasonably well in general cases, and I like to query documents with XPath as soon as they are loaded into the DOM.
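A minimal sketch of that approach, using an inline HTML string as a stand-in for a fetched page (for a live page you would fetch the markup with file_get_contents() or cURL first):

```php
<?php
// Tolerate real-world, slightly malformed HTML without spraying warnings.
libxml_use_internal_errors(true);

$html = '<html><body><div>First div<p>A paragraph</p></div><p>Loose paragraph</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// Walk every div and p in document order and print its text content.
foreach ($xpath->query('//div | //p') as $node) {
    echo trim($node->textContent), "\n";
}
```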




I use cURL together with XPath. Here is an example.

protected function _get_xpath($url) {
    $referrer = 'http://www.whatever.com/';
    $useragent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    // create curl resource
    $ch = curl_init();

    // set url and request headers
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
    curl_setopt($ch, CURLOPT_REFERER, $referrer);
    curl_setopt($ch, CURLOPT_URL, $url);

    // return the transfer as a string and follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    // $output contains the output string
    $output = curl_exec($ch);

    if (curl_errno($ch)) {
        echo 'Curl error: ' . curl_error($ch);
    }
    else {
        $dom = new DOMDocument();
        @$dom->loadHTML($output); // suppress warnings from malformed HTML
        $this->xpath = new DOMXPath($dom);
        $this->html = $output;
    }

    // close curl resource to free up system resources
    curl_close($ch);
}

Then evaluate an XPath expression:

$resultDom = $this->xpath->evaluate("//span[@id='headerResults']/strong");
$this->results = $resultDom->item(0)->nodeValue;
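Note that evaluate() returns an empty node list when nothing matches, so it is worth guarding against a missing item before reading nodeValue (a minimal sketch, using the same property names as the snippet above):

```php
$resultDom = $this->xpath->evaluate("//span[@id='headerResults']/strong");
if ($resultDom instanceof DOMNodeList && $resultDom->length > 0) {
    $this->results = $resultDom->item(0)->nodeValue;
} else {
    $this->results = null; // selector matched nothing on this page
}
```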

I found the best one for my use here: http://querypath.org/
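QueryPath offers a jQuery-style, CSS-selector API on top of the DOM. A rough sketch of what matching all div and p elements looks like (function and method names follow QueryPath's documented interface; check them against the version you install via Composer):

```php
require 'vendor/autoload.php'; // QueryPath installed via Composer

// Load an HTML string and select elements with a CSS selector.
foreach (htmlqp($html)->find('div, p') as $el) {
    echo $el->text(), "\n";
}
```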


Source: https://habr.com/ru/post/1725151/

