Need a good HTML parser in PHP

I found this: http://simplehtmldom.sourceforge.net/ , but it didn't work.

When extracting the page http://php.net/manual/en/function.curl-setopt.php and parsing it to plain HTML, it failed and returned only a partial HTML page.

This is what I want to do: go to an HTML page and get its components individually (the contents of all div and p elements in the hierarchy). I like the functions of simplehtmldom, so any similar parser will do, as long as it handles whole pages reliably (best case and worst).

3 answers

I often use DOMDocument::loadHTML, which works reasonably well in general cases, and I like to query documents with XPath as soon as they are loaded into the DOM.
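A minimal sketch of that approach, using an inline HTML string as a stand-in for a fetched page (for a live page you would fetch the markup with file_get_contents() or cURL first):

```php
<?php
// Tolerate real-world, slightly malformed HTML without spraying warnings.
libxml_use_internal_errors(true);

$html = '<html><body><div>First div<p>A paragraph</p></div><p>Loose paragraph</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// Walk every div and p in document order and print its text content.
foreach ($xpath->query('//div | //p') as $node) {
    echo trim($node->textContent), "\n";
}
```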




I use cURL together with XPath. Here is an example.

protected function _get_xpath($url) {
    $referrer = 'http://www.whatever.com/';
    $useragent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    // create curl resource
    $ch = curl_init();

    // set url and request headers
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
    curl_setopt($ch, CURLOPT_REFERER, $referrer);
    curl_setopt($ch, CURLOPT_URL, $url);

    // return the transfer as a string and follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    // $output contains the output string
    $output = curl_exec($ch);

    if (curl_errno($ch)) {
        echo 'Curl error: ' . curl_error($ch);
    }
    else {
        $dom = new DOMDocument();
        @$dom->loadHTML($output); // suppress warnings from malformed HTML
        $this->xpath = new DOMXPath($dom);
        $this->html = $output;
    }

    // close curl resource to free up system resources
    curl_close($ch);
}

Then evaluate an XPath expression:

$resultDom = $this->xpath->evaluate("//span[@id='headerResults']/strong");
$this->results = $resultDom->item(0)->nodeValue;
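Note that evaluate() returns an empty node list when nothing matches, so it is worth guarding against a missing item before reading nodeValue (a minimal sketch, using the same property names as the snippet above):

```php
$resultDom = $this->xpath->evaluate("//span[@id='headerResults']/strong");
if ($resultDom instanceof DOMNodeList && $resultDom->length > 0) {
    $this->results = $resultDom->item(0)->nodeValue;
} else {
    $this->results = null; // selector matched nothing on this page
}
```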

I found the best one for my use here: http://querypath.org/
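QueryPath offers a jQuery-style, CSS-selector API on top of the DOM. A rough sketch of what matching all div and p elements looks like (function and method names follow QueryPath's documented interface; check them against the version you install via Composer):

```php
require 'vendor/autoload.php'; // QueryPath installed via Composer

// Load an HTML string and select elements with a CSS selector.
foreach (htmlqp($html)->find('div, p') as $el) {
    echo $el->text(), "\n";
}
```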


Source: https://habr.com/ru/post/1725151/

