PHP Scrape Article Excerpt as Readability

Question

I saw this question, but it does not cover what I am looking for. Its answers either lift the excerpt from the meta description tag or build an excerpt from an article whose body you already have.

What I want to do is get the first few sentences of the article itself, the way Readability does. What is the best way to do this? HTML parsing? Here is what I am using now, but it is not very reliable.

    function guessExcerpt($url) {
        $html = file_get_contents_curl($url);

        $doc = new DOMDocument();
        @$doc->loadHTML($html); // suppress warnings from malformed markup

        // Stays null when the page has no <meta name="description"> tag.
        $description = null;

        $metas = $doc->getElementsByTagName('meta');
        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('name') == 'description') {
                $description = $meta->getAttribute('content');
            }
        }

        return $description;
    }

    function file_get_contents_curl($url) {
        $ch = curl_init();

        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects

        $data = curl_exec($ch);
        curl_close($ch);

        return $data;
    }

+6

php web scraping

Alfo Jul 30 '12 at 16:19

1 answer

Muhammad abrar · Accepted Answer · 2012-09-23T04:11:28+0000

Here is a PHP port of Readability: https://github.com/feelinglucky/php-readability. Give it a try. The extraction result should be similar to Readability's, since it implements the Readability algorithm.

    require 'lib/Readability.inc.php';

    $html = file_get_contents_curl($url);

    $Readability     = new Readability($html, $html_input_charset); // default charset is utf-8
    $ReadabilityData = $Readability->getContent();

    $title   = $ReadabilityData['title'];
    $content = $ReadabilityData['content'];

You can then use the first few sentences of $content as an excerpt.
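As a rough sketch of that last step, the function below trims an HTML fragment such as $content down to its first few sentences. The name excerptFromHtml and its $maxSentences parameter are made up for illustration and are not part of php-readability. The sentence split is deliberately naive (it breaks after ., ! or ? followed by whitespace), so abbreviations like "e.g." will cut a sentence short.

```php
<?php
/**
 * Hypothetical helper (not part of php-readability): returns the first
 * $maxSentences sentences of an HTML fragment as plain text.
 */
function excerptFromHtml(string $html, int $maxSentences = 3): string
{
    // Turn common block-level tags into spaces so that adjacent
    // paragraphs do not run together once the markup is stripped.
    $html = preg_replace('#<(?:/?(?:p|div|li|h[1-6]|blockquote)|br)\b[^>]*>#i', ' ', $html) ?? '';

    // Drop the remaining tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace; preg_replace() returns null on
    // invalid UTF-8, in which case this falls back to an empty string.
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? '');

    // Naive split: a sentence ends at ., ! or ? followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    return implode(' ', array_slice($sentences, 0, $maxSentences));
}
```

With $content taken from the snippet above, a call would look like $excerpt = excerptFromHtml($content, 2);. If the extracted content starts with a heading, the heading text gets merged into the first sentence, so you may want to strip headings beforehand.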
