First, to use the REST APIs they provide, you need to create an account. You can then create your own token to use in the calls. The token shown in the examples will not work because it is intentionally invalid; it is there only for illustration.
Secondly, make sure the allow_url_fopen directive in your php.ini is set to true. For a test script, or if you cannot modify your php.ini (shared hosting solutions), you can try ini_set('allow_url_fopen', true); at the top of the page. Note that PHP treats allow_url_fopen as a system-level (PHP_INI_SYSTEM) setting, so this call may have no effect; check its return value.
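If you are unsure whether the setting is active on your host, you can check it at runtime before fetching. The following is a minimal sketch, assuming the cURL extension is available as a fallback; `urlFopenEnabled` and `fetchHtml` are hypothetical names that do not appear in the original code.

```php
<?php
// Hypothetical helper: reports whether file_get_contents() may open URLs.
function urlFopenEnabled() {
    // ini_get() returns a string such as "1" or ""; normalise it to a boolean.
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

// Hypothetical helper: fetch a page, falling back to cURL when URL wrappers are off.
function fetchHtml($url) {
    if (urlFopenEnabled()) {
        return file_get_contents($url);
    }
    if (!function_exists('curl_init')) {
        return false; // neither transport is available
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    return curl_exec($ch);                          // string on success, false on failure
}
```

With this in place, `$html = fetchHtml($url);` could stand in for the direct `file_get_contents($url)` call; both return false on failure.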
Finally, to analyze the images yourself, you will need to extract all the image elements from the DOM of the page you are pulling. Some pages will have images and some will not; it depends on the page. In addition, you will need to resolve relative paths ...
Your code
    require 'readability/lib/Readability.inc.php';

    $url = 'http://www.nextbigwhat.com';
    $html = file_get_contents($url);

    $Readability = new Readability($html); // default charset is utf-8
    $ReadabilityData = $Readability->getContent();

    $image = $ReadabilityData['lead_image_url'];
    $title = $ReadabilityData['title']; // This works fine.
    $content = $ReadabilityData['word_count'];
    echo "$content";
After you run Readability, you can use the DOMDocument class to retrieve the images from the content you pulled. Create a new DOMDocument and load your HTML into it. Be sure to call libxml_use_internal_errors(true) first to suppress the errors that the parser raises for the markup of most websites. We will put this in a function so it is easy to reuse elsewhere.
    function sampleDomMedia($html) {
        // Suppress parser errors
        libxml_use_internal_errors(true);
        // New document
        $dom = new DOMDocument();
        // Populate document
        $dom->loadHTML($html);
        //[...]
Now you can get all the image elements from the document you created and read the src attribute of each one, like this:
        //[...]
        // Get image elements
        $nodeList = $dom->getElementsByTagName('img');
        // Get length
        $length = $nodeList->length;
        // Initialize array
        $images = array();
        // Iterate over our nodes
        for($i=0;$i<$length;$i++) {
            // Get the current node
            $node = $nodeList->item($i);
            // Retrieve the src attribute
            $image = $node->getAttribute('src');
            // Push image src into $images array
            array_push($images,$image);
        }
        return $images;
    }
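To see the extraction step in isolation, here is a small sketch that runs the same DOMDocument calls on an inline HTML string. The function name `extractImageSources` and the sample markup are made up for illustration, and skipping `img` tags that have no src is an addition that the function above does not make.

```php
<?php
// Hypothetical standalone variant of the extraction step.
function extractImageSources($html) {
    // Collect parser warnings internally instead of printing them
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $sources = array();
    foreach ($dom->getElementsByTagName('img') as $node) {
        // getAttribute() returns an empty string when src is missing
        $src = trim($node->getAttribute('src'));
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

// Sample markup: two images with a src, one without
$sampleHtml = '<html><body><p>Intro</p><img src="a.png"><img alt="no source"><img src="/img/b.jpg"></body></html>';
$sources = extractImageSources($sampleHtml);
print_r($sources); // contains 'a.png' and '/img/b.jpg'
```

The paths come back exactly as written in the markup, so relative values such as `a.png` still need resolving before they can be used from another site.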
You now have an array of images that you can present to the user. But before you do that, there is one thing we forgot: we want to resolve all the relative paths so that we always have an absolute URL for each image, since the images live on another site.
To do this, we need to determine the base URL of the site and the URL of the directory that contains the page we are working with. We can do this using the parse_url() function provided by PHP. For simplicity, we can turn this into a function.
    function getUrls($url) {
        // Parse URL
        $urlArr = parse_url($url);
        // Default to the root path when the URL has none (e.g. http://www.nextbigwhat.com)
        $path = isset($urlArr['path']) ? $urlArr['path'] : '/';
        // Determine Base URL, with scheme, host, and port
        $base = $urlArr['scheme']."://".$urlArr['host'];
        if(array_key_exists("port",$urlArr) && $urlArr['port'] != 80) {
            $base .= ":".$urlArr['port'];
        }
        // Truncate the path at the last forward slash
        $relative = $base.substr($path, 0, strrpos($path,"/")+1);
        // Return our two URLs
        return array($base, $relative);
    }
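As a worked example of what parse_url() hands back and how the two URLs are derived from it, consider the sketch below. The sample URL is invented for illustration and is not one the original answer uses.

```php
<?php
// Sample URL chosen for illustration only.
$url = 'http://example.com:8080/blog/post.html?x=1';
$parts = parse_url($url);
// $parts holds: scheme => 'http', host => 'example.com', port => 8080,
//               path => '/blog/post.html', query => 'x=1'

// Base URL: scheme, host and the (non-default) port
$base = $parts['scheme'] . '://' . $parts['host'] . ':' . $parts['port'];

// Directory of the page: everything up to and including the last slash
$directory = substr($parts['path'], 0, strrpos($parts['path'], '/') + 1);
$relative = $base . $directory;

echo $base, "\n";     // http://example.com:8080
echo $relative, "\n"; // http://example.com:8080/blog/
```

The query string plays no part in either result, which is why only the scheme, host, port and path components are read.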
Add an extra parameter to the original sampleDomMedia function so that it can call getUrls to get our two paths. Then we can check the value of each src attribute to determine what kind of path it is and resolve it.
    function sampleDomMedia($html, $url) {
        // Retrieve our URLs
        list($baseUrl, $relativeUrl) = getUrls($url);
        libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        $nodeList = $dom->getElementsByTagName('img');
        $length = $nodeList->length;
        $images = array();
        for($i=0;$i<$length;$i++) {
            $node = $nodeList->item($i);
            $image = $node->getAttribute('src');
            // Resolve relative paths
            if(substr($image,0,2)=="//") {
                // Missing protocol
                $image = "http:".$image;
            } else if(substr($image,0,1)=="/") {
                // Path relative to the base URL
                $image = $baseUrl.$image;
            } else if(substr($image,0,4)!=="http") {
                // Path relative to the current directory
                $image = $relativeUrl.$image;
            }
            array_push($images,$image);
        }
        return $images;
    }
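The three branches are easier to follow on concrete values. The sketch below pulls the same checks into a hypothetical helper, `resolveImageUrl`, and runs it on made-up paths; neither the helper nor the sample URLs come from the original code.

```php
<?php
// Hypothetical helper mirroring the three checks in sampleDomMedia().
function resolveImageUrl($src, $baseUrl, $relativeUrl) {
    if (substr($src, 0, 2) == '//') {
        // Protocol-relative URL: prepend a scheme
        return 'http:' . $src;
    }
    if (substr($src, 0, 1) == '/') {
        // Path relative to the site root
        return $baseUrl . $src;
    }
    if (substr($src, 0, 4) !== 'http') {
        // Path relative to the directory of the current page
        return $relativeUrl . $src;
    }
    // Already absolute
    return $src;
}

// Sample values for illustration only
$base = 'http://example.com';
$relative = 'http://example.com/blog/';
$samples = array('//cdn.example.com/a.png', '/img/b.png', 'c.png', 'https://other.org/d.png');

foreach ($samples as $src) {
    echo $src, ' => ', resolveImageUrl($src, $base, $relative), "\n";
}
```

The order of the checks matters: the `//` test has to come before the single `/` test, otherwise protocol-relative URLs would be treated as root-relative paths.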
And last, but certainly not least, we are left with the two functions above and this piece of procedural code:
    require 'readability/lib/Readability.inc.php';

    $url = 'http://www.nextbigwhat.com';
    $html = file_get_contents($url);

    $Readability = new Readability($html); // default charset is utf-8
    $ReadabilityData = $Readability->getContent();

    $image = $ReadabilityData['lead_image_url'];
    $images = sampleDomMedia($html, $url);
    $title = $ReadabilityData['title']; // This works fine.
    $content = $ReadabilityData['word_count'];
    echo "$content";
In addition, if you think the article content itself may contain images (usually it does not), you can pass the content returned from Readability rather than the $html variable, for example:
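    $title = $ReadabilityData['title']; // This works fine.
    // Article HTML (assuming the library exposes it under the 'content' key;
    // 'word_count' is only a number and contains no markup)
    $content = $ReadabilityData['content'];
    $images = sampleDomMedia($content, $url);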
I hope this helps.