Using the Readability API to scrape the most relevant image from a page

I am using the Readability API for this. Their example shows a lead_image_url field, but I could not get it.

REFER: https://www.readability.com/developers/api/parser

Is this the right way to make a direct request?

It returns: {"messages": "The API Key in the form of the 'token' parameter is invalid.", "error": true}

Another attempt:

    <?php
    define('TOKEN', "1b830931777ac7c2ac954e9f0d67df437175e66e");
    define('API_URL', "https://www.readability.com/api/content/v1/parser?url=%s&token=%s");

    function get_image($url) {
        // sanitize it so we don't break our api url
        $encodedUrl = urlencode($url);
        $TOKEN = '1b830931777ac7c2ac954e9f0d67df437175e66e';
        $API_URL = 'https://www.readability.com/api/content/v1/parser?url=%s&token=%s';
        // $API_URL = 'http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas';

        // build our url
        $url = sprintf($API_URL, $encodedUrl, $TOKEN);

        // call the api
        $response = file_get_contents($url);
        if( $response ) {
            return false;
        }

        $json = json_decode($response);
        if(!isset($json['lead_image_url'])) {
            return false;
        }

        return $json['lead_image_url'];
    }

Error:

    Warning: file_get_contents(https://www.readability.com/api/content/v1/parser?url=http%3A%2F%2Fthenwat.com%2Fthenwat%2Finvite%2Findex.php&amp;token=1b830931777ac7c2ac954e9f0d67df437175e66e): failed to open stream: HTTP request failed! HTTP/1.1 403 FORBIDDEN in F:\wamp\www\inviteold\test2.php on line 32
    Warning: file_get_contents(https://www.readability.com/api/content/v1/parser?url=http%3A%2F%2Fthenwat.com%2Fthenwat%2Finvite%2Findex.php&amp;token=1b830931777ac7c2ac954e9f0d67df437175e66e): failed to open stream: HTTP request failed! HTTP/1.1 403 FORBIDDEN in F:\wamp\www\inviteold\test2.php on line 32

Another:

    require 'readability/lib/Readability.inc.php';

    $url = 'http://www.nextbigwhat.com';
    $html = file_get_contents($url);

    $Readability = new Readability($html); // default charset is utf-8
    $ReadabilityData = $Readability->getContent();

    $image = $ReadabilityData['lead_image_url'];
    $title = $ReadabilityData['title']; // This works fine.
    $content = $ReadabilityData['word_count'];
    echo "$content";

It says: Notice: Undefined index: lead_image_url in F:\wamp\www\inviteold\test2.php on line 13

1 answer

First, to use the REST API they provide, you need to create an account. You can then generate your own token to use in the call. The token shown in the examples will not work because it is intentionally invalid; it is there for illustration only.
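Once you have a valid token, note that the second attempt in the question has two further problems unrelated to the 403: `if ($response)` returns false on success rather than on failure, and json_decode() returns an object by default, so array access on the result does not work. The following is only a minimal sketch, assuming the response body is JSON with a lead_image_url field as in the question. `YOUR_TOKEN` is a placeholder, and the function names extractLeadImage() and getLeadImage() are made up for this example.

```php
<?php
// Hypothetical placeholder: replace with the token from your own account.
const READABILITY_TOKEN = 'YOUR_TOKEN';
// Endpoint copied from the question.
const READABILITY_API_URL = 'https://www.readability.com/api/content/v1/parser?url=%s&token=%s';

/**
 * Pull lead_image_url out of a raw JSON response body.
 * Returns false when the body is not valid JSON or has no lead image.
 */
function extractLeadImage($body)
{
    // true => decode to an associative array so $json['key'] works
    $json = json_decode($body, true);
    if (!is_array($json) || empty($json['lead_image_url'])) {
        return false;
    }
    return $json['lead_image_url'];
}

/**
 * Call the parser endpoint for $pageUrl and return the lead image URL, or false.
 */
function getLeadImage($pageUrl)
{
    $requestUrl = sprintf(READABILITY_API_URL, urlencode($pageUrl), READABILITY_TOKEN);
    // @ silences the warning; failure is reported through the return value instead
    $body = @file_get_contents($requestUrl);
    if ($body === false) {
        return false;
    }
    return extractLeadImage($body);
}
```

Keeping the JSON handling in its own function means it can be checked against a sample response body without making a network request.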

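Second, make sure the allow_url_fopen directive in your php.ini is enabled (allow_url_fopen = On). This directive can only be set at the system level (php.ini or the server configuration), so calling ini_set('allow_url_fopen', true) at the top of the page does not enable it. If you cannot modify php.ini (for example on shared hosting), you will need another HTTP client such as the cURL extension.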

Finally, to pick out the images yourself, you will need to extract all the image elements from the DOM of the page you retrieve. Some pages will contain images and some will not; it depends on the page you are pulling. You will also need to resolve relative paths ...

Your code

    require 'readability/lib/Readability.inc.php';

    $url = 'http://www.nextbigwhat.com';
    $html = file_get_contents($url);

    $Readability = new Readability($html); // default charset is utf-8
    $ReadabilityData = $Readability->getContent();

    $image = $ReadabilityData['lead_image_url'];
    $title = $ReadabilityData['title']; // This works fine.
    $content = $ReadabilityData['word_count'];
    echo "$content";

After you run Readability, you can use the DOMDocument class to retrieve the images from the content you pulled. Create a new DOMDocument and load your HTML into it. Be sure to call libxml_use_internal_errors(true) first to suppress the parser errors that the markup of most websites triggers. We will put this in a function so it is easy to reuse elsewhere.

    function sampleDomMedia($html) {
        // Suppress validator errors
        libxml_use_internal_errors(true);
        // New document
        $dom = new DOMDocument();
        // Populate document
        $dom->loadHTML($html);
        //[...]

Now you can get all the image elements from the document you created and read the src attribute of each one, like this:

        //[...]
        // Get image elements
        $nodeList = $dom->getElementsByTagName('img');
        // Get length
        $length = $nodeList->length;
        // Initialize array
        $images = array();
        // Iterate over our nodes
        for($i=0;$i<$length;$i++) {
            // Get the current node
            $node = $nodeList->item($i);
            // Retrieve the src attribute
            $image = $node->getAttribute('src');
            // Push image src into $images array
            array_push($images,$image);
        }
        return $images;
    }
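To try the extraction step on its own, the same DOMDocument calls can be run against a small inline HTML string. This is only a sketch: the markup below is made up for the example, and foreach is used in place of the index-based loop because DOMNodeList can be iterated directly.

```php
<?php
// Sample markup standing in for a downloaded page (made up for this example).
$html = '<html><body><p><img src="/logo.png" alt=""><img src="photos/cat.jpg"></p></body></html>';

// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
// Discard whatever warnings were collected
libxml_clear_errors();

$images = array();
// DOMNodeList is traversable, so foreach can replace the for loop
foreach ($dom->getElementsByTagName('img') as $node) {
    $images[] = $node->getAttribute('src');
}

print_r($images); // contains "/logo.png" and "photos/cat.jpg", in document order
```

Both src values come back exactly as written in the markup, which is why they still need to be resolved against the page URL.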

You now have an array of images that you can present to the user. But before you do that, there is one more thing to take care of: we want to resolve all the relative paths so that we always have an absolute URL for an image that lives on another site.

To do this, we need to determine the base URL of the site (scheme, host, and port) and the URL of the directory that contains the page we are working with. We can do this using the parse_url() function provided by PHP. For simplicity, we can turn this into a function.

    function getUrls($url) {
        // Parse URL
        $urlArr = parse_url($url);
        // Determine Base URL, with scheme, host, and port
        $base = $urlArr['scheme']."://".$urlArr['host'];
        if(array_key_exists("port",$urlArr) && $urlArr['port'] != 80) {
            $base .= ":".$urlArr['port'];
        }
        // A URL such as http://www.nextbigwhat.com has no path component; treat it as "/"
        $path = isset($urlArr['path']) ? $urlArr['path'] : "/";
        // Truncate the path using the position of the last forward slash
        $relative = $base.substr($path, 0, strrpos($path,"/")+1);
        // Return our two URLs
        return array($base, $relative);
    }
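To see what parse_url() hands back for getUrls() to work with, it can help to dump the result for a couple of URLs. The example.com URL below is made up for illustration.

```php
<?php
// A URL with an explicit port, a path, and a query string
$parts = parse_url('http://example.com:8080/blog/2011/post.html?page=2');
print_r($parts); // keys present: scheme, host, port, path, query

// A URL with nothing after the host has no 'path' key at all
$bare = parse_url('http://www.nextbigwhat.com');
var_dump(isset($bare['path'])); // bool(false)
```

The second case is worth keeping in mind when reading $urlArr['path'], because the URL used in the question has no path component.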

Add an extra $url parameter to the original sampleDomMedia function so that it can call getUrls() to get our two paths. Then we can check the value of each src attribute to determine which kind of path it is and resolve it.

    function sampleDomMedia($html, $url) {
        // Retrieve our URLs
        list($baseUrl, $relativeUrl) = getUrls($url);
        libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        $nodeList = $dom->getElementsByTagName('img');
        $length = $nodeList->length;
        $images = array();
        for($i=0;$i<$length;$i++) {
            $node = $nodeList->item($i);
            $image = $node->getAttribute('src');
            // Resolve relative paths
            if(substr($image,0,2)=="//") {
                // Missing protocol
                $image = "http:".$image;
            } else if(substr($image,0,1)=="/") {
                // Path relative to the base URL
                $image = $baseUrl.$image;
            } else if(substr($image,0,4)!=="http") {
                // Path relative to the current page's directory
                $image = $relativeUrl.$image;
            }
            array_push($images,$image);
        }
        return $images;
    }
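The three branches are easier to follow with concrete values. The helper below, resolveImageSrc(), is a hypothetical name used only for this illustration; it repeats the same checks as sampleDomMedia() for a single src value, and the example.com URLs are made up. Like the function above, it does not collapse "../" segments or treat data: URIs specially.

```php
<?php
// Hypothetical helper mirroring the three branches of sampleDomMedia().
function resolveImageSrc($src, $baseUrl, $relativeUrl)
{
    if (substr($src, 0, 2) === '//') {
        // Protocol-relative URL: prepend a scheme
        return 'http:' . $src;
    } elseif (substr($src, 0, 1) === '/') {
        // Root-relative path: prepend scheme, host, and port
        return $baseUrl . $src;
    } elseif (substr($src, 0, 4) !== 'http') {
        // Document-relative path: prepend the page's directory
        return $relativeUrl . $src;
    }
    // Already absolute
    return $src;
}

$base = 'http://example.com';
$relative = 'http://example.com/blog/';

echo resolveImageSrc('//cdn.example.com/a.png', $base, $relative), "\n";     // http://cdn.example.com/a.png
echo resolveImageSrc('/img/b.png', $base, $relative), "\n";                  // http://example.com/img/b.png
echo resolveImageSrc('c.png', $base, $relative), "\n";                       // http://example.com/blog/c.png
echo resolveImageSrc('https://other.example/d.png', $base, $relative), "\n"; // https://other.example/d.png
```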

Last but not least, putting it all together, we are left with the two functions above and this piece of procedural code:

    require 'readability/lib/Readability.inc.php';

    $url = 'http://www.nextbigwhat.com';
    $html = file_get_contents($url);

    $Readability = new Readability($html); // default charset is utf-8
    $ReadabilityData = $Readability->getContent();

    $image = $ReadabilityData['lead_image_url'];
    $images = sampleDomMedia($html, $url);
    $title = $ReadabilityData['title']; // This works fine.
    $content = $ReadabilityData['word_count'];
    echo "$content";

In addition, if you think the article content itself may contain an image (usually it does not), you can pass the content returned by Readability instead of the $html variable, for example:

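    $title = $ReadabilityData['title']; // This works fine.
    // Note: sampleDomMedia() expects HTML, so $content must hold the extracted
    // article markup returned by Readability, not the word count.
    $content = $ReadabilityData['word_count'];
    $images = sampleDomMedia($content, $url);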

I hope this helps.


Source: https://habr.com/ru/post/921688/
