PHP web scraping with Simple HTML DOM doesn't work when the page's HTML tags are mismatched

Question

I want to scrape some information from a web page. It uses a table-based layout.

I want to extract the third table inside the outer layout table, which contains a series of nested tables, each publishing one result. But the code does not work:

    include('simple_html_dom.php');

    $url = 'http://exams.keralauniversity.ac.in/Login/index.php?reslt=1';
    $html = file_get_contents($url);
    $result = $html->find("table", 2);
    echo $result;

I used cURL to retrieve the page, but the problem is that its tags are mismatched, so it cannot be parsed with Simple HTML DOM.

    function curl($url) {
        $ch = curl_init();                               // Initialising cURL
        curl_setopt($ch, CURLOPT_URL, $url);             // Setting cURL URL option with the $url variable passed into the function
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);  // Setting cURL option to return the webpage data
        $data = curl_exec($ch);                          // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch);                                 // Closing cURL
        return $data;                                    // Returning the data from the function
    }

    function scrape_between($data, $start, $end) {
        $data = stristr($data, $start);         // Stripping all data from before $start
        $data = substr($data, strlen($start));  // Stripping $start
        $stop = stripos($data, $end);           // Getting the position of the $end of the data to scrape
        $data = substr($data, 0, $stop);        // Stripping all data from after and including the $end of the data to scrape
        return $data;                           // Returning the scraped data from the function
    }

    // Executing our curl function to fetch the page at $url and return the result into the $scraped_page variable
    $scraped_page = curl($url);
    $scraped_data = scrape_between($scraped_page, ' </html>', '</table></td><td></td></tr> </table>');
    echo $scraped_data;

    $myfile = fopen("newfile.html", "w") or die("Unable to open file!");
    fwrite($myfile, $scraped_data);
    fclose($myfile);
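To see what scrape_between returns, here is a minimal sketch that runs the same logic on a small inline string. The sample markup and the '<td>' and '</td>' markers are assumptions chosen for illustration, not the real page.

```php
<?php
// Same logic as scrape_between above: return the text found between
// $start and $end (both matched case-insensitively).
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);         // keep everything from $start onwards
    $data = substr($data, strlen($start));  // drop $start itself
    $stop = stripos($data, $end);           // position of the first $end
    return substr($data, 0, $stop);         // keep only what precedes $end
}

// Hypothetical sample markup, not taken from the real site.
$sample  = '<table><tr><td>Result 1</td></tr></table>';
$scraped = scrape_between($sample, '<td>', '</td>');

echo $scraped, "\n"; // prints "Result 1"
```

This only works when both markers occur in the page exactly as written (apart from letter case), so it tends to be fragile on markup that changes.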

How can I scrape the results and save the PDFs?

+5

php web-scraping simple-html-dom

Sachin divakar Nov 02 '15 at 9:20

2 answers

Here is some sample code:

    <?php
    // Defining the basic cURL function
    function curl($url) {
        $ch = curl_init();                               // Initialising cURL
        curl_setopt($ch, CURLOPT_URL, $url);             // Setting cURL URL option with the $url variable passed into the function
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);  // Setting cURL option to return the webpage data
        $data = curl_exec($ch);                          // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch);                                 // Closing cURL
        return $data;                                    // Returning the data from the function
    }

    // Executing our curl function to scrape the webpage http://www.example.com
    // and return the results into the $scraped_website variable
    $scraped_website = curl("http://www.example.com");
    $result = substr($scraped_website, 11, 7); // change values 11,7 for table
    echo $result;
    ?>
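As a small illustration of how the offset and length arguments of PHP's built-in substr select part of a string, here is a sketch on a made-up input (an assumption, not the real page):

```php
<?php
// Hypothetical sample string, not the real page.
$page = '<html><body><table>';

// substr($string, $offset, $length): take 7 characters starting at offset 12.
$piece = substr($page, 12, 7);

echo $piece, "\n"; // prints "<table>"
```

Fixed offsets like these stop matching as soon as the page content shifts by a single character, so the values have to be adjusted for each page.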

+1

L ananta prasad Nov 02 '15 at 9:28

pguardiario · Accepted Answer · 2015-11-21T22:19:21+0000

Simple HTML DOM cannot handle this HTML. So first switch to the Advanced HTML DOM library, then do:

    require_once('advanced_html_dom.php');

    $dom = file_get_html('http://exams.keralauniversity.ac.in/Login/index.php?reslt=1');
    $rows = array();

    foreach ($dom->find('tr.Function_Text_Normal:has(td[3])') as $tr) {
        $row['num']  = $tr->find('td[2]', 0)->text;
        $row['text'] = $tr->find('td[3]', 0)->text;
        $row['pdf']  = $tr->find('td[3] a', 0)->href;

        if (preg_match_all('/\d+/', $tr->parent->find('u', 0)->text, $m)) {
            list($row['day'], $row['month'], $row['year']) = $m[0];
        }

        // uncomment next 2 lines to save the pdf
        // $filename = preg_replace('/.*\//', '', $row['pdf']);
        // file_put_contents($filename, file_get_contents($row['pdf']));

        $rows[] = $row;
    }

    var_dump($rows);
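The two string-handling steps in the loop can be tried on their own, without the scraping library. A minimal sketch, assuming a made-up PDF URL and date text in place of the values scraped from the page (the real page's date format may differ):

```php
<?php
// Hypothetical values standing in for one scraped row; not real data from the site.
$pdfUrl   = 'http://www.example.com/results/2015/result_123.pdf';
$dateText = 'Published on 02/11/2015';

// Greedy '.*\/' matches everything up to and including the last slash,
// so only the file name is left.
$filename = preg_replace('/.*\//', '', $pdfUrl);

// Collect every run of digits and assign the first three to day, month, year.
$row = array();
if (preg_match_all('/\d+/', $dateText, $m)) {
    list($row['day'], $row['month'], $row['year']) = $m[0];
}

echo $filename, "\n";                                   // prints "result_123.pdf"
echo $row['day'], '-', $row['month'], '-', $row['year'], "\n"; // prints "02-11-2015"
```

The list() assignment assumes the text contains at least three numbers in day, month, year order; if it contains fewer, the missing entries are left unset.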
