PHP Web Scraper

I am looking for a way to build a small preview of another page from a user-defined URL in PHP.

I would like to get only the page title, an image (for example, the site logo), and some text or a description, if available. Is there an easy way to do this without any external libraries or classes? Thanks.

So far I have been trying to use the DOMDocument class to load the HTML and display it on the screen, but I don't think that is the right way to do this.

+9
4 answers

I recommend you consider simple_html_dom. It makes this very easy.

Here is a working example of how to pull out the title and the first image.

<?php
require 'simple_html_dom.php';

// Fetch and parse the page.
$html = file_get_html('http://www.google.com/');

// Index 0 selects the first matching element.
$title = $html->find('title', 0);
$image = $html->find('img', 0);

echo $title->plaintext."<br>\n";
echo $image->src;
?>

Here is a second example that does the same without an external library. I should note that using regex on HTML is NOT a good idea.

<?php
$data = file_get_contents('http://www.google.com/');

// Capture the text between <title> and </title>.
preg_match('/<title>([^<]+)<\/title>/i', $data, $matches);
$title = $matches[1] ?? '';

// Capture the src attribute of the first <img> tag.
preg_match('/<img[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches);
$img = $matches[1] ?? '';

echo $title."<br>\n";
echo $img;
?>
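
For comparison, the same preview can probably be built with PHP's bundled DOMDocument and DOMXPath classes, which avoids both the external library and the regex. This is only a minimal sketch: the function name extractPreview() is made up for this example, it works on an inline sample string instead of a downloaded page, and it assumes the description lives in a <meta name="description"> tag, which not every page provides.

```php
<?php
// Hypothetical helper (not part of any library): builds a small preview
// from an HTML string using only the bundled DOM extension.
function extractPreview(string $html): array
{
    $empty = ['title' => '', 'description' => '', 'image' => ''];
    if (trim($html) === '') {
        return $empty; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // string(...) yields the first matching node's text, or '' when nothing matches.
    return [
        'title'       => trim($xpath->evaluate('string(//title)')),
        'description' => trim($xpath->evaluate('string(//meta[@name="description"]/@content)')),
        'image'       => trim($xpath->evaluate('string(//img/@src)')),
    ];
}

// Sample input standing in for a downloaded page (assumption for this sketch).
$sample = '<html><head><title>Example page</title>'
    . '<meta name="description" content="A short description.">'
    . '</head><body><img src="/logo.png" alt="Logo"></body></html>';

$preview = extractPreview($sample);
echo $preview['title'] . "\n";
echo $preview['description'] . "\n";
echo $preview['image'] . "\n";
```

With a real page you would pass in the result of file_get_contents($url) after checking that it is not false. Note that the src value may be a relative path, so it might need to be resolved against the page URL before you display it.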
+17

You can use SimpleHtmlDom for this, and then find the title and img tags or whatever else you need.

+2

You can use any of these libraries. Each has its own pros and cons, so read the notes on each one or take the time to try them yourself:

  • Guzzle: A standalone HTTP client with a consistent interface; it can send requests through cURL or PHP streams.
  • Goutte: A scraper built on Guzzle and several Symfony components, written by the creator of Symfony.
  • hQuery: A fast scraper with caching capabilities; it performs well when parsing large documents.
  • Requests: A simple HTTP client known for its ease of use.
  • Buzz: A lightweight HTTP client, well suited to beginners.
  • ReactPHP: An event-driven, non-blocking I/O library that can be used for asynchronous scraping, with detailed tutorials and examples.

It’s best to try them out and pick the one that fits your needs.

0

How can I get only the following items from this site: https://www.moneycontrol.com/india/stockpricequote/cement-major/rainindustries/RC12

Consolidated:

  • MARKET CAP (RS CR): 4,493.58
  • P/E: 4.37
  • BOOK VALUE (RS): 120.28
  • DIV (%): 100.00%
  • MARKET LOT: 1
  • INDUSTRY P/E: 35.89
  • EPS (TTM): 30.56
  • P/C: 2.89
  • PRICE/BOOK: 1.11
  • DIV YIELD (%): 1.50%
  • FACE VALUE (RS): 2.00
  • DELIVERABLES (%): 48.65

* Note - Trailing EPS is displayed only when the latest results for 4 quarters are available.
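
One possible approach to label/value data like this is to walk the rows with DOMXPath and collect each pair into an array. The sketch below is built entirely on assumptions: the markup of the real page was not inspected, so the sample HTML, the class names row, label and value, and the function name extractStats() are all hypothetical and would have to be adapted to whatever the page actually contains.

```php
<?php
// Hypothetical markup: the structure of the real page is unknown,
// so this sample only illustrates the technique.
$html = <<<HTML
<div class="stats">
  <div class="row"><span class="label">MARKET CAP (Rs Cr)</span><span class="value">4,493.58</span></div>
  <div class="row"><span class="label">P/E</span><span class="value">4.37</span></div>
  <div class="row"><span class="label">BOOK VALUE (Rs)</span><span class="value">120.28</span></div>
</div>
HTML;

// Hypothetical helper: returns an array mapping each label to its value.
function extractStats(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $stats = [];

    // Visit every assumed "row" element and read its label and value relative to it.
    foreach ($xpath->query('//div[@class="row"]') as $row) {
        $label = trim($xpath->evaluate('string(.//span[@class="label"])', $row));
        $value = trim($xpath->evaluate('string(.//span[@class="value"])', $row));
        if ($label !== '') {
            $stats[$label] = $value;
        }
    }

    return $stats;
}

$stats = extractStats($html);
foreach ($stats as $label => $value) {
    echo $label . ': ' . $value . "\n";
}
```

If the page fills in these numbers with JavaScript after loading, the values will not be present in the downloaded HTML at all, and this approach would not find them.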

0

Source: https://habr.com/ru/post/911360/

