Are there other ways to implement a visual web scraper besides loading data inside a local iframe?

I saw a video for Portia and have been thinking about how to implement such a tool. Basically, it is a web application into which you enter a URL; the page loads as it would in a separate browser tab, and you can then click on elements of the page to visually select the data you want to extract.

The idea I have now is this:

  • fetch the website content using a headless browser
  • have a route in the web app that serves the scraped content
  • embed that route in an iframe on the data selection page to get around the same-origin policy
  • integrate a JavaScript element inspector library to visually mark the elements to scrape
  • create a set of selectors
  • use selectors to retrieve data
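
Steps 2 and 3 of the list above could be sketched roughly as follows. This is only an illustration in Python: the function name prepare_for_iframe and the script path /static/inspector.js are hypothetical, and injecting a <base> tag (so that relative links still resolve against the original site) is an assumption, not something the question prescribes.

```python
import re
from html import escape


def prepare_for_iframe(html: str, page_url: str, inspector_src: str) -> str:
    """Inject a <base> tag and an inspector <script> right after <head>.

    The result is meant to be returned by a route of your own web app, so
    the iframe content shares the origin of the selection page.
    """
    injected = (
        f'<base href="{escape(page_url, quote=True)}">'
        f'<script src="{escape(inspector_src, quote=True)}"></script>'
    )
    head = re.search(r"<head\b[^>]*>", html, flags=re.IGNORECASE)
    if head:
        return html[:head.end()] + injected + html[head.end():]
    # No <head> tag found: prepend the tags so the browser sees them first.
    return injected + html


if __name__ == "__main__":
    page = "<html><head><title>t</title></head><body><p>42</p></body></html>"
    print(prepare_for_iframe(page, "https://example.com/a/", "/static/inspector.js"))
```

The HTML passed in would come from the headless browser of step 1; fetching it is left out here.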

I am wondering whether there are other approaches to this, especially for steps 1 to 3.

+4
2 answers

Note that the elements you want to scrape are probably not interactive (for example, they do not respond to clicks or key presses).

Even if they are, they probably do not handle modifier keys such as Ctrl or Shift.

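For example, if the IFRAME is served from www.your-scraper.com, then www.site-to-scrape.com could be assigned a key such as dab3b19f, and dab3b19f.your-scraper.com would serve as a proxy for www.site-to-scrape.com on which Ctrl-Click is intercepted (?).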
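
The mapping between a key such as dab3b19f and the proxied site could be kept in a small registry. The sketch below is an assumption about how that might look: the ProxyRegistry class and its methods are hypothetical, keys are generated randomly (the answer does not say how dab3b19f was derived), and the actual proxying and script injection are left out.

```python
import secrets
from typing import Dict, Optional

SCRAPER_DOMAIN = "your-scraper.com"  # domain taken from the example hostnames


class ProxyRegistry:
    """Maps <key>.your-scraper.com hostnames to the sites they proxy."""

    def __init__(self) -> None:
        self._targets: Dict[str, str] = {}

    def register(self, target_host: str) -> str:
        """Assign a random 8-hex-digit key and return the proxy hostname."""
        key = secrets.token_hex(4)  # 4 random bytes -> 8 hex characters
        self._targets[key] = target_host
        return f"{key}.{SCRAPER_DOMAIN}"

    def resolve(self, request_host: str) -> Optional[str]:
        """Return the proxied site for a request host, or None if unknown."""
        suffix = "." + SCRAPER_DOMAIN
        if not request_host.endswith(suffix):
            return None
        key = request_host[: -len(suffix)]
        return self._targets.get(key)


if __name__ == "__main__":
    registry = ProxyRegistry()
    proxy_host = registry.register("www.site-to-scrape.com")
    print(proxy_host, "->", registry.resolve(proxy_host))
```

A real request handler would look up the Host header with resolve() and forward the request to the returned site.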

The recorded steps could then be saved as a script, for example:

start
click        #menu ul[2] li[1] span
click        .right.sidebar[1] ul[1] li[5] input[type="checkbox"]
click        .right.sidebar[1] ul[1] li[5] button
scrape(TICK) #prices div div[2] div div span p
scrape(PRIC) #prices div div[2] div div span div span[2] p
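
A script in this format is easy to read back. The following is a minimal sketch of a parser, assuming each line is an action name, an optional label in parentheses, and an optional selector; the names Step and parse_script are hypothetical. The selectors are kept as opaque strings, since notation such as ul[2] is the script's own and not standard CSS.

```python
import re
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    action: str           # e.g. "start", "click", "scrape"
    label: Optional[str]  # e.g. "TICK" for scrape(TICK), otherwise None
    selector: str         # selector text, empty for actions without one


_LINE = re.compile(
    r"^(?P<action>[a-z]+)(?:\((?P<label>\w+)\))?(?:\s+(?P<selector>.+))?$"
)


def parse_script(text: str) -> List[Step]:
    """Parse a recorded script into a list of steps, skipping blank lines."""
    steps = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        selector = (match["selector"] or "").strip()
        steps.append(Step(match["action"], match["label"], selector))
    return steps


if __name__ == "__main__":
    script = """start
click        #menu ul[2] li[1] span
scrape(TICK) #prices div div[2] div div span p
"""
    for step in parse_script(script):
        print(step)
```

Each Step could then be handed to whatever drives the browser during the actual scraping run.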

Another option is Selenium.

+2

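One option is to do the selection in the browser itself, for example in Chrome, and to send the page URL together with the CSS or XPath selector of each chosen element to the server.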

See the selectorgadget library.

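To build a selector, you can walk up from the selected element toward the root (html, body) or to the nearest ancestor with an id, much as FireBug does with "Copy XPath" and "Copy CSS Path".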
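
Something in the spirit of "Copy CSS Path" can be approximated by walking from the chosen element up to the root, stopping early at the first element that has an id. The sketch below is an assumption about how that might be done, not FireBug's actual algorithm; css_path is a hypothetical name, and it uses Python's xml.etree.ElementTree, so it only handles well-formed markup.

```python
import xml.etree.ElementTree as ET


def css_path(root: ET.Element, target: ET.Element) -> str:
    """Build a CSS selector for target by walking up toward root."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    parts = []
    node = target
    while node is not None:
        if node.get("id"):
            # An id is assumed to be unique, so the path can start here.
            parts.append(f"#{node.get('id')}")
            break
        parent = parents.get(node)
        if parent is None:
            parts.append(node.tag)  # reached the root element
            break
        same_tag = [c for c in parent if c.tag == node.tag]
        if len(same_tag) > 1:
            # Disambiguate among siblings that share the same tag.
            position = same_tag.index(node) + 1
            parts.append(f"{node.tag}:nth-of-type({position})")
        else:
            parts.append(node.tag)
        node = parent
    return " > ".join(reversed(parts))


if __name__ == "__main__":
    doc = ET.fromstring(
        '<html><body><div id="prices"><div><p>a</p><p>b</p></div></div></body></html>'
    )
    second_p = doc.findall(".//p")[1]
    print(css_path(doc, second_p))
```

In a browser the same walk would be done in JavaScript on the live DOM; the Python version only illustrates the logic.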

+2

Source: https://habr.com/ru/post/1653350/

