Web Scraping Methods Using PHP or Python

I need to scrape about 100 sites that are very similar in the content they provide.

My first question: is it possible to write one common script that scrapes all 100 websites, or do scraping methods require a separate script for each specific site? (Maybe a stupid question.) I suppose I am really asking which option is easier, since writing 100 different scripts, one per site, would be hard.

Second question: my main language is PHP, but after searching here on Stack Overflow, I found that one of the most popular modern scraping tools is "Beautiful Soup" in Python. Is it possible to call "Beautiful Soup" from PHP, or is it best to write the whole script in Python?

Any tips on how to proceed would be appreciated.

Sorry for the weak English.

Regards,

+3
4 answers

1.) One scraper for 100 sites? It depends on your requirements. If you need specific pieces of information, you will have to account for 100 different sites and their layouts. However, some common functionality can be shared between them.

2.) BeautifulSoup is an HTML/XML parser, not a screen scraper per se, but it would be a good choice for this task if the scraper is written in Python. Calling Python from PHP can be done, but it is certainly not as simple as a single-language solution. That's why I suggest you take a look at Python and BeautifulSoup for a prototype.
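
For illustration, a minimal fetch-and-parse sketch with BeautifulSoup (assuming the requests and bs4 packages; the URL and the tags being extracted are placeholders, not taken from the question):

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL -- substitute one of the 100 target sites.
    html = requests.get("https://example.com/articles").text

    # BeautifulSoup is forgiving of messy real-world HTML.
    soup = BeautifulSoup(html, "html.parser")

    title = soup.title.get_text(strip=True) if soup.title else ""
    print(title)

    # The tag choice here is an assumption about the page layout.
    for link in soup.find_all("a", href=True):
        print(link["href"], link.get_text(strip=True))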

Sidenote: http://scrapy.org/ is another Python library, specially designed for screen scraping.
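
To give a feel for it, a hedged sketch of a minimal Scrapy spider (the spider name, start URL, and CSS selectors are all placeholder assumptions):

    import scrapy

    class SiteSpider(scrapy.Spider):
        # Name and start URL are placeholders.
        name = "site"
        start_urls = ["https://example.com/articles"]

        def parse(self, response):
            # Selectors are assumptions about the page layout.
            for article in response.css("div.article"):
                yield {
                    "title": article.css("h2::text").get(),
                    "url": article.css("a::attr(href)").get(),
                }

Scrapy also handles the crawling side (request scheduling, download delays), which BeautifulSoup alone does not.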

+2

If you know PHP rather than Python, take a look at phpQuery. With it, writing a scraper mostly comes down to finding the right CSS selectors (SelectorGadget is handy for that) and reading the matched elements with ->text(). A scraper for a single site written this way comes together very quickly. :D
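
To keep the code examples in one language, here is the same selector-driven workflow in Python: BeautifulSoup's select() takes CSS selectors much as phpQuery's ->find() does, and get_text() plays the role of ->text(). The selector below is a placeholder:

    from bs4 import BeautifulSoup

    html = "<div class='post'><h2>Hello</h2><p>Body text</p></div>"
    soup = BeautifulSoup(html, "html.parser")

    # A selector you might find with SelectorGadget (placeholder).
    for node in soup.select("div.post h2"):
        print(node.get_text(strip=True))  # analogous to ->text()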

+2

You have a few options.

Option 1: grep, sed, and awk. These can work for quick one-off jobs, but they break down fast. Option 2: regular expressions. The problem: HTML markup is not a regular language, so regexes are an unreliable way to parse it.

Option 3: PHP's XML/HTML parser, DOMDocument. It works and keeps everything in PHP. If you want to stay with PHP, PHPQuery (mentioned above) builds a friendlier, jQuery-like layer on top of it.

Option 4: Python with BeautifulSoup. BeautifulSoup copes well with the broken HTML you will find in the wild, and its documentation is good. Even if you don't know Python yet, BeautifulSoup alone can justify learning it.

Whether one script can cover everything depends on what you want to extract. If all you need is something every page has, say the body and the title, one script can handle all 100 sites; but if you need site-specific fields, the differing layouts and markup will force per-site logic, won't they? A sketch of the generic case follows.
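
A minimal sketch of that generic case (assuming bs4); it extracts only the title and the visible body text, which is about all a single script can safely assume across 100 different layouts:

    from bs4 import BeautifulSoup

    def extract_generic(html):
        """Pull out only what every page has: a title and body text."""
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else ""
        body = soup.body.get_text(" ", strip=True) if soup.body else ""
        return {"title": title, "body": body}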

0

I once wrote a feed scraper in Python with ElementTree: since RSS is well-formed XML, a strict parser is enough there. Most real-world HTML, though, is too broken for that approach.
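
A small sketch of that approach, assuming a valid RSS 2.0 feed (the URL is a placeholder):

    import urllib.request
    import xml.etree.ElementTree as ET

    # Placeholder feed URL.
    data = urllib.request.urlopen("https://example.com/feed.rss").read()

    root = ET.fromstring(data)
    # RSS 2.0 nests <item> elements under <channel>.
    for item in root.iter("item"):
        print(item.findtext("title"), "->", item.findtext("link"))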

If we are talking about 100 different sites, try writing an abstraction that works on most of them and transforms each page into a common data structure you can work with. Then override the parts of the abstraction that deal with the individual sites that deviate from the norm, as sketched below.
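
One way to structure that abstraction (the class and method names are mine, purely illustrative):

    from bs4 import BeautifulSoup

    class BaseScraper:
        """Default extraction that works for the majority of the sites."""

        def parse(self, html):
            soup = BeautifulSoup(html, "html.parser")
            return {"title": self.title(soup), "body": self.body(soup)}

        def title(self, soup):
            return soup.title.get_text(strip=True) if soup.title else ""

        def body(self, soup):
            return soup.body.get_text(" ", strip=True) if soup.body else ""

    class OddSiteScraper(BaseScraper):
        """Override only what differs on a nonconforming site."""

        def title(self, soup):
            node = soup.find("h1", class_="headline")  # assumed layout
            return node.get_text(strip=True) if node else ""

Each of the 100 sites then maps to either BaseScraper or a small subclass, instead of 100 unrelated scripts.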

Scrapers are usually I/O-bound, so look at coroutine libraries such as eventlet or gevent to run the network I/O in parallel and speed up the whole process.
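
A hedged sketch with gevent (monkey-patching makes the standard library's sockets cooperative; the URLs are placeholders):

    from gevent import monkey
    monkey.patch_all()  # must run before urllib is imported

    import gevent
    import urllib.request

    def fetch(url):
        # Each fetch yields to the others while waiting on the network.
        return url, urllib.request.urlopen(url).read()

    urls = ["https://example.com/a", "https://example.com/b"]  # placeholders
    jobs = [gevent.spawn(fetch, u) for u in urls]
    gevent.joinall(jobs)
    for job in jobs:
        url, body = job.value
        print(url, len(body))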

0