I need to scrape about 100 sites whose content is very similar.
First question: is it possible to write one common script that scrapes all 100 sites, or can scraping scripts only be written for specific sites? (Stupid question.) I think I should rather ask which approach is easier, because writing 100 different scripts, one per site, is a lot of work.
Second question: my main language is PHP, but after searching here on Stack Overflow I found that one of the most advanced scrapers is "Beautiful Soup" in Python. Is it possible to call Beautiful Soup in Python from PHP, or is it better to write the whole script in Python?
Please give me some tips on how to proceed.
Sorry for my weak English.
Regards,
1.) One scraper for 100 sites? It depends on your requirements. If you need only specific pieces of information, you will have to look at each of the 100 sites and its layout. Some common functionality can still be shared, though.
2.) BeautifulSoup is an HTML/XML parser, not a screen scraper per se. It would be the best choice for the task if the scraper is written in Python. Calling Python from PHP can be done, but it is certainly not as clean as a single-language solution. That's why I suggest you take a look at Python and BeautifulSoup for a prototype.
Sidenote: http://scrapy.org/ is another Python library, designed specifically for crawling websites and extracting data from them.
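To illustrate the "calling Python from PHP" route from point 2, here is a minimal sketch of the Python half. It is only an assumption about how the two sides could talk: the function name `extract_as_json` is made up for this example, and the standard library's `html.parser` is used in place of BeautifulSoup so the snippet runs without installing anything. The idea is that the Python side returns plain JSON, which any language can read.

```python
import json
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collects the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without a target
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_as_json(html):
    """Hypothetical helper: parse one page and return the result as a JSON string."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return json.dumps({"title": parser.title.strip(), "links": parser.links})
```

A small script that prints `extract_as_json(sys.stdin.read())` could then be started from PHP (for example with `shell_exec`) and its output decoded with `json_decode`. That extra process boundary is the part that makes the two-language setup less clean than doing everything in Python.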
In PHP, phpQuery works for this: build the scraper around CSS selectors (SelectorGadget helps find them) and read the matched text with ->text().
The options are:
1: grep, sed and awk.
2: regex.
3: PHP with its XML/HTML parser, DomDocument, or with phpQuery.
4: Python with BeautifulSoup.
What should the script extract from each page: only the body and title, or more?
If the sites provide RSS feeds, Python's ElementTree can parse the RSS rather than the HTML.
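For sites that publish an RSS feed, the ElementTree route could look roughly like the sketch below. The function name `parse_rss` and the sample feed are invented for illustration, and the code assumes a plain RSS 2.0 layout with `<item>` elements that carry `<title>` and `<link>`.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Return one {'title', 'link'} dict per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the child is missing, hence the "or"
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items


# Made-up feed, only here so the sketch can be run as-is.
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First post</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

for entry in parse_rss(SAMPLE):
    print(entry["title"], entry["link"])
```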
If we are talking about 100 different sites, try writing an abstraction that works for most of them and transforms each page into a common data structure that you can work with. Then override parts of the abstraction to handle the individual sites that deviate from the norm.
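One way to read this advice is a base class with sensible defaults plus small subclasses for the odd sites. The sketch below is only an illustration under that assumption: every class name, the `odd.example` host and the " | OddSite" suffix are invented, and the standard library's `html.parser` stands in for whichever parser you end up using.

```python
from html.parser import HTMLParser


class _TagText(HTMLParser):
    """Collects the text inside the first occurrence of each requested tag."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}
        self._open = None

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted and tag not in self.found:
            self._open = tag
            self.found[tag] = ""

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open:
            self.found[self._open] += data


class BaseScraper:
    """Defaults assumed to fit most sites: title in <title>, body in the first <p>."""

    title_tag = "title"
    body_tag = "p"

    def parse(self, html):
        parser = _TagText([self.title_tag, self.body_tag])
        parser.feed(html)
        # Every site ends up in the same data structure.
        return {
            "title": self.clean(parser.found.get(self.title_tag, "")),
            "body": self.clean(parser.found.get(self.body_tag, "")),
        }

    def clean(self, text):
        return " ".join(text.split())  # collapse runs of whitespace


class OddSiteScraper(BaseScraper):
    """A site that deviates: headline in <h1>, with a site name appended."""

    title_tag = "h1"
    SUFFIX = " | OddSite"

    def clean(self, text):
        text = super().clean(text)
        return text[:-len(self.SUFFIX)] if text.endswith(self.SUFFIX) else text


DEFAULT = BaseScraper()
SCRAPERS = {"odd.example": OddSiteScraper()}  # only the exceptions are listed


def scraper_for(host):
    """Pick the site-specific scraper, falling back to the shared default."""
    return SCRAPERS.get(host, DEFAULT)
```

With this shape, a new site costs nothing if the defaults fit, and only a few overridden attributes or methods if they do not.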
Scrapers are usually I/O-bound, so look at coroutine libraries such as eventlet or gevent to run the I/O in parallel and speed up the whole process.
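As a rough illustration of overlapping the network waits, here is a sketch that uses the standard library's `ThreadPoolExecutor` rather than eventlet or gevent, so it is a swapped-in technique and not the coroutine approach itself. The helper names `fetch` and `fetch_all` and the default of 20 workers are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch, max_workers=20):
    """Fetch many URLs concurrently.

    Returns {url: bytes}, or {url: exception} for the pages that failed.
    """
    urls = list(urls)

    def safe(url):
        try:
            return fetcher(url)
        except Exception as exc:  # one failing site should not stop the rest
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as the input URLs
        return dict(zip(urls, pool.map(safe, urls)))
```

Because the threads spend most of their time waiting on the network, even this simple pool should cut the total run time for 100 sites considerably. The `fetcher` parameter also makes it easy to plug in a fake downloader for testing.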
Source: https://habr.com/ru/post/1783237/