Python web crawler with MySQL database

I want to create or find an open source web crawler (spider / bot) written in Python. It must find and follow links, collect meta tags and meta descriptions, the title and the URL of each web page, and store all the data in a MySQL database.

Does anyone know of any open source scripts that can help me? Any guidance on how to approach this is also more than welcome.

+6
3 answers

Yes, I know of a few.

Libraries

https://github.com/djay/transmogrify.webcrawler

http://code.google.com/p/harvestman-crawler/

http://code.activestate.com/pypm/orchid/

Open source crawler

http://scrapy.org/

Tutorials

http://www.example-code.com/python/pythonspider.asp

P.S. I don't know whether they use MySQL, because Python projects usually use SQLite or PostgreSQL. If you want, you can use the libraries I listed above, import the MySQL-python module, and write the data to MySQL yourself :D

http://sourceforge.net/projects/mysql-python/

+4

I suggest you use Scrapy, which is a powerful scraping framework based on Twisted and lxml. It is particularly well suited to the tasks you want to perform: it uses regexp rules to follow links, and it lets you use regular expressions or XPath expressions to extract data from the HTML. It also provides what they call "pipelines" to dump the data to whatever destination you need.
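The link following and extraction described above can be sketched without any third-party dependency. The example below is not Scrapy code: it swaps in Python's standard `html.parser` and `urllib.parse` to show the same idea on a small scale. The names `PageParser`, `parse_page`, and `crawl`, and the `fetch` callable that returns a page's HTML, are hypothetical and chosen only for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect the title, named meta tags, and outgoing links of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.meta = {}    # e.g. {"description": "...", "keywords": "..."}
        self.links = []   # absolute URLs taken from <a href="...">
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the URL of the current page.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(url, html):
    """Return the fields the question asks for as a plain dict."""
    parser = PageParser(url)
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": parser.meta.get("keywords", ""),
        "links": parser.links,
    }


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch(url)` must return the page's HTML as a string."""
    seen = {start_url}
    queue = deque([start_url])
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        page = parse_page(url, fetch(url))
        pages.append(page)
        for link in page["links"]:
            if link not in seen:   # visit each URL at most once
                seen.add(link)
                queue.append(link)
    return pages
```

Fetching is deliberately left to the caller, so `fetch` could wrap `urllib.request.urlopen` or any HTTP client. A real crawler would also need politeness delays, robots.txt handling, and error handling, which a framework like Scrapy already takes care of.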

Scrapy does not provide a built-in MySQL pipeline, but someone has written one here, which you can use as a starting point for your own.
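As a rough idea of what such a pipeline might look like, here is a minimal sketch. It is not the pipeline linked above. It assumes the usual Scrapy item pipeline methods (`open_spider`, `process_item`, `close_spider`) and the MySQL-python driver (`MySQLdb`). The class name `MySQLPagePipeline`, the `pages` table, its columns, and the connection settings are all hypothetical.

```python
class MySQLPagePipeline:
    """Scrapy-style item pipeline that writes each crawled page to MySQL."""

    # Hypothetical schema: one row per URL.
    CREATE_SQL = (
        "CREATE TABLE IF NOT EXISTS pages ("
        " url VARCHAR(255) PRIMARY KEY,"
        " title TEXT, description TEXT, keywords TEXT)"
    )
    # REPLACE overwrites the existing row when a URL is crawled again.
    INSERT_SQL = (
        "REPLACE INTO pages (url, title, description, keywords)"
        " VALUES (%s, %s, %s, %s)"
    )

    def __init__(self, connect=None):
        # `connect` is a zero-argument callable returning a DB-API connection;
        # it exists so the pipeline can be tried out without a real server.
        self._connect = connect or self._default_connect
        self.conn = None

    @staticmethod
    def _default_connect():
        # Imported lazily so the class loads even if the driver is missing.
        import MySQLdb  # assumption: MySQL-python (or mysqlclient) is installed
        # Placeholder credentials; replace them with your own settings.
        return MySQLdb.connect(host="localhost", user="crawler",
                               passwd="secret", db="crawl", charset="utf8")

    def open_spider(self, spider):
        # Called once when the spider starts: connect and ensure the table exists.
        self.conn = self._connect()
        cur = self.conn.cursor()
        cur.execute(self.CREATE_SQL)
        self.conn.commit()
        cur.close()

    def process_item(self, item, spider):
        # Called for every scraped item; values are passed as parameters,
        # never formatted into the SQL string.
        cur = self.conn.cursor()
        cur.execute(self.INSERT_SQL, (
            item["url"],
            item.get("title", ""),
            item.get("description", ""),
            item.get("keywords", ""),
        ))
        self.conn.commit()
        cur.close()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

To use something like this in a Scrapy project, the class would be listed in the project's `ITEM_PIPELINES` setting. Committing after every item keeps the sketch simple; batching the commits would probably be faster on a large crawl.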

+4

Scrapy is a web crawling and scraping framework that you can extend to insert the selected data into a database.

It is like an inverse of the Django framework.

+3

Source: https://habr.com/ru/post/894757/

