I suggest using Scrapy, a powerful scraping framework built on Twisted and lxml. It is particularly well suited to the tasks you describe: it uses regular-expression rules to follow links, and it lets you extract data from HTML with either regular expressions or XPath expressions. It also provides what it calls "pipelines" for exporting the scraped data to wherever you need it.
Scrapy does not ship with a built-in MySQL pipeline, but someone has written one here that you can adapt to create your own.
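If you end up writing your own, a pipeline is just a class with a `process_item` method, plus optional `open_spider` and `close_spider` hooks. The following is only a rough sketch, not the pipeline linked above: the class name `SQLStorePipeline`, the `pages` table, and its `url` and `title` columns are made up for illustration. It accepts any DB-API 2.0 connection, so for MySQL you would pass a callable that opens a connection with whichever MySQL driver you use.

```python
import sqlite3  # stdlib DB-API driver, used here only to try the pipeline locally


class SQLStorePipeline:
    """Scrapy-style item pipeline that inserts each item into a SQL table."""

    def __init__(self, connect, placeholder="%s"):
        # connect: zero-argument callable returning a DB-API 2.0 connection
        # (an assumption of this sketch, not a Scrapy convention).
        # placeholder: the driver's parameter marker; common MySQL drivers
        # use "%s", while sqlite3 uses "?".
        self.connect = connect
        self.placeholder = placeholder
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection.
        self.conn = self.connect()

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.conn.close()

    def process_item(self, item, spider):
        # Hypothetical table and columns; values are passed as parameters
        # so the driver handles escaping.
        sql = "INSERT INTO pages (url, title) VALUES ({0}, {0})".format(self.placeholder)
        cursor = self.conn.cursor()
        cursor.execute(sql, (item["url"], item["title"]))
        cursor.close()
        self.conn.commit()
        # Return the item so that later pipelines still receive it.
        return item
```

Scrapy instantiates pipelines itself, so in a real project you would supply the `connect` callable through a `from_crawler` class method that reads your connection details from the settings, and enable the class in the `ITEM_PIPELINES` setting. Passing `placeholder="?"` with an in-memory `sqlite3` connection is a quick way to try the class without a MySQL server.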