Python Scrapy - populate start_urls from mysql

Question

I am trying to populate start_urls from a SELECT on a MySQL table in spider.py. When I run "scrapy runspider spider.py", I get no output; it just finishes without errors.

I checked the SELECT query in a standalone Python script, and there start_urls was populated with the records from the MySQL table.

spider.py

    from scrapy.spider import BaseSpider
    from scrapy.selector import Selector
    import MySQLdb


    class ProductsSpider(BaseSpider):
        name = "Products"
        allowed_domains = ["test.com"]
        start_urls = []

        def parse(self, response):
            print self.start_urls

        def populate_start_urls(self, url):
            conn = MySQLdb.connect(
                user='user',
                passwd='password',
                db='scrapy',
                host='localhost',
                charset="utf8",
                use_unicode=True
            )
            cursor = conn.cursor()
            cursor.execute(
                'SELECT url FROM links;'
            )
            rows = cursor.fetchall()
            for row in rows:
                start_urls.append(row[0])
            conn.close()

+6

python mysql web-crawler scrapy

maryo Nov 21 '13 at 10:45

2 answers

Populate start_urls in __init__:

    def __init__(self):
        super(ProductsSpider, self).__init__()
        self.start_urls = get_start_urls()

Assuming get_start_urls() returns the urls.
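
The answer does not define get_start_urls(), so here is a minimal sketch of what such a helper might look like. The links table and url column come from the question; everything else is an assumption. In particular, the helper takes a DB-API connection as an argument (the answer calls it with none), and the demo uses the standard-library sqlite3 module as a stand-in so it runs without a MySQL server. With MySQLdb you would pass the connection returned by MySQLdb.connect(...) instead.

```python
import sqlite3


def get_start_urls(conn):
    """Return every URL stored in the links table as a list of strings.

    `conn` is any DB-API 2.0 connection (hypothetical signature; the
    answer above calls get_start_urls() without arguments).
    """
    cursor = conn.cursor()
    cursor.execute("SELECT url FROM links")
    # Each row is a 1-tuple, so take the first column.
    return [row[0] for row in cursor.fetchall()]


# Demo: an in-memory SQLite database standing in for the MySQL one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (url TEXT)")
conn.executemany(
    "INSERT INTO links (url) VALUES (?)",
    [("http://test.com/a",), ("http://test.com/b",)],
)
start_urls = get_start_urls(conn)
conn.close()
print(start_urls)
```

Note that this runs the query once, when the spider is constructed, and holds all URLs in memory, which is fine for a small table.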

+4

Biswanath Nov 21 '13 at 15:20

Shane evans · Accepted Answer · 2013-11-22T04:43:19+0000

A better approach is to override the start_requests method.

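It can query your database, as populate_start_urls does, and return a sequence of Request objects.

You just need to rename your populate_start_urls method to start_requests, drop its unused url parameter (Scrapy calls start_requests with no arguments), and change the loop so that it yields a request for each row: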

    for row in rows:
        yield self.make_requests_from_url(row[0])
