I am a newbie.
This is my spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from ampa.items import AmpaItem

class AmpaSpider(CrawlSpider):
    name = "ampa"
    allowed_domains = ['website']
    start_urls = ['website/page']
    rules = (Rule(SgmlLinkExtractor(allow=('associados?',), deny=('associado/',)),
                  callback='parse_page', follow=True),)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = AmpaItem()
        farmers = hxs.select('//div[@class="span-24 tx_left"]')
        item['nome'] = farmers.select('//div/h3[@class="titulo"]/a/text()').extract()
        item['phone'] = farmers.select('//div/span[@class="chamada"]/a[contains(text(), "Telefone")]/text()').extract()
        item['email'] = farmers.select('//div/span[@class="chamada"]/a[contains(text(), "E-mail")]/text()').extract()
        print item.values()
        return item
This is my pipeline:
import csv

class CsvWriterPipeline(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('items.csv', 'wb'))

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['nome'], item['phone'], item['email']])
        return item
Each page of the website contains a list of names, phone numbers and emails. The code above outputs a CSV file with three columns and one row per page: each cell in the first column holds a list of all the names on that page, each cell in the second column a list of all the phones, and each cell in the third column a list of all the emails.
What I really want is each name, phone and email on its own line. I tried to do this by looping over each item, but it only prints the first name, phone and email of each page. (Is that because the callback moves the crawler to the next URL as soon as the spider function returns an item? Does it?)
How would you do that?
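To make the output I am after concrete, here is a plain-Python sketch (the names and numbers are made up, and this leaves Scrapy out entirely) of how the three extracted lists would need to be combined into one row per person:

```python
# Made-up sample data standing in for the three lists the spider extracts.
nomes = ['Ana', 'Bruno']
phones = ['Telefone: 1111-1111', 'Telefone: 2222-2222']
emails = ['ana@example.com', 'bruno@example.com']

# Zipping the parallel lists gives one (nome, phone, email) tuple per person,
# i.e. one CSV row per person instead of one row per page.
rows = list(zip(nomes, phones, emails))
for row in rows:
    print(row)
```

This is only an illustration of the desired shape of the data, assuming the three lists always line up index by index.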
Here is the item:
from scrapy.item import Item, Field

class AmpaItem(Item):
    nome = Field()
    phone = Field()
    email = Field()