I am a newbie.
This is my spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from ampa.items import AmpaItem

class AmpaSpider(CrawlSpider):
    name = "ampa"
    allowed_domains = ['website']
    start_urls = ['website/page']
    rules = (Rule(SgmlLinkExtractor(allow=('associados?',), deny=('associado/',)),
                  callback='parse_page', follow=True),)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = AmpaItem()
        farmers = hxs.select('//div[@class="span-24 tx_left"]')
        item['nome'] = farmers.select('//div/h3[@class="titulo"]/a/text()').extract()
        item['phone'] = farmers.select('//div/span[@class="chamada"]/a[contains(text(), "Telefone")]/text()').extract()
        item['email'] = farmers.select('//div/span[@class="chamada"]/a[contains(text(), "E-mail")]/text()').extract()
        print item.values()
        return item
This is my pipeline:
import csv

class CsvWriterPipeline(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('items.csv', 'wb'))

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['nome'], item['phone'], item['email']])
        return item
Each page of the website contains a list of names, phone numbers and emails. The code above outputs a CSV file with three columns and one row per page: each cell in the first column holds a list of all the names on that page, each cell in the second column a list of all the phones, and each cell in the third column a list of all the emails.
What I really want is each name, phone and email on its own line. I tried to do this by looping over each item, but it only prints the first name, phone and email of each page. (Is that because the callback moves the crawler to the next URL as soon as the spider function returns an item? Does it?)
How would you do that?
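To make the output I am after concrete, here is a plain-Python sketch (the names and numbers are made up, and this leaves Scrapy out entirely) of how the three extracted lists would need to be combined into one row per person:

```python
# Made-up sample data standing in for the three lists the spider extracts.
nomes = ['Ana', 'Bruno']
phones = ['Telefone: 1111-1111', 'Telefone: 2222-2222']
emails = ['ana@example.com', 'bruno@example.com']

# Zipping the parallel lists gives one (nome, phone, email) tuple per person,
# i.e. one CSV row per person instead of one row per page.
rows = list(zip(nomes, phones, emails))
for row in rows:
    print(row)
```

This is only an illustration of the desired shape of the data, assuming the three lists always line up index by index.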
Here is the item:
from scrapy.item import Item, Field

class AmpaItem(Item):
    nome = Field()
    phone = Field()
    email = Field()