Scrapy and Wikipedia getting blank lines

Question

I am a Python beginner and this is my first question. I am trying to scrape the titles and URLs from this Wikipedia page and save them as .json. When I test the XPath in the Scrapy shell with

sel.xpath('//tr[@style="vertical-align: top;"]//a')

the terminal shows the URLs and titles inside the tags, but the full script produces a .json file with empty entries like this:

[{"url": [], "title": []},
{"url": [], "title": []},
{"url": [], "title": []},
{"url": [], "title": []}],

Here is the whole script:

from scrapy.spider import Spider
from scrapy.selector import Selector

from wikipedia.items import WikipediaItem


class MySpider(Spider):
    name = "wiki"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = [
        "http://en.wikipedia.org/wiki/Category:2013_films"
        ]

    def parse(self, response):
        sel = Selector(response)
        titles = sel.xpath('//tr[@style="vertical-align: top;"]//li')
        items = []
        for title in titles:
            item = WikipediaItem()
            item["url"] = sel.xpath("a/@href").extract()
            item["title"] = sel.xpath("a/text()").extract()
            items.append(item)  
        return(items)

If you can give me a clue about what I am doing wrong, that would be great. Thanks!

+4

python-2.7 xpath scrapy

ikechi Mar 07 '14 at 14:12

1 answer

alecxe · Accepted Answer · 2014-03-07T14:20:08+0000

Here's the fixed parse() method:

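def parse(self, response):
    sel = Selector(response)
    # Select every list item inside the top-aligned table rows.
    titles = sel.xpath('//tr[@style="vertical-align: top;"]//li')
    for title in titles:
        # Query relative to the current <li>, not the whole page.
        link = title.xpath("a")
        if link:
            item = WikipediaItem()
            # extract() returns a list; take the first match.
            item["url"] = link.xpath("@href").extract()[0]
            item["title"] = link.xpath("text()").extract()[0]
            # Yield only when a link was found, so no empty items are emitted.
            yield item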

prints:

{'title': u'+1 (film)', 'url': u'/wiki/%2B1_(film)'}
{'title': u'10,000 Hours', 'url': u'/wiki/10,000_Hours'}
...

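The main problem is that, inside the loop, you were calling xpath on sel instead of title. A relative path such as a/@href evaluated against sel is resolved from the document root, which has no direct a children, so every item ended up with empty lists.

Also, instead of collecting the items in a list and returning it, you can simply yield each item.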
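
The effect of evaluating a relative path against the wrong node can be reproduced offline. The following is a minimal sketch, not Scrapy code: it assumes a made-up markup fragment (SAMPLE) and uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector, so the method names differ (findall/find instead of xpath), but the relative-path behaviour it illustrates is the same.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the category page markup.
SAMPLE = """
<table>
  <tr style="vertical-align: top;">
    <td>
      <ul>
        <li><a href="/wiki/%2B1_(film)">+1 (film)</a></li>
        <li><a href="/wiki/10,000_Hours">10,000 Hours</a></li>
        <li>no link here</li>
      </ul>
    </td>
  </tr>
</table>
"""

root = ET.fromstring(SAMPLE)
rows = root.findall(".//li")

# Buggy pattern: the relative path "a" is evaluated against the root,
# which has no direct <a> children, so every iteration gets an empty list.
buggy = [root.findall("a") for _ in rows]

# Fixed pattern: evaluate the relative path against each <li> in turn,
# and skip list items that contain no direct link.
fixed = []
for li in rows:
    link = li.find("a")
    if link is not None:
        fixed.append({"url": link.get("href"), "title": link.text})

print(buggy)
print(fixed)
```

The buggy list holds one empty result per list item, which mirrors the empty "url" and "title" fields in the question, while the fixed loop collects one entry per linked title.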
