I am a Python beginner and this is my first question. I am trying to scrape the names and URLs from this Wikipedia page and save them to a .json file. In preliminary XPath testing in the Scrapy shell with
sel.xpath('//tr[@style="vertical-align: top;"]//a')
the terminal displays the URLs and titles inside the tags, but the full script returns a .json file with empty entries like this:
[{"url": [], "title": []},
{"url": [], "title": []},
{"url": [], "title": []},
{"url": [], "title": []}],
Here is the whole script:
from scrapy.spider import Spider
from scrapy.selector import Selector
from wikipedia.items import WikipediaItem

class MySpider(Spider):
    name = "wiki"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = [
        "http://en.wikipedia.org/wiki/Category:2013_films"
    ]

    def parse(self, response):
        sel = Selector(response)
        titles = sel.xpath('//tr[@style="vertical-align: top;"]//li')
        items = []
        for title in titles:
            item = WikipediaItem()
            item["url"] = sel.xpath("a/@href").extract()
            item["title"] = sel.xpath("a/text()").extract()
            items.append(item)
        return(items)
If you can give me a clue about what I am doing wrong, that would be great. Thanks!
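One likely cause, sketched here as an illustration rather than a definitive diagnosis: inside the loop the queries run on `sel` (the whole page) instead of on the loop variable `title`, so the relative path `a/...` is evaluated from the document root, which has no direct `a` child, and every item comes back empty. The sketch below reproduces that behaviour with the standard library's `xml.etree.ElementTree` instead of Scrapy (an assumption, so that it runs without extra packages); the `SAMPLE` markup and the film names are hypothetical stand-ins for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the category page markup.
SAMPLE = """<html><body><table>
  <tr style="vertical-align: top;"><td><ul>
    <li><a href="/wiki/Film_A">Film A</a></li>
    <li><a href="/wiki/Film_B">Film B</a></li>
  </ul></td></tr>
</table></body></html>"""

root = ET.fromstring(SAMPLE)

# Same idea as the outer query in the spider: one node per list entry.
rows = root.findall(".//tr[@style='vertical-align: top;']//li")

wrong = []
fixed = []
for li in rows:
    # Mirrors the script: the relative path "a" is evaluated against the
    # whole document, where <a> is not a direct child, so nothing matches.
    links_from_root = root.findall("a")
    wrong.append({
        "url": [a.get("href") for a in links_from_root],
        "title": [a.text for a in links_from_root],
    })

    # Evaluating the same relative path against the current <li> finds
    # that entry's own link.
    links_from_li = li.findall("a")
    fixed.append({
        "url": [a.get("href") for a in links_from_li],
        "title": [a.text for a in links_from_li],
    })

print(wrong)
print(fixed)
```

In the spider itself, the analogous change would be to call `title.xpath("a/@href").extract()` and `title.xpath("a/text()").extract()` inside the loop, since each element of `titles` is itself a selector that can be queried relative to its own node.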