Formatting text output with Scrapy in Python

Question

I am trying to scrape pages with a Scrapy spider and then save them to a .txt file in readable form. This is the code I am using:

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)

        title = hxs.select('/html/head/title/text()').extract()
        content = hxs.select('//*[@id="content"]').extract()

        texts = "%s\n\n%s" % (title, content)

        soup = BeautifulSoup(''.join(texts))
        strip = ''.join(BeautifulSoup(pretty).findAll(text=True))

        filename = ("/Users/username/path/output/Hansard-" + '%s'".txt") % (title)
        filly = open(filename, "w")
        filly.write(strip)

I brought in BeautifulSoup because the main text contains a lot of HTML that I don't want in the final product (primarily links), so I use BS to strip out the HTML and leave only the text of interest.

This gives me output that looks like this:

 [u"School, Chandler Ford (Hansard, 30 November 1961)"] [u' \n \n HC Deb 30 November 1961 vol 650 cc608-9 \n 608 \n \n \n \n \xa7 \n 28. \n Dr. King \n \n asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler\ Ford; and why he refused permission to acquire this site in 1954.\n \n \n \n \n \n \n \n \xa7 \n Sir D. Eccles \n \n I understand that the authority has paid \xa375,000 for this site.\n \n

Instead, I want the result to look like this:

  School, Chandler Ford (Hansard, 30 November 1961) HC Deb 30 November 1961 vol 650 cc608-9 608 28. Dr. King asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler Ford; and why he refused permission to acquire this site in 1954. Sir D. Eccles I understand that the authority has paid £75,000 for this site.

So I'm basically looking for a way to remove the newline markers (\n), collapse the extra whitespace, and convert any special characters to their normal form.

python text web-scraping scrapy

user1074057 Dec 18 '11 at 4:45

1 answer

reclosedev · Accepted Answer · 2011-12-18T07:15:21+0000

My answer is in the comments of the code:

    import re
    import codecs

    #...
    #...

    # extract() returns a list, so you need to take the first element
    title = hxs.select('/html/head/title/text()').extract()[0]
    content = hxs.select('//*[@id="content"]')
    # instead of using BeautifulSoup for this task, you can use the following
    content = content.select('string()').extract()[0]

    # simply collapse duplicated spaces and newlines; you may need to adjust this expression
    cleaned_content = re.sub(ur'(\s)\s+', ur'\1', content, flags=re.MULTILINE + re.UNICODE)

    texts = "%s\n\n%s" % (title, cleaned_content)

    # looks like a typo in the filename creation
    #filename ....

    # and my preferred way to write a file with an encoding
    with codecs.open(filename, 'w', encoding='utf-8') as output:
        output.write(texts)
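The two ideas in the answer (take the first element of the list that extract() returns, then collapse the whitespace) can be tried without Scrapy. The following is only a minimal Python 3 sketch: `first_or_empty` and `collapse_whitespace` are hypothetical helper names, the sample lists are shortened copies of the output shown in the question, and the pattern `\s+` is a swapped-in variant of the expression above that turns every run of whitespace, newlines included, into a single space.

```python
import re


def first_or_empty(extracted):
    # extract() returns a list of strings; formatting the list itself with %s
    # is what produces the [u'...'] wrapper seen in the question's output.
    return extracted[0] if extracted else ""


def collapse_whitespace(text):
    # Replace each run of whitespace (spaces, tabs, newlines) with one space
    # and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample data shaped like the lists that extract() returns.
title = first_or_empty(["School, Chandler Ford (Hansard, 30 November 1961)"])
content = first_or_empty([
    " \n \n HC Deb 30 November 1961 vol 650 cc608-9 \n 608 \n \n \n"
    " Sir D. Eccles \n \n I understand that the authority has paid"
    " \xa375,000 for this site.\n \n"
])

texts = "%s\n\n%s" % (title, collapse_whitespace(content))
print(texts)
```

Characters such as \xa3 (the pound sign) are left as they are and only look escaped when a list is printed instead of a string. The section sign \xa7 from the original page would also survive this cleanup, so it needs a separate replacement if it is unwanted.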
