I am trying to scrape pages with a Scrapy spider and then save them to a .txt file in readable form. This is the code I am using:
    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)

        title = hxs.select('/html/head/title/text()').extract()
        content = hxs.select('//*[@id="content"]').extract()

        texts = "%s\n\n%s" % (title, content)

        soup = BeautifulSoup(''.join(texts))

        strip = ''.join(BeautifulSoup(pretty).findAll(text=True))

        filename = ("/Users/username/path/output/Hansard-" + '%s'".txt") % (title)
        filly = open(filename, "w")
        filly.write(strip)
I brought in BeautifulSoup because the body text contains a lot of HTML that I don't want in the final product (primarily links), so I use BS to strip out the HTML and keep only the text I am interested in.
This gives me output that looks like this:
[u"School, Chandler Ford (Hansard, 30 November 1961)"] [u' \n \n HC Deb 30 November 1961 vol 650 cc608-9 \n 608 \n \n \n \n \xa7 \n 28. \n Dr. King \n \n asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler\ Ford; and why he refused permission to acquire this site in 1954.\n \n \n \n \n \n \n \n \xa7 \n Sir D. Eccles \n \n I understand that the authority has paid \xa375,000 for this site.\n \n
However, I want the output to look like this:
School, Chandler Ford (Hansard, 30 November 1961) HC Deb 30 November 1961 vol 650 cc608-9 608 28. Dr. King asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler Ford; and why he refused permission to acquire this site in 1954. Sir D. Eccles I understand that the authority has paid £375,000 for this site.
So I'm basically looking for a way to remove the newline characters (\n), collapse the extra whitespace, and convert any special characters to their normal form.
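To make the goal concrete, the whitespace part of what I'm after could be sketched roughly as below. This is only a sketch under assumptions: `clean_text` and `sample` are hypothetical names that are not part of my spider, and the input is assumed to be a list of strings with the tags already removed (`extract()` returns a list, which is probably why brackets and quotes show up in my output when I format it with `%s`).

```python
def clean_text(fragments):
    """Join extracted text fragments and collapse every run of whitespace.

    `fragments` is assumed to be a list of strings, like the list that
    extract() returns, with the HTML tags already stripped.
    """
    # Join the list items first. Formatting the list itself with %s would
    # write its repr (brackets, quotes, escaped newlines) instead of the text.
    joined = " ".join(fragments)
    # str.split() with no arguments splits on any run of whitespace,
    # newlines included, so re-joining with single spaces flattens the text.
    return " ".join(joined.split())


# Hypothetical sample, shortened from the output shown above.
sample = [
    " \n \n HC Deb 30 November 1961 vol 650 cc608-9 \n 608 \n \n \xa7 \n 28. "
    "\n Dr. King \n \n asked the Minister of Education"
]

if __name__ == "__main__":
    print(clean_text(sample))
```

This only covers the newline and spacing problem. It leaves characters such as the section sign untouched and does not deal with encoding the text when it is written to the file.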