To start, I'm completely new to this, so expect some hacked-together code copied and pasted from all sorts of sources.
I want to strip all the HTML from what Scrapy returns. Everything is stored in MySQL without any problems, but I can't work out how to remove the many "<td>" and other HTML tags. I initially just used /text().extract(), but every so often it ran into a cell formatted like the first one below:
<td> <span class="caps">TEXT</span> </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>
There is no pattern that would let me simply choose between using /text() or not, so I am looking for the easiest way a newbie can implement to strip all of this out.
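To make the problem concrete, here is a minimal sketch of why selecting only a cell's own text misses the nested case. It is an illustration and not part of the spider: it uses the standard library's xml.etree.ElementTree on two hand-written cells, assumes Python 3, and the helper names direct_text and all_text are made up for this example. Roughly speaking, direct_text behaves like td/text() and all_text like td//text().

```python
import xml.etree.ElementTree as ET


def direct_text(cell):
    """Join only the text nodes that are direct children of the cell."""
    parts = [cell.text or ""]
    # Text that follows a child element still belongs to the cell itself.
    parts.extend(child.tail or "" for child in cell)
    return "".join(parts).strip()


def all_text(cell):
    """Join the text of the cell and of every element nested inside it."""
    return "".join(cell.itertext()).strip()


plain = ET.fromstring("<td> Text </td>")
nested = ET.fromstring('<td> <span class="caps">TEXT</span> </td>')

print(repr(direct_text(plain)))   # 'Text'
print(repr(direct_text(nested)))  # '' because the text lives inside <span>
print(repr(all_text(nested)))     # 'TEXT'
```

Collecting descendant text handles both cell shapes, so no per-cell decision is needed.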
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
import html2text
from scraper.items import LivingSocialDeal


class CFBDVRB(BaseSpider):
    name = "cfbdvrb"
    allowed_domains = ["url"]
    start_urls = [
        "url",
    ]

    deals_list_xpath = '//table[@class="tbl data-table"]/tbody/tr'
    item_fields = {
        'title': './/td[1]',
        'link': './/td[2]',
        'location': './/td[3]',
        'original_price': './/td[4]',
        'price': './/td[5]',
    }

    def parse(self, response):
        selector = HtmlXPathSelector(response)

        for deal in selector.xpath(self.deals_list_xpath):
            loader = XPathItemLoader(LivingSocialDeal(), selector=deal)
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            converter = html2text.HTML2Text()
            converter.ignore_links = True
            yield loader.load_item()
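One possible way to drop the tags, sketched here under a few assumptions: the field XPaths above select whole <td> elements, so each extracted value is an HTML string such as '<td> Text </td>'. The helper below turns such a string into plain text using only the standard library. The names strip_tags and _TextCollector are hypothetical, and the sketch assumes Python 3, whereas the spider above is Python 2.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text content of a fragment and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(fragment):
    """Return the text of an HTML fragment with whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    # split() followed by join() trims the ends and collapses inner whitespace.
    return " ".join("".join(collector.chunks).split())


print(strip_tags('<td> <span class="caps">TEXT</span> </td>'))  # TEXT
print(strip_tags("<td> Text </td>"))                             # Text
```

A function like this could probably be passed to MapCompose in place of unicode.strip, since it also trims whitespace, but that wiring is untested here.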