Removing HTML tags without /text().extract()

First off, I'm completely new to this, so be prepared for some butchered code copied and pasted from all over the place.

I'm looking to remove any HTML that Scrapy returns. Everything gets stored in MySQL without any problems, but what I can't figure out is how to strip out the many <td> and other HTML tags. I initially just used /text().extract(), but every so often it would hit a cell formatted like the first one here:

<td>    <span class="caps">TEXT</span>  </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>

There is no pattern that would let me simply choose between using /text() or not. I'm looking for the easiest way a beginner can implement to strip all of this out.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
import html2text
from scraper.items import LivingSocialDeal


class CFBDVRB(BaseSpider):
    name = "cfbdvrb"
    allowed_domains = ["url"]
    start_urls = [
        "url",
    ]

    deals_list_xpath = '//table[@class="tbl data-table"]/tbody/tr'
    item_fields = {
        'title': './/td[1]',
        'link': './/td[2]',
        'location': './/td[3]',
        'original_price': './/td[4]',
        'price': './/td[5]',
    }

    def parse(self, response):
        selector = HtmlXPathSelector(response)

        for deal in selector.xpath(self.deals_list_xpath):
            loader = XPathItemLoader(LivingSocialDeal(), selector=deal)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)

            converter = html2text.HTML2Text()
            converter.ignore_links = True
            yield loader.load_item()

The converter = html2text lines are left over from my attempt to strip the tags with html2text.

+4
2 answers

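Scrapy depends on w3lib, so it is installed together with Scrapy.

The w3lib functions used to be part of Scrapy itself (pre 0.22). In those older versions you can find them in scrapy.utils.markup.

Assuming my_text holds the HTML string, you can do: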

>>> from w3lib.html import remove_tags
>>> my_text
'<td>    <span class="caps">TEXT</span>  </td>\n<td>    Text    </td>\n<td>    Text    </td>\n<td>    Text    </td>\n<td>    Text    </td>'
>>> remove_tags(my_text)
u'    TEXT  \n    Text    \n    Text    \n    Text    \n    Text    '

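You can then strip or normalize the leftover whitespace, either with a regex or with the other html helpers in w3lib (see its documentation).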
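As a rough illustration of the whitespace cleanup step, here is a minimal sketch that uses only the standard library. It starts from the string that remove_tags returned in the session above; the variable names (stripped, one_line, values) are made up for this example.

```python
import re

# The string returned by remove_tags(my_text) in the session above.
stripped = '    TEXT  \n    Text    \n    Text    \n    Text    \n    Text    '

# Option 1: collapse every run of whitespace (spaces and newlines)
# into a single space, then trim both ends.
one_line = re.sub(r'\s+', ' ', stripped).strip()

# Option 2: keep one value per line and drop lines that end up empty.
values = [line.strip() for line in stripped.splitlines() if line.strip()]

print(one_line)
print(values)
```

Option 1 suits a field that should end up as one string; option 2 keeps the cells apart, which is probably closer to what a per-column database insert needs.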

If you need more control, go with BS4.

+8

An easy way to do this is with BeautifulSoup.

Say html_text holds the following HTML:

<td>    <span class="caps">TEXT</span>  </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>

Then, to remove the HTML tags:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_text, 'html.parser')
just_text = soup.get_text()

"just_text" will now contain the following (apart from the leftover whitespace around each value):

TEXT
Text
Text
Text
Text

BeautifulSoup is a separate library, not part of Scrapy; see the BeautifulSoup documentation for details.

EDIT:

Here is a complete example using your HTML:

from bs4 import BeautifulSoup


html_text = """
<td>    <span class="caps">TEXT</span>  </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>
"""

soup = BeautifulSoup(html_text, 'html.parser')

# find_all returns every <td> element in the document
list_of_tds = soup.find_all('td')

for td_element in list_of_tds:
    print(td_element.get_text())

Note that this uses BeautifulSoup 4, which has to be installed separately.
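The loop above prints each cell with its surrounding whitespace still attached. As a possible refinement, get_text accepts a strip argument that trims each text fragment. The sketch below assumes BeautifulSoup 4 with the built-in 'html.parser' backend and uses a shortened copy of the sample HTML.

```python
from bs4 import BeautifulSoup

html_text = """
<td>    <span class="caps">TEXT</span>  </td>
<td>    Text    </td>
"""

soup = BeautifulSoup(html_text, 'html.parser')

# strip=True trims every text fragment inside the cell and skips
# fragments that consist only of whitespace.
cells = [td.get_text(strip=True) for td in soup.find_all('td')]
print(cells)
```

This gives one clean string per cell, regardless of whether the text sits directly in the td or inside a nested span.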

0

Source: https://habr.com/ru/post/1612939/

