Parse malformed attribute using BeautifulSoup

I am trying to extract an attribute containing an invalid unescaped quote:

<meta content="mal"formed">

With BeautifulSoup, I read the attribute like this:

soup.find('meta')['content']

As expected, the result is mal.

Is there a way to make BeautifulSoup treat the unescaped quote as part of the attribute value, so that the result is mal"formed?
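To make the behaviour above concrete, here is a minimal sketch that uses only the standard-library html.parser module (the tokenizer that BeautifulSoup's "html.parser" option is built on). The class name AttrCollector is made up for this illustration. It only shows where the tokenizer ends the content value; it is not a fix.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record the attributes of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs as the tokenizer split them
        self.seen.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed('<meta content="mal"formed">')
collector.close()

tag, attrs = collector.seen[0]
# The value of content stops at the second quote character.
print(tag, repr(attrs["content"]))
```

Whatever the tokenizer does with the leftover formed" text, it is no longer part of the content value by the time BeautifulSoup sees the tag.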

+4
2 answers

Here is what I tried in order to fix that broken HTML:

  • Different BeautifulSoup parsers: html.parser, html5lib, lxml
  • lxml.html with and without recover=True

    from lxml.html import HTMLParser, fromstring, tostring
    
    data = """<meta content="mal"formed">"""
    
    parser = HTMLParser(recover=True)
    print(tostring(fromstring(data, parser=parser), encoding="unicode"))
    

    Prints:

    <html><head><meta content="mal" formed></head></html>
    
  • Run Firefox and Chrome through selenium and feed them the broken meta tag:

    from selenium import webdriver
    
    data = """<meta content="mal"formed">"""
    
    driver = webdriver.Chrome()  # or webdriver.Firefox
    driver.get("about:blank")
    
    driver.execute_script("document.head.innerHTML = '{html}';".format(html=data))
    data = driver.page_source
    driver.close()
    
    print(data)
    

    Prints:

    <html xmlns="http://www.w3.org/1999/xhtml"><head><meta content="mal" formed"="" /></head><body></body></html>
    

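Different tools interpret the broken HTML differently, and each one repairs it in its own way.

In every attempt above, though, the stray quote still ends the attribute value, so content comes out as mal.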

0

A regex-based workaround that escapes the stray quote before parsing:

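import re
from bs4 import BeautifulSoup
# Escape the stray quote inside content="..." before parsing
html = re.sub(r'(content="[^"=]+)"([^"=]+")', r'\1&quot;\2', html)
soup = BeautifulSoup(html, 'html.parser')
soup.find('meta')['content']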

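Explanation: at first I tried to run the regex only on the element I wanted. However, str(element) does not give back the original html. BeautifulSoup returns its own re-serialized html, in which formed (the invalid part) is no longer in the attribute value.

So the solution searches the whole HTML for this exact kind of malformed attribute with a regex. Of course, it is very specific to this case.

A better (and, hopefully, less hacky) solution would be appreciated.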
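The whole-document substitution described above can be checked end to end with a small sketch. It uses the same pattern as the answer, but parses with the standard-library html.parser instead of BeautifulSoup so that it runs without extra packages. The names escape_stray_quote, ContentGrabber and meta_contents are hypothetical helpers introduced only for this example.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the answer: the start of a content value, a stray quote,
# then the rest of the value up to the real closing quote.
STRAY_QUOTE = re.compile(r'(content="[^"=]+)"([^"=]+")')


def escape_stray_quote(html):
    """Replace the stray quote inside content="..." with &quot;."""
    return STRAY_QUOTE.sub(r'\1&quot;\2', html)


class ContentGrabber(HTMLParser):
    """Collect the content attribute of every <meta> tag."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.contents.append(dict(attrs).get("content"))


def meta_contents(html):
    """Escape stray quotes first, then parse and return the content values."""
    grabber = ContentGrabber()
    grabber.feed(escape_stray_quote(html))
    grabber.close()
    return grabber.contents


broken = '<meta content="mal"formed">'
fine = '<meta content="ok" name="description">'
spill = '<meta content="a"> he said "hi"'

print(escape_stray_quote(broken))        # markup with the quote escaped
print(meta_contents(broken))             # parsed value keeps the quote
print(escape_stray_quote(fine) == fine)  # well-formed tag is left alone
print(escape_stray_quote(spill))         # pattern reaches past the tag end
```

The last line shows why the pattern is specific to this case: nothing ties the match to a single tag, so a later quote in ordinary text can be pulled into the attribute if no = sign appears in between.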

0

Source: https://habr.com/ru/post/1620639/