Parse malformed attribute using BeautifulSoup

Question

I am trying to extract an attribute containing an invalid unescaped quote:

<meta content="mal"formed">

Using BeautifulSoup:

soup.find('meta')['content']

As expected, the result is mal.

Is there a way to make BeautifulSoup treat the unescaped quote as part of the attribute, so that the result is mal"formed?

+4

python html beautifulsoup

Tzach Dec 17 '15 at 21:20

2 answers

alecxe · Answer 1 · 2015-12-17T21:49:21+0000

Here is what I tried in order to fix that broken HTML:

Different BeautifulSoup parsers: html.parser, html5lib, lxml
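
To illustrate why the parser choice matters, here is a minimal sketch (not part of the original answer) that uses only Python's standard html.parser module, which BeautifulSoup's "html.parser" backend builds on. The class MetaAttrCollector and the function meta_attributes are hypothetical names introduced for this example. It shows that the tokenizer closes the attribute value at the stray quote.

```python
from html.parser import HTMLParser


class MetaAttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <meta> start tag."""

    def __init__(self):
        super().__init__()
        self.meta_attrs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs as seen by the tokenizer
        if tag == "meta":
            self.meta_attrs.append(dict(attrs))


def meta_attributes(html):
    """Return one dict of attributes per <meta> tag found in html."""
    collector = MetaAttrCollector()
    collector.feed(html)
    collector.close()
    return collector.meta_attrs


attrs = meta_attributes('<meta content="mal"formed">')
print(attrs[0]["content"])  # the value stops at the stray quote
```

The rest of the tag (formed") is treated as a separate attribute rather than as part of content, which is consistent with the behaviour described in the question.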

lxml.html with and without recover=True

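from lxml.html import HTMLParser, fromstring, tostring

data = """<meta content="mal"formed">"""

# recover=True asks the parser to keep going through broken markup
parser = HTMLParser(recover=True)
# encoding="unicode" makes tostring() return text instead of bytes
print(tostring(fromstring(data, parser=parser), encoding="unicode"))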

Prints:

<html><head><meta content="mal" formed></head></html>

Run Firefox and Chrome through selenium and feed them the broken meta tag:

from selenium import webdriver

data = """<meta content="mal"formed">"""

driver = webdriver.Chrome()  # or webdriver.Firefox
driver.get("about:blank")

driver.execute_script("document.head.innerHTML = '{html}';".format(html=data))
data = driver.page_source
driver.close()

print(data)

Prints:

<html xmlns="http://www.w3.org/1999/xhtml"><head><meta content="mal" formed"="" /></head><body></body></html>

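In both outputs the HTML is repaired by ending the content attribute at the stray quote, so the value is still mal.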

Tzach · Answer 2 · 2015-12-17T22:21:28+0000

A regex workaround:

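import re
from bs4 import BeautifulSoup

# escape a stray quote inside content="..." before parsing (html holds the page source)
html = re.sub('(content="[^"=]+)"([^"=]+")', r'\1&quot;\2', html)
soup = BeautifulSoup(html, "html.parser")
soup.find('meta')['content']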
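
The substitution above can be checked without BeautifulSoup. The following is a minimal sketch using only the standard library; escape_stray_quote, ContentReader and read_content are hypothetical names introduced here, and the pattern is the one from this answer. It assumes a single stray quote inside a content attribute.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the answer: a quote inside content="..." that is
# followed by more text and a closing quote, with no "=" in between.
STRAY_QUOTE = re.compile(r'(content="[^"=]+)"([^"=]+")')


def escape_stray_quote(html):
    """Replace the stray quote with the &quot; entity."""
    return STRAY_QUOTE.sub(r'\1&quot;\2', html)


class ContentReader(HTMLParser):
    """Hypothetical helper: remembers the content of the first <meta> tag."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.content is None:
            self.content = dict(attrs).get("content")


def read_content(html):
    reader = ContentReader()
    reader.feed(html)
    reader.close()
    return reader.content


fixed = escape_stray_quote('<meta content="mal"formed">')
print(fixed)
print(read_content(fixed))
```

Excluding "=" from both character classes is what keeps a well-formed tag such as <meta content="ok" name="x"> untouched. The pattern only rewrites the first stray quote in an attribute, and it can misfire when the text after a well-formed attribute contains another quote before any "=", so it is best applied to a narrowly selected piece of markup.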

: . str(element) BeautifulSoup html, html, formed () .

, HTML regex. , .

