Iteratively parsing HTML (with lxml?)

I'm currently trying to iteratively parse a very large HTML document (I know .. yuck) to reduce the amount of memory used. The problem I am facing is that I am getting XML syntax errors, such as:

lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59

In this case, everything stops.

Is there a way to iterate through the HTML without choking on syntax errors?

At the moment I extract the line number from the XMLSyntaxError exception, delete that line from the document, and restart the process. This seems like a rather ugly solution. Is there a better way?

Edit:

This is what I am doing now:

    import re
    from lxml import etree

    context = etree.iterparse(tfile, events=('start', 'end'), html=True)
    in_table = False
    header_row = True
    while context:
        try:
            event, el = next(context)
            # do something
            # remove old elements
            while el.getprevious() is not None:
                del el.getparent()[0]
        except StopIteration:
            # end of document reached
            break
        except etree.XMLSyntaxError as e:
            print(e.msg)
            # cut the offending line out of the file and start over
            lineno = int(re.search(r'line (\d+),', e.msg).group(1))
            remove_line(tfilename, lineno)  # my own helper, not shown
            tfile = open(tfilename, 'rb')
            context = etree.iterparse(tfile, events=('start', 'end'), html=True)
        except KeyError:
            print('oops keyerror')
4 answers

The perfect solution turned out to be Python's very own HTMLParser [docs].

This is the (rather bad) code I ended up using:

    from html.parser import HTMLParser


    class MyParser(HTMLParser):
        def __init__(self):
            self.finished = False
            self.in_table = False
            self.in_row = False
            self.in_cell = False
            self.current_row = []
            self.current_cell = ''
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if not self.in_table:
                # wait for the one table we care about
                if tag == 'table':
                    if ('id' in attrs) and (attrs['id'] == 'dgResult'):
                        self.in_table = True
            else:
                if tag == 'tr':
                    self.in_row = True
                elif tag == 'td':
                    self.in_cell = True
                elif (tag == 'a') and (len(self.current_row) == 7):
                    # the eighth cell holds a link: keep its URL, not its text
                    url = attrs['href']
                    self.current_cell = url

        def handle_endtag(self, tag):
            if tag == 'tr':
                if self.in_table:
                    if self.in_row:
                        self.in_row = False
                        print(self.current_row)
                        self.current_row = []
            elif tag == 'td':
                if self.in_table:
                    if self.in_cell:
                        self.in_cell = False
                        self.current_row.append(self.current_cell.strip())
                        self.current_cell = ''
            elif (tag == 'table') and self.in_table:
                self.finished = True

        def handle_data(self, data):
            # text may arrive in several pieces, so accumulate it
            if not len(self.current_row) == 7:
                if self.in_cell:
                    self.current_cell += data

With this code, I could do this:

    parser = MyParser()
    for line in myfile:
        parser.feed(line)
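
One thing the loop above does not do is check the parser's finished flag, so it keeps reading after the table has ended. Here is a minimal, self-contained sketch (Python 3, standard library only) of the same idea that feeds the parser fixed-size chunks and stops once the target table closes. The names CellCollector, extract_rows and chunk_size are made up for illustration and are not part of the code above. Like that code, it assumes the target table contains no nested tables.

```python
import io
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> in one table, row by row."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.finished = False
        self.rows = []
        self.current_row = []
        self.current_cell = ''

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # for a repeated attribute name, the last value wins
        if tag == 'table' and not self.finished and attrs.get('id') == self.table_id:
            self.in_table = True
        elif self.in_table and tag == 'td':
            self.in_cell = True

    def handle_endtag(self, tag):
        if not self.in_table:
            return
        if tag == 'td' and self.in_cell:
            self.in_cell = False
            self.current_row.append(self.current_cell.strip())
            self.current_cell = ''
        elif tag == 'tr':
            if self.current_row:
                self.rows.append(self.current_row)
            self.current_row = []
        elif tag == 'table':
            self.in_table = False
            self.finished = True

    def handle_data(self, data):
        # Text can be split across feed() calls, so accumulate it.
        if self.in_cell:
            self.current_cell += data


def extract_rows(stream, table_id, chunk_size=64 * 1024):
    """Feed a text stream to the parser in chunks; stop when the table ends."""
    parser = CellCollector(table_id)
    while not parser.finished:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()
    return parser.rows


# Tiny chunks on purpose, to show that tags split across chunks are handled.
sample = io.StringIO(
    '<table id="other"><tr><td>skip</td></tr></table>'
    '<table id="dgResult">'
    '<tr><td class="a" class="b"> one </td><td>two</td></tr>'
    '<tr><td>three</td><td>four</td></tr>'
    '</table><p>ignored'
)
rows = extract_rows(sample, 'dgResult', chunk_size=16)
```

Note that the duplicated class attribute, the kind of thing that triggers "Attribute name redefined" in a strict parser, is simply passed through by HTMLParser.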

Nowadays lxml's etree.iterparse supports the recover=True keyword argument, so instead of writing a custom HTMLParser subclass to cope with broken HTML, you can simply pass this argument to iterparse.

To parse a huge, broken HTML file, this call is all you need:

 etree.iterparse(tfile, events=('start', 'end'), html=True, recover=True) 
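
For what it's worth, here is a rough sketch of how that call might be used end to end, assuming a reasonably recent lxml. It listens only for the end of each tr element, collects the cell texts, and then frees the row using the same cleanup idiom as in the question. The function name iter_rows and the sample markup are invented for illustration.

```python
import io

from lxml import etree


def iter_rows(source):
    """Yield each table row as a list of cell texts, freeing rows as we go."""
    context = etree.iterparse(
        source, events=('end',), tag='tr', html=True, recover=True
    )
    for _event, row in context:
        yield [''.join(td.itertext()).strip() for td in row.iter('td')]
        # The row has been consumed: drop its content and any earlier
        # siblings so the in-memory tree stays small.
        row.clear()
        while row.getprevious() is not None:
            del row.getparent()[0]


# Deliberately broken markup: a duplicated attribute and unclosed <td> tags.
broken = io.BytesIO(
    b'<html><body><table id="dgResult">'
    b'<tr><td class="a" class="b">one</td><td>two</td></tr>'
    b'<tr><td>three<td>four</tr>'
    b'</table></body></html>'
)
rows = list(iter_rows(broken))
```

For a real file you would pass an open binary file object (or a filename) instead of the BytesIO used here.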

Pass html=True and huge_tree=True to iterparse.


Try parsing your HTML document with lxml.html:

Since version 2.0, lxml comes with a dedicated Python package for dealing with HTML: lxml.html. It is based on lxml's HTML parser, but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.
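
A small sketch of what that looks like, with made-up sample markup. Bear in mind that lxml.html.fromstring builds the whole tree in memory, so on its own it does not address the memory concern in the question. It only shows that the HTML parser tolerates the broken markup.

```python
import lxml.html

# Deliberately broken markup: a duplicated attribute and unclosed <td> tags.
broken = (
    '<table id="dgResult">'
    '<tr><td class="a" class="b">one<td>two</tr>'
    '</table>'
)
table = lxml.html.fromstring(broken)
cells = [td.text_content().strip() for td in table.iter('td')]
```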


Source: https://habr.com/ru/post/1385925/

