I'm trying to parse a very large HTML document iteratively (I know... yuck) to reduce the amount of memory used. The problem is that I keep getting XML syntax errors such as:
lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59
When this happens, parsing stops entirely.
Is there a way to iterate through the HTML without choking on syntax errors?
My current workaround is to extract the line number from the XMLSyntaxError exception, delete that line from the document, and restart the process. This seems like a rather ugly solution. Is there a better way?
Edit:
This is what I am doing now:
    import re
    from lxml import etree

    context = etree.iterparse(tfile, events=('start', 'end'), html=True)
    in_table = False
    header_row = True
    while context:
        try:
            event, el = next(context)
            # do something

            # remove old elements
            while el.getprevious() is not None:
                del el.getparent()[0]
        except etree.XMLSyntaxError as e:
            print(e.msg)
            # pull the offending line number out of the error message
            lineno = int(re.search(r'line (\d+),', e.msg).group(1))
            # delete that line from the file, then restart parsing from the top
            remove_line(tfilename, lineno)
            tfile = open(tfilename)
            context = etree.iterparse(tfile, events=('start', 'end'), html=True)
        except KeyError:
            print('oops keyerror')
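The remove_line helper is not shown in the post. As a rough sketch of the workaround described above, it might look something like the following, assuming it rewrites the file in place and that line numbers are 1-based. The name error_line_number is hypothetical and only wraps the same regex lookup in a function; neither function is part of lxml.

```python
import os
import re
import tempfile


def error_line_number(message):
    """Return the line number named in an error message, or None if absent."""
    match = re.search(r'line (\d+)', message)
    return int(match.group(1)) if match else None


def remove_line(path, lineno):
    """Rewrite the file at `path` without its 1-based line `lineno`.

    Assumed behavior: the original post does not show this helper.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Stream line by line into a temporary file in the same directory,
    # so memory use stays flat and the final rename stays on one filesystem.
    with open(path, 'rb') as src, tempfile.NamedTemporaryFile(
            'wb', dir=directory, delete=False) as dst:
        for current, line in enumerate(src, start=1):
            if current != lineno:
                dst.write(line)
    # Swap the filtered copy in place of the original file.
    os.replace(dst.name, path)
```

Note that every removal copies the whole file and the parse then restarts from the beginning, so this gets expensive when a large document contains many bad lines.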
Acorn