I am trying to remove all project1 nodes (along with their children) from the sample XML document below using the SAX parser. The real source document is about 30 GB. Writing the result to a separate modified file would be ideal, but in-place editing is fine too.
sample.xml
<ROOT>
    <test src="http://dfs.com">Hi</test>
    <project1>This is old data<foo></foo></project1>
    <bar>
        <project1>ty</project1>
        <foo></foo>
    </bar>
</ROOT>
Here is my attempt.
parser.py
from xml.sax.handler import ContentHandler
import xml.sax


class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self, out_file):
        self._charBuffer = []
        self._result = []
        self._out = open(out_file, 'w')

    def _createElement(self, name, attrs):
        attributes = attrs.items()
        if attributes:
            out = ''
            for key, value in attributes:
                out += ' {}={}'.format(key, value)
            return '<{}{}>'.format(name, out)
        return '<{}>'.format(name)

    def _getCharacterData(self):
        data = ''.join(self._charBuffer).strip()
        self._charBuffer = []
        self._out.write(data.strip())
I cannot get it to work.