Parse multiple XML declarations in a single file using lxml.etree.iterparse

I need to parse a file that contains several XML documents, i.e. <xml> </xml> <xml> </xml> .. and so on. When using etree.iterparse, I get the following (correct) error:

lxml.etree.XMLSyntaxError: XML declaration allowed only at the start of the document 

I could pre-process the input file and create a separate file for each contained XML document. That may be the easiest solution, but I am wondering whether there is a proper solution for this "problem".

Thanks!

2 answers

The example data you provided indicates one problem, while the question and exception you provided suggest another. Do you have several XML documents combined together, each with its own XML declaration, or do you have an XML fragment with several top-level elements?

If it is the former, the solution involves splitting the input stream into several streams and parsing each one individually. This does not necessarily mean, as one comment suggests, implementing an XML parser. You can search a string for XML declarations without parsing anything else in it, provided your input does not include CDATA sections containing unescaped XML declarations. You can write a file-like object that returns characters from the underlying stream until it hits an XML declaration, and then wrap it in a generator function that keeps yielding streams until EOF is reached. This is not trivial, but it is not too difficult either.
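A minimal sketch of the splitting idea, simplified to work in memory rather than as a true streaming file-like object. It assumes the whole input fits in memory as bytes and that no CDATA section or comment contains something that looks like an XML declaration. The names split_documents and XML_DECL are hypothetical, and the fallback to the standard library's ElementTree is only there so the sketch runs without lxml installed.

```python
import io
import re

try:
    from lxml import etree
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
XML_DECL = re.compile(rb"<\?xml\s.*?\?>", re.DOTALL)


def split_documents(data):
    """Yield each concatenated XML document in `data` (bytes) as a BytesIO."""
    starts = [m.start() for m in XML_DECL.finditer(data)]
    if not starts or data[:starts[0]].strip():
        # No declaration at all, or content before the first one:
        # treat the leading part as a document of its own.
        starts.insert(0, 0)
    for begin, end in zip(starts, starts[1:] + [len(data)]):
        chunk = data[begin:end].strip()
        if chunk:
            yield io.BytesIO(chunk)


sample = (
    b'<?xml version="1.0"?><a><x/></a>\n'
    b'<?xml version="1.0"?><b><y/></b>\n'
)
for stream in split_documents(sample):
    for _event, elem in etree.iterparse(stream, events=("end",)):
        pass  # process each element here
    print(elem.tag)  # the last "end" event belongs to the root element
```

Each yielded stream starts with exactly one declaration, so iterparse no longer sees a declaration in the middle of a document.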

If you have an XML fragment with several top-level elements, you can simply wrap them in a single enclosing XML element and parse the result.
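The wrapping approach might look like the following sketch. It assumes the fragment is available as bytes and contains no XML declaration of its own; parse_fragment and the root tag name "wrapper" are hypothetical names chosen for illustration.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree


def parse_fragment(fragment, root_tag=b"wrapper"):
    """Parse bytes holding several top-level elements by adding one root."""
    wrapped = b"<" + root_tag + b">" + fragment + b"</" + root_tag + b">"
    return etree.fromstring(wrapped)


root = parse_fragment(b"<a>1</a><b>2</b>")
print([child.tag for child in root])  # ['a', 'b']
```

The original top-level elements become children of the artificial root, so they can be iterated over like any other child elements.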

Of course, as with most problems involving malformed XML input, the easiest solution is to fix whatever produces the bad input.


I used a regex to solve this problem. Suppose data is a string containing your many XML documents, and handle is a function that does something with each document. After this loop completes, data will be empty or contain an incomplete XML document, and handle will have been called zero or more times.

    import re

    while True:
        match = re.match(r'''
            \s*                     # ignore leading whitespace
            (                       # start first group
            (?:<\?xml.*?\?>\s*)?    # optional XML declaration
            <(?P<TAG>\S+).*?>       # opening tag (with optional attributes)
            .*?                     # stuff in the middle
            </(?P=TAG)>             # closing tag
            )                       # end of first xml document
            (?P<REM>.*)             # anything else
            ''', data, re.DOTALL | re.VERBOSE)
        if not match:
            break
        document = match.group(1)
        handle(document)
        data = match.group('REM')
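Because the loop leaves any incomplete document in data, the same idea can be fed from a file in chunks. The following is only a sketch under stated assumptions: iter_documents, DOCUMENT and chunk_size are hypothetical names, the pattern allows an optional XML declaration in front of the root element, and, like the loop above, it will mis-split documents whose root element contains a nested element with the same tag name.

```python
import io
import re

# Same pattern as in the loop above, allowing an optional XML declaration.
DOCUMENT = re.compile(r"""
    \s*                     # ignore leading whitespace
    (                       # start first group
    (?:<\?xml.*?\?>\s*)?    # optional XML declaration
    <(?P<TAG>\S+).*?>       # opening tag (with optional attributes)
    .*?                     # stuff in the middle
    </(?P=TAG)>             # closing tag
    )                       # end of first xml document
    (?P<REM>.*)             # anything else
    """, re.DOTALL | re.VERBOSE)


def iter_documents(stream, chunk_size=4096):
    """Yield complete documents from a text stream, reading it in chunks."""
    data = ""
    while True:
        chunk = stream.read(chunk_size)
        data += chunk
        while True:
            match = DOCUMENT.match(data)
            if not match:
                break  # nothing complete yet; wait for more input
            yield match.group(1)
            data = match.group("REM")
        if not chunk:
            break  # end of input; whatever is left in data is incomplete


stream = io.StringIO('<?xml version="1.0"?><a>1</a>\n<?xml version="1.0"?><b x="2">3</b>')
for document in iter_documents(stream, chunk_size=7):
    print(document)
```

Re-running the regex on the growing buffer is quadratic in the worst case, so this suits moderately sized inputs rather than very large files.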

Source: https://habr.com/ru/post/1347927/

