Python - lxml: remove some XML tags and move others

I am trying to open an XML file, remove some tags together with their contents, and move other tags to a different place in the document.

Here is my original XML input:

 <?xml version="1.0" encoding="UTF-8"?>
 <package>
   <language>en-GB</language>
   <video>
     <original_spoken_locale>en-US</original_spoken_locale>
     <copyright_cline>2012 copyright</copyright_cline>
     <release_date>2012-04-23</release_date>
     <title>Amazing Film</title>
   </video>
   <provider>testprovider</provider>
 </package>

I need to remove the <copyright_cline> tag and the <title> tag. Then I need to move the <provider> tag into <video>, placing it under <original_spoken_locale>, and also move the <release_date> tag below <video>.

Here is the XML output I want:

 <?xml version="1.0" encoding="UTF-8"?>
 <package>
   <language>en-GB</language>
   <video>
     <original_spoken_locale>en-US</original_spoken_locale>
     <provider>testprovider</provider>
     <release_date>2012-04-23</release_date>
   </video>
   <release_date>2012-04-23</release_date>
 </package>

I have successfully installed lxml, so ideally I am looking for a solution that uses it.

Thanks.


I managed to remove the unnecessary tags and their contents, but I still need to be able to reorder / move other tags around, preferably without replacement. I also have a problem removing this XML comment:

 <!--Carpet ID: fd54678--> 

Here is what I have now:

 from lxml import etree

 xmlFileIn = '/xmls/metadata.xml'
 xmlFileOut = '/xmls/output.xml'

 tree = etree.parse(xmlFileIn)
 root = tree.getroot()

 # strip_elements removes each matching element together with its contents;
 # strip_tags removes only the tag itself and keeps the contents.
 etree.strip_elements(root, 'assets')
 etree.strip_tags(root, 'assets')
 etree.strip_elements(root, 'chapters')
 etree.strip_tags(root, 'chapters')
 etree.strip_elements(root, 'xid')
 etree.strip_tags(root, 'xid')

 # Write the new xml file
 tree.write(xmlFileOut, pretty_print=True, xml_declaration=True, encoding="utf-8")

So I still need to remove the <!--Carpet ID: fd54678--> comment. I want to remove comments with a wildcard such as <!--.*-->, because there are many of them and the text inside them changes. I also need to know how to move blocks of tags around.

1 answer

Since no one has answered yet, I will try; but I am going by reading rather than by experiment. I apologize in advance if I have missed something.

For how to move elements, see "Move an entire element using lxml.etree".

As noted there, be especially careful, because text is not represented as nodes in lxml (see below).

As for comments, I could not find any way in lxml to select comments or to "move" elements directly. You could strip the comments with sed or a similar tool beforehand.
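As a minimal sketch of moving elements, assuming the sample XML from the question: in lxml an element can have only one parent, so adding an existing element somewhere else with `addnext()` moves it rather than copying it. The sketch parses with `remove_blank_text=True` (an assumption made here so that whitespace tails do not travel with the moved elements) and follows the textual description in the question, in which `<release_date>` ends up below `<video>`.

```python
from lxml import etree

xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<package>
  <language>en-GB</language>
  <video>
    <original_spoken_locale>en-US</original_spoken_locale>
    <copyright_cline>2012 copyright</copyright_cline>
    <release_date>2012-04-23</release_date>
    <title>Amazing Film</title>
  </video>
  <provider>testprovider</provider>
</package>"""

# Drop whitespace-only text so indentation is not carried around as tails.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(xml, parser)
video = root.find("video")

# Remove the unwanted elements together with their contents.
etree.strip_elements(video, "copyright_cline", "title")

# Move <provider> so it directly follows <original_spoken_locale>.
video.find("original_spoken_locale").addnext(root.find("provider"))

# Move <release_date> so it directly follows <video>.
video.addnext(video.find("release_date"))

print(etree.tostring(root, pretty_print=True).decode())
```

Because the tail text of an element moves with it, documents with meaningful text between tags need the extra care described further down.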
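For what it is worth, lxml does seem to expose comments: they appear in the tree as nodes whose tag is the `etree.Comment` factory, and the parser has a `remove_comments` option. The following is only a sketch on a made-up input (the second comment and the shortened document are illustrative, not from the question), showing two ways to drop every comment without matching on its text.

```python
from lxml import etree

xml = b"""<package>
  <!--Carpet ID: fd54678-->
  <language>en-GB</language>
  <video><!--another comment--><title>Amazing Film</title></video>
</package>"""

# Option 1: discard comments while parsing.
parser = etree.XMLParser(remove_comments=True)
root_a = etree.fromstring(xml, parser)

# Option 2: parse normally, then strip comment nodes from the tree.
# with_tail=False keeps any text that follows a removed comment.
root_b = etree.fromstring(xml)
etree.strip_elements(root_b, etree.Comment, with_tail=False)

print(etree.tostring(root_b).decode())
```

Either option removes all comments regardless of their contents, which covers the wildcard case in the question.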

Details

ElementTree, and therefore lxml, seem to be built around just one kind of node, the element. This has several consequences that can be problematic ("Everything should be as simple as possible, but not simpler"):

  • Working with comments (as in this case) or processing instructions is more difficult because they are not first-class concepts in the model.

  • Text is particularly tricky, because lxml and ElementTree make the text that follows the end tag of any XML element a property of that element (its tail). The tail is treated as being on a par with the element's tag, attributes, and child elements. It can be made to work (it is Turing-complete, after all), but it requires a completely different way of thinking.

I have noticed that writing about lxml often states that it is mainly meant for XML structures that do not contain much text. The example you gave seems to be of that kind; if so, you are lucky. But when the text matters, even in something as simple as:

  <p>As everyone<footnote>Well, almost everyone</footnote> knows...</p> 

the text " knows..." is part of the <footnote> node in lxml. When you move, delete, or replace the footnote, the text goes with it. But, of course, this text is not part of the footnote (it occurs after the footnote has ended).

I don't know what lxml does with "As everyone": it does not come immediately after the end of any element. I could not find anything about how lxml handles this.
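The tail behaviour can be seen in a short sketch using the footnote example. `remove_keep_tail` is a hypothetical helper written for this illustration, not part of lxml: before removing the element, it reattaches the tail to the previous sibling's tail or, when there is no previous sibling, to the parent's text.

```python
from lxml import etree

SOURCE = "<p>As everyone<footnote>Well, almost everyone</footnote> knows...</p>"

# Naive removal: the tail text is dropped together with the element.
p_naive = etree.fromstring(SOURCE)
p_naive.remove(p_naive.find("footnote"))
naive = etree.tostring(p_naive).decode()


def remove_keep_tail(el):
    """Remove el from its parent but keep the text that followed it."""
    parent = el.getparent()
    if el.tail:
        prev = el.getprevious()
        if prev is not None:
            # Append the tail to the previous sibling's tail.
            prev.tail = (prev.tail or "") + el.tail
        else:
            # No previous sibling: append the tail to the parent's text.
            parent.text = (parent.text or "") + el.tail
    parent.remove(el)


p_safe = etree.fromstring(SOURCE)
remove_keep_tail(p_safe.find("footnote"))
safe = etree.tostring(p_safe).decode()

print(naive)
print(safe)
```

The same precaution would apply before moving an element whose tail should stay where it is.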

Therefore, be very careful if there is text content anywhere.


Source: https://habr.com/ru/post/1484678/

