How to get specific nodes in an XML file using Python

I'm looking for a way to get specific tags from a very large XML document using Python's built-in DOM, e.g.:

    <AssetType longname="characters" shortname="chr" shortnames="chrs">
        <type> pub </type>
        <type> geo </type>
        <type> rig </type>
    </AssetType>
    <AssetType longname="camera" shortname="cam" shortnames="cams">
        <type> cam1 </type>
        <type> cam2 </type>
        <type> cam4 </type>
    </AssetType>

I want to get the values of the children of the AssetType node whose longname attribute is "characters", giving the result 'pub', 'geo', 'rig'.
Please keep in mind that I have over 1000 <AssetType> nodes.
Thanks in advance.

+4
6 answers

If you do not mind loading the entire document into memory:

    from lxml import etree

    data = etree.parse(fname)
    result = [node.text.strip()
              for node in data.xpath("//AssetType[@longname='characters']/type")]

You may need to remove the spaces at the beginning of your tags to make this work.

+2

Assuming your document is called assets.xml and has the following structure:

    <assets>
        <AssetType> ... </AssetType>
        <AssetType> ... </AssetType>
    </assets>

Then you can do the following:

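    from xml.etree.ElementTree import ElementTree

    tree = ElementTree()
    root = tree.parse("assets.xml")
    # Use a relative path (".//"): ElementTree rejects absolute paths on an element.
    # Attribute predicates need ElementTree 1.3 or later (Python 2.7+).
    for asset_type in root.findall(".//AssetType[@longname='characters']"):
        # Iterate the element directly; getchildren() was removed in Python 3.9.
        for type_node in asset_type:
            print(type_node.text.strip())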
+4

You can use the pulldom API to parse a large file without loading all of it into memory. It offers a more convenient interface than SAX, at a slight cost in performance.

It lets you stream through the XML file until you find the part you are interested in, and then switch to regular DOM operations on just that part.

    from xml.dom import pulldom

    # http://mail.python.org/pipermail/xml-sig/2005-March/011022.html
    def getInnerText(oNode):
        rc = ""
        nodelist = oNode.childNodes
        for node in nodelist:
            if node.nodeType == node.TEXT_NODE:
                rc = rc + node.data
            elif node.nodeType == node.ELEMENT_NODE:
                rc = rc + getInnerText(node)  # recursive !!!
            elif node.nodeType == node.CDATA_SECTION_NODE:
                rc = rc + node.data
            else:
                # node.nodeType: PROCESSING_INSTRUCTION_NODE, COMMENT_NODE,
                # DOCUMENT_NODE, NOTATION_NODE and so on
                pass
        return rc

    # xml_file is either a filename or a file
    stream = pulldom.parse(xml_file)
    for event, node in stream:
        if event == "START_ELEMENT" and node.nodeName == "AssetType":
            if node.getAttribute("longname") == "characters":
                stream.expandNode(node)  # node now contains a mini-dom tree
                type_nodes = node.getElementsByTagName('type')
                for type_node in type_nodes:
                    # type_text will hold the text inside the type element
                    type_text = getInnerText(type_node)
+3

Use xml.sax. Create your own handler, and inside startElement check whether the name is AssetType. That way, you act only when an AssetType node is being processed.

Here is an example handler that shows how to build one (it is not the prettiest approach; at the time I didn't know all the cool Python tricks ;-)).
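The handler approach described above could look roughly like the sketch below. It is only an illustration, not the handler the answer linked to: the class name AssetTypeHandler, the helper find_types and the SAMPLE document (wrapped in an assumed <assets> root) are made up for this example.

```python
import xml.sax


class AssetTypeHandler(xml.sax.ContentHandler):
    """Collect the text of <type> children of the matching <AssetType>."""

    def __init__(self, longname):
        super().__init__()
        self.longname = longname
        self.in_asset = False  # inside an <AssetType> with the wanted longname
        self.in_type = False   # inside a <type> of that element
        self.buffer = []       # text chunks of the current <type>
        self.types = []        # collected results

    def startElement(self, name, attrs):
        if name == "AssetType" and attrs.get("longname") == self.longname:
            self.in_asset = True
        elif name == "type" and self.in_asset:
            self.in_type = True
            self.buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self.in_type:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "type" and self.in_type:
            self.types.append("".join(self.buffer).strip())
            self.in_type = False
        elif name == "AssetType":
            self.in_asset = False


def find_types(xml_bytes, longname):
    """Hypothetical helper: parse XML bytes and return the matching type texts."""
    handler = AssetTypeHandler(longname)
    xml.sax.parseString(xml_bytes, handler)
    return handler.types


# Sample data modelled on the question, with an assumed <assets> root element.
SAMPLE = b"""<assets>
<AssetType longname="characters" shortname="chr" shortnames="chrs">
  <type> pub </type> <type> geo </type> <type> rig </type>
</AssetType>
<AssetType longname="camera" shortname="cam" shortnames="cams">
  <type> cam1 </type> <type> cam2 </type> <type> cam4 </type>
</AssetType>
</assets>"""

print(find_types(SAMPLE, "characters"))
```

For a large file on disk, xml.sax.parse(filename, handler) can be used in place of parseString, so the document is streamed rather than held in memory.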

+2

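You can use XPath, something like "//AssetType[@longname='characters']/xyz", where xyz is the name of the child element (type in the question). Note the @, which is required to match an attribute.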

For XPath libs in Python, see http://www.somebits.com/weblog/tech/python/xpath.html

+1

Similar to eswald's solution: again with the spaces removed and the whole document loaded into memory, but returning the three text elements of each matching asset as one list.

    from lxml import etree

    data = """<AssetType longname="characters" shortname="chr" shortnames="chrs">
      <type> pub </type>
      <type> geo </type>
      <type> rig </type>
    </AssetType>"""

    doc = etree.XML(data)
    for asset in doc.xpath('//AssetType[@longname="characters"]'):
        threetypes = [x.strip() for x in asset.xpath('./type/text()')]
        print(threetypes)
+1

Source: https://habr.com/ru/post/1300770/

