lxml etree.iterparse: limiting the depth

I am trying to parse some XML in the following format:

<label>
        <name></name>
        <sometag></sometag>
        <sublabels>
            <label></label>
            <label></label>
        </sublabels>
</label>

Parsing it like this

for event, element in etree.iterparse(gzip.GzipFile(f), events=('end', ), tag='label'):
    if event == 'end':
        name = element.xpath('name/text()')

leaves name empty for the nested label elements, because they match tag='label' too but have no name child:

<sublabels>
        <label></label>
        <label></label>
</sublabels>

Question:

Is there a way to limit the depth of iterparse, or to ignore the nested label elements, other than checking whether name is empty?

2 answers

This works for me and is inspired by the previous answer:

import gzip
from lxml import etree

name = None
level = 0  # nesting depth among <label> elements only
for event, element in etree.iterparse(gzip.GzipFile(f), events=('start', 'end'), tag='label'):
    # Update the current level
    if event == 'start':
        level += 1
    else:
        level -= 1
        # level drops back to 0 only when a top-level label closes
        if level == 0:
            name = element.xpath('name/text()')
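
The depth counter above can be tried without a real file. The following is only a sketch under a few assumptions: the sample data, the wrapping `<labels>` root, and the function name `top_level_names` are made up for illustration, and the tag is filtered by hand instead of with `tag='label'` so that the same loop also runs on the standard library's ElementTree if lxml is not installed.

```python
import gzip
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical sample data; the <labels> root is an assumption.
SAMPLE = b"""<labels>
  <label>
    <name>Parent A</name>
    <sublabels>
      <label><name>Child A1</name></label>
      <label></label>
    </sublabels>
  </label>
  <label>
    <name>Parent B</name>
  </label>
</labels>"""


def top_level_names(source):
    """Return the <name> text of every <label> not nested in another <label>."""
    names = []
    depth = 0  # number of <label> elements currently open
    for event, element in etree.iterparse(source, events=("start", "end")):
        if element.tag != "label":
            continue  # only <label> elements affect the depth
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 0:  # a top-level <label> has just closed
                names.append(element.findtext("name"))
    return names


# Gzip the sample in memory to mimic the compressed input file.
compressed = io.BytesIO(gzip.compress(SAMPLE))
print(top_level_names(gzip.GzipFile(fileobj=compressed)))
```

With this sample the nested `Child A1` label is skipped and only the two outer names are collected.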

Alternatively, if the file fits in memory, parse it whole and use XPath to select only the labels that are not inside sublabels:

import gzip
from lxml import etree

with gzip.open(f, 'rb') as fh:
    tree = etree.parse(fh)
# Skip every label that has a sublabels ancestor
name = tree.xpath('//label[not(ancestor::sublabels)]/name/text()')
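
For the whole-document route, here is a small self-contained sketch. It assumes, hypothetically, that the top-level labels are direct children of a `<labels>` root; the sample data and the name `top_level_names_in_memory` are illustrative only. It uses `findall` and `findtext`, which lxml and the standard library's ElementTree both provide, so it does not depend on full XPath support.

```python
import gzip

try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical sample data; the <labels> root is an assumption.
SAMPLE = b"""<labels>
  <label>
    <name>Parent A</name>
    <sublabels>
      <label><name>Child A1</name></label>
    </sublabels>
  </label>
  <label>
    <name>Parent B</name>
  </label>
</labels>"""


def top_level_names_in_memory(compressed_bytes):
    """Decompress, parse the whole document, and read top-level names."""
    root = etree.fromstring(gzip.decompress(compressed_bytes))
    # findall("label") matches direct children of the root only,
    # so labels nested under <sublabels> are never visited.
    return [label.findtext("name") for label in root.findall("label")]


print(top_level_names_in_memory(gzip.compress(SAMPLE)))
```

This trades memory for simplicity: the whole tree is built before anything is read, which is fine for small files but defeats the purpose of iterparse on large ones.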

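The first thing that came to mind is to keep track of the open tags and skip any label that sits inside sublabels:

import gzip
from lxml import etree

path = []  # tags of the elements that are currently open
for event, element in etree.iterparse(gzip.GzipFile(f), events=('start', 'end')):
    if event == 'start':
        path.append(element.tag)
    elif event == 'end':
        if element.tag == 'label':
            # Nested labels have 'sublabels' among their ancestors
            if 'sublabels' not in path:
                name = element.xpath('name/text()')
        path.pop()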

Source: https://habr.com/ru/post/1649516/

