How to parse an indentation hierarchy with Python

Question


I have an accounting tree that is indented with spaces in the source:

    Income
       Revenue
          IAP
          Ads
       Other-Income
    Expenses
       Developers
          In-house
          Contractors
       Advertising
       Other Expenses

There is a fixed number of levels, so I would like to flatten the hierarchy into 3 fields (the actual data has 6 levels; this is simplified for the example):

    L1        L2            L3
    Income
    Income    Revenue
    Income    Revenue       IAP
    Income    Revenue       Ads
    Income    Other-Income
    Expenses  Developers    In-house
    ... etc

I can do this by checking the number of spaces before the account name:

    for rownum in range(6, ws.max_row + 1):
        accountName = str(ws.cell(row=rownum, column=1).value)
        indent = len(accountName) - len(accountName.lstrip(' '))
        if indent == 0:
            l1 = accountName
            l2 = ''
            l3 = ''
        elif indent == 3:
            l2 = accountName
            l3 = ''
        else:
            l3 = accountName
        w.writerow([l1, l2, l3])

Is there a more flexible way to achieve this, based on the indentation of the current line compared to the previous line, rather than assuming that there are always 3 spaces per level? L1 always has no indentation, and we can assume that lower levels are indented further than their parent, but perhaps not always by 3 spaces per level.

Update: this is the logic I ended up with. Since I ultimately need the list of accounts with their contents, the easiest way was to use the indentation to decide whether to reset the list, append to it, or pop from it:

    if indent == 0:
        accountList = []
        accountList.append((indent, accountName))
    elif indent > prev_indent:
        accountList.append((indent, accountName))
    elif indent <= prev_indent:
        max_indent = int(max(accountList, key=itemgetter(0))[0])
        while max_indent >= indent:
            accountList.pop()
            max_indent = int(max(accountList, key=itemgetter(0))[0])
        accountList.append((indent, accountName))

This way, the list holds the complete account path for each line of output.

+5

python

Hart CO Aug 30 '17 at 15:46


2 answers

If the indentation is a fixed number of spaces (there are three spaces here), you can simplify the calculation of the indentation level.

note: I use StringIO to simulate a file

    import io
    import itertools

    content = u"""\
    Income
       Revenue
          IAP
          Ads
       Other-Income
    Expenses
       Developers
          In-house
          Contractors
       Advertising
       Other Expenses
    """

    stack = []
    for line in io.StringIO(content):
        content = line.rstrip()  # drop \n
        # Split on the three-space indentation unit: one empty item per level.
        row = content.split("   ")
        # Keep the parents above this level, then put the current name on top.
        stack[:] = stack[:len(row) - 1] + [row[-1]]
        print("\t".join(stack))

You get:

    Income
    Income      Revenue
    Income      Revenue         IAP
    Income      Revenue         Ads
    Income      Other-Income
    Expenses
    Expenses    Developers
    Expenses    Developers      In-house
    Expenses    Developers      Contractors
    Expenses    Advertising
    Expenses    Other Expenses
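Since the question asks for a fixed number of fields per row, the variable-length paths can be padded before writing them out. Here is a minimal sketch of that step; `pad_row` is a hypothetical helper name, the three-level limit comes from the simplified example, and the in-memory buffer only stands in for a real output file:

```python
import csv
import io


def pad_row(path, levels=3):
    """Pad a list of account names with empty strings up to `levels` fields."""
    if len(path) > levels:
        raise ValueError(f"path {path!r} is deeper than {levels} levels")
    return path + [""] * (levels - len(path))


# A few paths of the kind produced by the loop above (assumed sample).
paths = [
    ["Income"],
    ["Income", "Revenue"],
    ["Income", "Revenue", "IAP"],
]

buffer = io.StringIO()  # stands in for a real output file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["L1", "L2", "L3"])
for path in paths:
    writer.writerow(pad_row(path))

print(buffer.getvalue())
```

For the real data with 6 levels, `levels=6` and a matching header row would be the obvious adjustment.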

EDIT: indentation is not fixed

If the indentation is not fixed (you do not always have 3 spaces), as in the example below:

    content = u"""\
    Income
       Revenue
        IAP
        Ads
       Other-Income
    Expenses
      Developers
          In-house
          Contractors
      Advertising
      Other Expenses
    """

You need to work out the change in indentation on each new line:

    stack = []
    last_indent = u""
    for line in io.StringIO(content):
        # Leading spaces of the current line.
        indent = "".join(itertools.takewhile(lambda c: c == " ", line))
        # 0: same level, -1: one level up, +1: one level down.
        shift = 0 if indent == last_indent else (-1 if len(indent) < len(last_indent) else 1)
        index = len(stack) + shift
        stack[:] = stack[:index - 1] + [line.strip()]
        last_indent = indent
        print("\t".join(stack))
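As far as I can tell, the loop above moves up at most one level per line, so a line that closes two levels at once (for example, IAP followed directly by Expenses) would keep a stale parent. A possible variant, sketched here under that assumption, remembers the indentation width of every open level and pops all levels that are at least as deep as the current line. The name `iter_paths` and the sample text are made up for illustration:

```python
import io


def iter_paths(lines):
    """Yield the full path (list of names) for each non-blank line."""
    indents = []  # indentation width of each open level
    names = []    # account name of each open level
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue  # skip blank lines
        indent = len(line) - len(line.lstrip(" "))
        # Close every level that is indented at least as far as this line.
        while indents and indents[-1] >= indent:
            indents.pop()
            names.pop()
        indents.append(indent)
        names.append(line.strip())
        yield list(names)


# Hypothetical input with irregular indentation and a two-level dedent.
sample = """\
Income
  Revenue
       IAP
Expenses
    Developers
"""

for path in iter_paths(io.StringIO(sample)):
    print("\t".join(path))
```

Note that this sketch does not report inconsistent dedents: a line that falls between two known widths is simply attached to the nearest shallower parent.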

+2

Laurent laporte Aug 30 '17 at 16:17


Right leg · Accepted Answer · 2017-08-30T15:54:39+0000

You can imitate how Python actually parses indentation. First create a stack that will contain the indentation levels. Then, for each line:

If the indentation is greater than the top of the stack, push it and increase the depth level.
If it is the same, continue at the same level.
If it is lower, pop the top of the stack while it is greater than the new indentation. If you reach a lower indentation level without finding an equal one, then there is an indentation error.

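    indentation = [0]  # stack of indentation widths; the top level has none
    depth = 0

    with open("test.txt") as f:
        for line in f:
            line = line.rstrip()  # drop the newline and trailing whitespace
            if not line:
                continue  # ignore blank lines

            content = line.lstrip()
            indent = len(line) - len(content)
            if indent > indentation[-1]:
                # Deeper than the current level: open a new level.
                depth += 1
                indentation.append(indent)

            elif indent < indentation[-1]:
                # Shallower: close levels until a known width is reached.
                while indent < indentation[-1]:
                    depth -= 1
                    indentation.pop()

                if indent != indentation[-1]:
                    raise RuntimeError("Bad formatting")

            print(f"{content} (depth: {depth})")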

With a file "test.txt" containing the data you provided:

    Income
       Revenue
          IAP
          Ads
       Other-Income
    Expenses
       Developers
          In-house
          Contractors
       Advertising
       Other Expenses

Here is the result:

    Income (depth: 0)
    Revenue (depth: 1)
    IAP (depth: 2)
    Ads (depth: 2)
    Other-Income (depth: 1)
    Expenses (depth: 0)
    Developers (depth: 1)
    In-house (depth: 2)
    Contractors (depth: 2)
    Advertising (depth: 1)
    Other Expenses (depth: 1)

So what can you do with this? Suppose you want to create nested lists. First create a data stack.

When you find an indent, push a new list onto the data stack.
When you find a dedent, pop the top list and append it to the new top.

In either case, for each line, append the content to the list at the top of the data stack.

Here is the relevant implementation:

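    indentation = [0]  # stack of indentation widths
    depth = 0
    data = [[]]        # data stack; data[0] will hold the whole tree

    with open("test.txt") as f:
        for line in f:
            line = line.rstrip()  # drop the newline and trailing whitespace
            if not line:
                continue  # ignore blank lines

            content = line.lstrip()
            indent = len(line) - len(content)
            if indent > indentation[-1]:
                depth += 1
                indentation.append(indent)
                data.append([])  # start the list of children

            elif indent < indentation[-1]:
                while indent < indentation[-1]:
                    depth -= 1
                    indentation.pop()
                    top = data.pop()
                    data[-1].append(top)  # attach finished children to parent

                if indent != indentation[-1]:
                    raise RuntimeError("Bad formatting")

            data[-1].append(content)

    # Close the levels that are still open at the end of the file.
    while len(data) > 1:
        top = data.pop()
        data[-1].append(top)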

The nested list is at the top of the data stack. Output for the same file:

    ['Income',
        ['Revenue',
            ['IAP',
             'Ads'
            ],
         'Other-Income'
        ],
     'Expenses',
        ['Developers',
            ['In-house',
             'Contractors'
            ],
         'Advertising',
         'Other Expenses'
        ]
    ]

This is fairly easy to manipulate, although quite deeply nested. You can access the data by chaining item accesses:

    >>> l = data[0]
    >>> l
    ['Income', ['Revenue', ['IAP', 'Ads'], 'Other-Income'], 'Expenses', ['Developers', ['In-house', 'Contractors'], 'Advertising', 'Other Expenses']]
    >>> l[1]
    ['Revenue', ['IAP', 'Ads'], 'Other-Income']
    >>> l[1][1]
    ['IAP', 'Ads']
    >>> l[1][1][0]
    'IAP'
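To get back from this nested list to the flat one-row-per-account form that the question asks for, the structure can be walked recursively. The following is only a sketch: `walk` is a hypothetical helper, and it relies on the convention used above that a sub-list holds the children of the name immediately before it. The `tree` literal stands in for `data[0]`:

```python
def walk(nested, prefix=()):
    """Yield a tuple path for every name in the nested list."""
    current = None
    for item in nested:
        if isinstance(item, list):
            # A sub-list holds the children of the name just before it.
            parent = prefix + (current,) if current is not None else prefix
            yield from walk(item, parent)
        else:
            current = item
            yield prefix + (item,)


# Same shape as data[0] in the session above.
tree = ['Income', ['Revenue', ['IAP', 'Ads'], 'Other-Income'],
        'Expenses', ['Developers', ['In-house', 'Contractors'],
                     'Advertising', 'Other Expenses']]

for path in walk(tree):
    print(" > ".join(path))
```

Each yielded tuple is the full path of one account, so it can be padded to a fixed number of fields and written out as a row.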
