Inspired by @schmichael's appreciation of the functional Python solution, here is my attempt at pushing things too far. I make no claims that it is maintainable, efficient, exemplary, or sane, but it is functional:
    from itertools import imap, groupby, izip, chain
    from collections import deque
    from operator import itemgetter, methodcaller
    from functools import partial

    def shifty_csv_dicts(lines):
        last = lambda seq: deque(seq, maxlen=1).pop()
        parse_header = lambda header: header[1:-1].split(',')
        parse_row = lambda row: row.rstrip('\n').split(',')
        mkdict = lambda keys, vals: dict(izip(keys, vals))

        headers_then_rows = imap(itemgetter(1),
                                 groupby(lines, methodcaller('startswith', '#')))
        return chain.from_iterable(
            imap(partial(mkdict, parse_header(last(headers))),
                 imap(parse_row, next(headers_then_rows)))
            for headers in headers_then_rows)
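The function above is Python 2 (imap and izip live in itertools there). As a side note, here is a sketch of what a Python 3 translation might look like, since map and zip are already lazy in Python 3; the sample input lines at the bottom are my own invention:

```python
from itertools import groupby, chain
from collections import deque
from operator import itemgetter, methodcaller
from functools import partial

def shifty_csv_dicts(lines):
    # Python 3: map and zip are lazy, so imap/izip are unnecessary.
    last = lambda seq: deque(seq, maxlen=1).pop()
    parse_header = lambda header: header[1:-1].split(',')
    parse_row = lambda row: row.rstrip('\n').split(',')
    mkdict = lambda keys, vals: dict(zip(keys, vals))

    headers_then_rows = map(itemgetter(1),
                            groupby(lines, methodcaller('startswith', '#')))
    return chain.from_iterable(
        map(partial(mkdict, parse_header(last(headers))),
            map(parse_row, next(headers_then_rows)))
        for headers in headers_then_rows)

# Made-up sample input: two header lines, each followed by data rows.
lines = ['#a,b\n', '1,2\n', '3,4\n', '#c,d\n', '5,6\n']
print(list(shifty_csv_dicts(lines)))
# → [{'a': '1', 'b': '2'}, {'a': '3', 'b': '4'}, {'c': '5', 'd': '6'}]
```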
OK, let's unpack it.
The basic idea is to (ab)use itertools.groupby to recognize the switches from header lines to data rows. We exploit argument evaluation semantics to control the order of operations.
First, we tell groupby to group the lines by whether or not they start with '#':
methodcaller('startswith', '#')
creates a function that takes a line and calls line.startswith('#') on it (it is equivalent to, but more efficient than, the stylistically preferable lambda line: line.startswith('#')).
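For instance, a quick sanity check of the methodcaller version (the sample lines are made up):

```python
from operator import methodcaller

# Builds a callable that invokes .startswith('#') on its argument.
is_header = methodcaller('startswith', '#')
print(is_header('#a,b'))  # → True
print(is_header('1,2'))   # → False
```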
So groupby takes the incoming iterable of lines and alternates between returning an iterator of header lines (usually just one header) and an iterator of data rows. It actually returns (group_val, group_iter) tuples, where in this case group_val is a bool indicating whether the group is a header. So we do the equivalent of (group_val, group_iter)[1] on each tuple to pick out the iterators: itemgetter(1) is just a function that runs "[1]" on whatever you give it (again, equivalent to, but more efficient than, lambda t: t[1]). We then use imap to apply our itemgetter function to every tuple returned by groupby, plucking out the header/data iterators:
imap(itemgetter(1), groupby(lines, methodcaller('startswith', '#')))
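To see what this expression produces on its own, here is a small sketch (using Python 3's lazy map in place of imap; the sample lines are made up):

```python
from itertools import groupby
from operator import itemgetter, methodcaller

lines = ['#a,b', '1,2', '3,4', '#c,d', '5,6']
groups = map(itemgetter(1),  # drop the bool key, keep each group iterator
             groupby(lines, methodcaller('startswith', '#')))
# Each group must be consumed before advancing to the next: groupby
# shares one underlying iterator, and advancing invalidates old groups.
print([list(g) for g in groups])
# → [['#a,b'], ['1,2', '3,4'], ['#c,d'], ['5,6']]
```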
We evaluate this expression first and give it a name because we will use it twice later: first for the headers, then for the data. The outermost call:
chain.from_iterable(... for headers in headers_then_rows)
iterates over the iterators returned by groupby. We're being sneaky in calling the value headers, because some other code inside the ... picks off the rows when we're not looking, advancing the groupby iterator in the process. This outer generator expression only ever produces headers (remember how they alternate: headers, data, headers, data, ...). The trick is making sure the headers get consumed before the rows, because both share the same underlying iterator. chain.from_iterable simply stitches the results of all the data-row iterators together into one iterator over them all.
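chain.from_iterable itself is simple enough to show in isolation, with made-up inputs:

```python
from itertools import chain

# Flattens an iterable of iterables into one continuous iterator.
flat = chain.from_iterable([['a', 'b'], ['c'], ['d', 'e']])
print(list(flat))  # → ['a', 'b', 'c', 'd', 'e']
```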
So what are we stitching together? Well, we need to take the (last) header, zip it with each row of values, and make dicts out of that. This:
last = lambda seq: deque(seq, maxlen=1).pop()
is a slightly dirty but effective hack to get the last element of an iterator, in this case our header line. We then parse the header by trimming the leading # and the trailing newline, and splitting on commas to get a list of column names:
parse_header = lambda header: header[1:-1].split(',')
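A minimal sketch of both helpers (the sample header is made up):

```python
from collections import deque

last = lambda seq: deque(seq, maxlen=1).pop()
parse_header = lambda header: header[1:-1].split(',')

# A deque with maxlen=1 keeps only the final element it sees.
print(last(iter(['x', 'y', 'z'])))       # → z
print(parse_header('#name,age,city\n'))  # → ['name', 'age', 'city']
```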
But we only want to do this once per iterator of rows, because it exhausts the headers iterator (and we surely don't want to copy it into some sort of mutable state, do we?). We must also ensure the headers iterator is consumed before the rows. The solution is a partially applied function, which evaluates and fixes the headers as the first parameter and takes a row as the second parameter:
partial(mkdict, parse_header(last(headers)))
The mkdict function uses the column names as keys and the row data as values to make a dict:
mkdict = lambda keys, vals: dict(izip(keys,vals))
This gives us a function that freezes the first parameter (keys) and lets us just pass the second parameter (vals): exactly what we need to create a bunch of dicts with the same keys and different values.
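Sketching that partial application with made-up keys and rows (zip stands in for izip on Python 3):

```python
from functools import partial

mkdict = lambda keys, vals: dict(zip(keys, vals))

# Freeze the keys once; each call then only needs a row of values.
row_to_dict = partial(mkdict, ['a', 'b'])
print(row_to_dict(['1', '2']))  # → {'a': '1', 'b': '2'}
print(row_to_dict(['3', '4']))  # → {'a': '3', 'b': '4'}
```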
To use it, we parse each row as you'd expect:
parse_row = lambda row: row.rstrip('\n').split(',')
recalling that next(headers_then_rows) will return an iterator of data rows from groupby (since we already consumed the iterator of headers):
imap(parse_row, next(headers_then_rows))
Finally, we map our partially applied dict-maker function over the parsed rows:
imap(partial(...), imap(parse_row, next(headers_then_rows)))
And it is all stitched together with chain.from_iterable to make one big, happy, functional stream of shifty CSV dicts.
For the record, this could probably be simplified, and I would still do things @schmichael's way. But figuring this out taught me a lot, and I will try to apply these ideas to a Scala solution.