Create pandas DataFrame from generator?

I wrote a tuple generator that extracts information from a file, keeps only the records of interest, and converts them to tuples, which the generator yields.

I am trying to create a DataFrame from it with:

    import pandas as pd
    df = pd.DataFrame.from_records(tuple_generator, columns=tuple_fields_name_list)

but it gives an error:

    ...
    C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
       1046                 values.append(row)
       1047                 i += 1
    -> 1048                 if i >= nrows:
       1049                     break
       1050
    TypeError: unorderable types: int() >= NoneType()

I managed to make it work by consuming the generator into a list first, but that uses twice the memory:

    df = pd.DataFrame.from_records(list(tuple_generator), columns=tuple_fields_name_list)

The files I want to load are large, and memory consumption matters. On my last attempt the computer spent two hours thrashing virtual memory :(

Question: Does anyone know a way to create a DataFrame from a record generator directly, without first converting it to a list?

Note: I am using Python 3.3 and pandas 0.12 with Anaconda on Windows.

Update:

Reading the file is not the problem; my tuple generator handles it well. It scans a compressed text file of mixed records line by line, converts only the needed data to the correct types, and yields the fields as tuples. Some numbers: it scans 2,111,412 entries in a 130 MB compressed file (about 6.5 GB uncompressed) in roughly a minute and with very little memory.
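For illustration, a minimal sketch of what such a generator might look like (the file name, delimiter, filter condition, and field types here are assumptions, not the real code):

    import gzip

    def gen_tuples(path='records.txt.gz'):
        # sketch: scan a gzip-compressed text file line by line,
        # keep only the records of interest and yield typed tuples
        with gzip.open(path, 'rt') as fh:
            for line in fh:
                fields = line.rstrip('\n').split(',')
                if fields[0] != 'WANTED':              # hypothetical filter condition
                    continue
                yield (fields[0], int(fields[1]), float(fields[2]))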

Pandas 0.12 does not accept generators; the dev version accepts them, but it materializes the entire generator into a list and then converts that list to a frame. This is still inefficient, but it is something that would have to be fixed inside pandas. Meanwhile, I will have to think about buying some more memory.

+17
python pandas
Sep 20 '13 at 11:42
4 answers

You cannot create a DataFrame from a generator with the 0.12 version of pandas. You can either update to the development version (get it from GitHub and compile it, which is a little painful on Windows, but I would prefer this option).

Or, since you said that you are filtering the lines, you can first filter them, write them to a file, and then load that file with read_csv or something similar, for example:
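A rough sketch of that idea (the file names, the header, and filter_line are assumptions standing in for your own logic):

    import pandas as pd

    # write only the filtered lines to an intermediate CSV, then let read_csv do the work
    with open('raw_records.txt') as src, open('filtered.csv', 'w') as dst:
        dst.write('col1,col2\n')          # header line for read_csv
        for line in src:
            if filter_line(line):         # your existing filtering logic
                dst.write(line)

    df = pd.read_csv('filtered.csv')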

If you want to go the super-hacky way, you can create a file-like object that returns the lines:

    def gen():
        lines = ['col1,col2\n', 'foo,bar\n', 'foo,baz\n', 'bar,baz\n']
        for line in lines:
            yield line

    class Reader(object):
        # file-like wrapper: read_csv only needs a read() method
        def __init__(self, g):
            self.g = g
        def read(self, n=0):
            try:
                return next(self.g)    # one line per read() call
            except StopIteration:
                return ''              # '' signals end of file

And then use read_csv:

    >>> pd.read_csv(Reader(gen()))
      col1 col2
    0  foo  bar
    1  foo  baz
    2  bar  baz
+11
Sep 20 '13 at 12:09

To be memory efficient, read in chunks. Something like this, using Victor's Reader class above:

    df = pd.concat(list(pd.read_csv(Reader(gen()), chunksize=10000)))
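If even the concatenated frame would be too big, a variation of the same idea (just a sketch; the groupby aggregation is a placeholder) is to reduce each chunk as it arrives and keep only the partial results:

    partial = []
    for chunk in pd.read_csv(Reader(gen()), chunksize=10000):
        # only one chunk is in memory at a time
        partial.append(chunk.groupby('col1').size())
    counts = pd.concat(partial).groupby(level=0).sum()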
+5
Sep 20 '13 at 13:10

You certainly can build a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and possibly earlier). Don't use .from_records(); just use the constructor, for example:

    import pandas as pd
    someGenerator = ( (x, chr(x)) for x in range(48, 127) )
    someDf = pd.DataFrame(someGenerator)

It produces:

    type(someDf)      # pandas.core.frame.DataFrame

    someDf.dtypes
    # 0     int64
    # 1    object
    # dtype: object

    someDf.tail(10)
    #       0  1
    # 69  117  u
    # 70  118  v
    # 71  119  w
    # 72  120  x
    # 73  121  y
    # 74  122  z
    # 75  123  {
    # 76  124  |
    # 77  125  }
    # 78  126  ~
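If you want named columns, you can pass them to the constructor as well (a small sketch; the names are just examples):

    someGenerator = ( (x, chr(x)) for x in range(48, 127) )   # fresh generator, the one above is exhausted
    someDf = pd.DataFrame(someGenerator, columns=['code', 'char'])
    someDf.head(3)
    #    code char
    # 0    48    0
    # 1    49    1
    # 2    50    2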
+2
Apr 27 '17 at 15:04

You can also use something like this (tested with Python 2.7.5):

    import pandas as pd
    from itertools import izip

    def dataframe_from_row_iterator(row_iterator, colnames):
        # transpose the rows into per-column iterators
        col_iterator = izip(*row_iterator)
        return pd.DataFrame({cn: cv for (cn, cv) in izip(colnames, col_iterator)})
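A small usage sketch (the rows and the column names are just examples):

    rows = iter([(1, 'a'), (2, 'b'), (3, 'c')])
    df = dataframe_from_row_iterator(rows, ['num', 'letter'])
    df['num'].tolist()       # [1, 2, 3]
    df['letter'].tolist()    # ['a', 'b', 'c']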

You can also adapt this to add rows to a DataFrame.

- Edit, December 4th: s/row/rows in the last line

-1
Oct 29 '13 at 18:26


