I created a tuple generator that extracts information from a file, keeping only the records of interest and converting each one to the tuple that the generator yields.
I am trying to create a DataFrame from it:
import pandas as pd
df = pd.DataFrame.from_records(tuple_generator, columns = tuple_fields_name_list)
but it raises an error:
...
C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
   1046                 values.append(row)
   1047                 i += 1
-> 1048                 if i >= nrows:
   1049                     break
   1050
TypeError: unorderable types: int() >= NoneType()
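Reading the traceback, the comparison `i >= nrows` appears to run while `nrows` is still `None`, which would be the case if `None` is its default and no value is passed (an assumption inferred from the error message, not checked against the pandas source). Python 3 refuses to order an `int` against `None`, which a minimal sketch outside pandas can reproduce:

```python
# Minimal reproduction of the comparison from the traceback, outside pandas.
# Assumption: nrows is None because it was not passed to from_records.
nrows = None
i = 1

error = None
try:
    i >= nrows  # Python 3 does not define an ordering between int and None
except TypeError as exc:
    error = exc

# The exact wording of the message differs between Python versions.
print(type(error).__name__, error)
```

Python 2 allowed this comparison, which may be why the failure only shows up under Python 3.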
I managed to make it work by consuming the generator into a list, but that uses twice the memory:
df = pd.DataFrame.from_records(list(tuple_generator), columns = tuple_fields_name_list)
The files I want to load are large, and memory consumption matters. On my last attempt, my computer spent two hours trying to increase virtual memory :(
Question: Does anyone know a way to create a DataFrame from a record generator directly, without first converting it to a list?
Note: I am using Python 3.3 and pandas 0.12 with Anaconda on Windows.
Update:
Reading the file is not the problem. My tuple generator handles that well: it scans a compressed text file of mixed records line by line, converts only the necessary data to the correct types, and yields the fields as tuples. Some numbers: it scans 2111412 entries in a 130 MB file (about 6.5 GB uncompressed) in about a minute, using little memory.
pandas 0.12 does not accept generators. The dev version does, but it puts the entire generator into a list and then converts that list to a frame. This is inefficient, but it is something that has to be fixed inside pandas. Meanwhile, I have to think about buying some more memory.
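A possible workaround in the meantime is to consume the generator in fixed-size chunks, so that only one chunk of Python tuples exists at a time and the rest is held in the more compact DataFrame form. The sketch below is only an illustration: the helper name `frame_from_generator` and the `chunk_size` default are hypothetical, and it has not been tried on pandas 0.12 specifically.

```python
from itertools import islice

import pandas as pd


def frame_from_generator(records, columns, chunk_size=100000):
    """Build a DataFrame from an iterable of tuples, one chunk at a time.

    Hypothetical helper: the name and the chunk_size default are illustrative.
    """
    records = iter(records)
    frames = []
    while True:
        # Materialize at most chunk_size tuples as a list.
        chunk = list(islice(records, chunk_size))
        if not chunk:
            break
        frames.append(pd.DataFrame.from_records(chunk, columns=columns))
    if not frames:
        # Empty input: return an empty frame with the requested columns.
        return pd.DataFrame(columns=columns)
    # concat copies the data, so the chunk frames and the result
    # coexist briefly at this point.
    return pd.concat(frames, ignore_index=True)


# Usage with a small stand-in generator.
example_generator = ((n, n * 0.5, str(n)) for n in range(10))
df = frame_from_generator(example_generator, ["a", "b", "c"], chunk_size=3)
```

The final `pd.concat` still needs room for both the chunk frames and the combined result, so this reduces the overhead of the intermediate list of tuples rather than removing the extra copy entirely.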