I am trying to combine two data frames in pandas. One of them (d1 in this example) is too big for my computer's memory, so I read it with read_csv using the iterator argument.
Say I have two data frames:
d1 = pd.DataFrame(
    {"col1": [1,2,3,4,5,6,7,8,9],
     "col2": [5,4,3,2,5,43,2,5,6],
     "col3": [10,10,10,10,10,4,10,10,10]},
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])
d2 = pd.DataFrame(
    {"yes/no": [1,0,1,0,1,1,1,0,0]},
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])
I need to combine them so that each row captures all the data for each person, so the equivalent is:
pd.concat((d1,d2), axis=1,join="outer")
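To make the target concrete, here is a self-contained sketch of the small in-memory case. The `people` list and the `combined` name are only for illustration; the data is the same as above. The result should have one row per person and all four columns side by side.

```python
import pandas as pd

# Shared index labels (helper name `people` is just for this sketch).
people = ["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"]

d1 = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "col2": [5, 4, 3, 2, 5, 43, 2, 5, 6],
        "col3": [10, 10, 10, 10, 10, 4, 10, 10, 10],
    },
    index=people,
)
d2 = pd.DataFrame({"yes/no": [1, 0, 1, 0, 1, 1, 1, 0, 0]}, index=people)

# axis=1 places the frames side by side, aligning rows on the index labels.
combined = pd.concat((d1, d2), axis=1, join="outer")

print(combined.shape)          # rows x columns of the combined frame
print(list(combined.columns))  # column names after the concat
```

Because both frames share exactly the same index labels here, every row lines up and no NaN values appear.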
But since I cannot fit d1 in memory, I use read_csv. (I already processed the huge file and saved it in .csv format, so imagine my data frame d1 is stored in test.csv.)
itera = pd.read_csv("test.csv",index_col="index",iterator=True,chunksize=2)
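For anyone who wants to reproduce this, the following is a minimal sketch of how test.csv could be recreated and read back in chunks. It assumes the index was saved under a column named "index" (which is what index_col="index" implies); the temporary directory and the names `csv_path`, `chunks`, and `chunk_sizes` are only for illustration.

```python
import os
import tempfile

import pandas as pd

people = ["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"]
d1 = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "col2": [5, 4, 3, 2, 5, 43, 2, 5, 6],
        "col3": [10, 10, 10, 10, 10, 4, 10, 10, 10],
    },
    index=people,
)

# Assumption: the index was written to the CSV under the header "index".
csv_path = os.path.join(tempfile.mkdtemp(), "test.csv")
d1.to_csv(csv_path, index_label="index")

# With chunksize set, read_csv returns an iterator of DataFrames,
# each holding at most `chunksize` rows.
chunks = list(pd.read_csv(csv_path, index_col="index", chunksize=2))
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)
```

With nine rows and chunksize=2, the iterator yields five pieces, the last one holding a single row. Collecting the chunks into a list is only done here to inspect them; with the real file they would be consumed one at a time.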
But when I do
for i in itera:
    d2 = pd.concat((d2,i), axis=1,join="outer")
the result is just one data frame appended to the other, rather than the columns being lined up per person.
My output is as follows:
       col1  col2  col3  yes/no
one     NaN   NaN   NaN     1.0
two     NaN   NaN   NaN     0.0
three   NaN   NaN   NaN     1.0
four    NaN   NaN   NaN     0.0
five    NaN   NaN   NaN     1.0
six     NaN   NaN   NaN     1.0
seven   NaN   NaN   NaN     1.0
eight   NaN   NaN   NaN     0.0
nine    NaN   NaN   NaN     0.0
one     1.0   5.0  10.0     NaN
two     2.0   4.0  10.0     NaN
three   3.0   3.0  10.0     NaN
four    4.0   2.0  10.0     NaN
five    5.0   5.0  10.0     NaN
six     6.0  43.0   4.0     NaN
seven   7.0   2.0  10.0     NaN
eight   8.0   5.0  10.0     NaN
nine    9.0   6.0  10.0     NaN
Hope my question makes sense :)