Merging data with iteration using pandas

I am trying to combine two data frames in pandas, loading one of them with read_csv. One of my data frames (d1 in this example) is too big for my computer's memory, so I use the iterator argument of read_csv.

Say I have two data frames

    import pandas as pd

    names = ["paul", "peter", "lauren", "dave", "bill",
             "steve", "old-man", "bob", "tim"]

    d1 = pd.DataFrame(
        {
            "col1": [1, 2, 3, 4, 5, 6, 7, 8, 9],
            "col2": [5, 4, 3, 2, 5, 43, 2, 5, 6],
            "col3": [10, 10, 10, 10, 10, 4, 10, 10, 10],
        },
        index=names,
    )
    d2 = pd.DataFrame({"yes/no": [1, 0, 1, 0, 1, 1, 1, 0, 0]}, index=names)

I need to combine them so that each row holds all the data for one person. If both frames fit in memory, the equivalent would be:

 pd.concat((d1,d2), axis=1,join="outer") 

But since I cannot fit d1 in memory, I used read_csv (I had already processed the huge file and saved it in .csv format, so imagine that d1 lives in the file test.csv).

 itera = pd.read_csv("test.csv",index_col="index",iterator=True,chunksize=2) 

But when I do

    for i in itera:
        d2 = pd.concat((d2, i), axis=1, join="outer")

the result is the first data frame with the second one appended underneath it, rather than the two merged row by row.

My output is as follows:

           col1  col2  col3  yes/no
    one     NaN   NaN   NaN     1.0
    two     NaN   NaN   NaN     0.0
    three   NaN   NaN   NaN     1.0
    four    NaN   NaN   NaN     0.0
    five    NaN   NaN   NaN     1.0
    six     NaN   NaN   NaN     1.0
    seven   NaN   NaN   NaN     1.0
    eight   NaN   NaN   NaN     0.0
    nine    NaN   NaN   NaN     0.0
    one     1.0   5.0  10.0     NaN
    two     2.0   4.0  10.0     NaN
    three   3.0   3.0  10.0     NaN
    four    4.0   2.0  10.0     NaN
    five    5.0   5.0  10.0     NaN
    six     6.0  43.0   4.0     NaN
    seven   7.0   2.0  10.0     NaN
    eight   8.0   5.0  10.0     NaN
    nine    9.0   6.0  10.0     NaN

Hope my question makes sense :)

1 answer

I think you are looking for the combine_first method. It fills in d1 with the values from each chunk produced by the read_csv iterator, aligning rows on the index.

    import pandas as pd
    from io import StringIO  # Python 3; the Python 2 module was StringIO

    d1 = pd.DataFrame(
        {
            "col1": [1, 2, 3, 4, 5, 6, 7, 8, 9],
            "col2": [5, 4, 3, 2, 5, 43, 2, 5, 6],
            "col3": [10, 10, 10, 10, 10, 4, 10, 10, 10],
        },
        index=["paul", "peter", "lauren", "dave", "bill",
               "steve", "old-man", "bob", "tim"],
    )

    # d2 as an in-memory string so it can be read with pd.read_csv;
    # the first column holds the index labels
    d2 = StringIO("""name y/n
    paul 1
    peter 0
    lauren 1
    dave 0
    bill 1
    steve 1
    old-man 1
    bob 0
    tim 0
    """.replace("\n    ", "\n"))

    # For each chunk, fill d1 with the chunk's data (rows align on the index)
    for chunk in pd.read_csv(d2, sep=" ", index_col="name",
                             iterator=True, chunksize=1):
        d1 = d1.combine_first(chunk[["y/n"]])

    # Number formatting: the column is float while it still contains NaN
    d1["y/n"] = d1["y/n"].astype(int)

This leaves d1 looking like:

             col1  col2  col3  y/n
    bill        5     5    10    1
    bob         8     5    10    0
    dave        4     2    10    0
    lauren      3     3    10    1
    old-man     7     2    10    1
    paul        1     5    10    1
    peter       2     4    10    0
    steve       6    43     4    1
    tim         9     6    10    0
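In the question it is d1, not d2, that is too large for memory, so the same idea can also be applied the other way round. This is a minimal sketch, not the method from the answer above. It assumes the big table is stored in a CSV whose index column is named "index", as in the question's read_csv call, and `big_csv` is a hypothetical in-memory stand-in for test.csv so the example runs on its own. Each chunk is joined with the small frame, and the joined chunks are then stacked row-wise.

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-in for the large test.csv from the question
# (assumption: the index was saved under the column name "index").
big_csv = StringIO(
    "index,col1,col2,col3\n"
    "paul,1,5,10\n"
    "peter,2,4,10\n"
    "lauren,3,3,10\n"
    "dave,4,2,10\n"
)

# The small frame that fits in memory.
d2 = pd.DataFrame({"yes/no": [1, 0, 1, 0]},
                  index=["paul", "peter", "lauren", "dave"])

pieces = []
for chunk in pd.read_csv(big_csv, index_col="index", chunksize=2):
    # Attach the matching d2 rows to this chunk only (left join on the index).
    pieces.append(chunk.join(d2, how="left"))

# Stack the joined chunks row-wise; axis=0 is the default for pd.concat.
result = pd.concat(pieces)
print(result)
```

Because this is a left join, rows of d2 whose label never appears in the big file are dropped, unlike with join="outer". If the combined result is itself too large for memory, each joined chunk could be written to disk instead of being collected in a list.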

Source: https://habr.com/ru/post/1273861/

