Do you consider using pandas ? If you are familiar with R, then the data frames should be pretty simple. The following gives you what you want:
from pandas import merge, read_table csv1 = read_table('CSV1.csv', sep=r"[;,]", header=None) csv2 = read_table('CSV2.csv', sep=r"[,]", header=None) print csv1 print csv2
Notice that I replaced the tabs with commas and split them into half-columns. The conclusion so far should be:
0 1 2 3 4 5 6 0 data13 data23 d main_data1 main_data2 data13 data23 1 data12 data22 d main_data1 main_data2 data12 data22 2 data11 data21 d main_data1 main_data2 data11 data21 3 data3 data4 d main_data2 main_data4 data3 data4 4 data52 data62 d main_data3 NaN data51 data62 5 data51 data61 d main_data3 NaN main_data3 data61 6 data7 data8 d main_data4 NaN data7 data8 [7 rows x 7 columns] 0 1 2 3 4 0 id1 main_data1 a1 a2 a3 1 id2 main_data2 b1 b2 b3 2 id3 main_data3 c1 c2 c3 3 id4 main_data4 d1 d2 d3 4 id5 main_data5 e1 e2 e3 [5 rows x 5 columns]
Using the left join:
kw1 = dict(how='left', \ left_on=[3,4], \ right_on=[1,1], \ suffixes=('l', 'r')) df1 = merge(csv1, csv2, **kw1) df1.drop_duplicates(cols=[3], inplace=True) print df1[[0,7]]
Gives the zero and seventh merge columns:
3 5 0 main_data1 data13 3 main_data2 data3 4 main_data3 data51 6 main_data4 data7 [4 rows x 2 columns]
And to give the result how you want it, do another merge (this time an external join) with CSV2 :
kw2 = dict(how='outer', \ left_on=[3], \ right_on=[1], \ suffixes=('l', 'r')) df2 = merge(df1, csv2, **kw2) print df2[[15,16,17,18,19,8]]
Conclusion:
0 1 2 3r 4r 5 0 id1 main_data1 a1 a2 a3 data13 1 id2 main_data2 b1 b2 b3 data3 2 id3 main_data3 c1 c2 c3 data51 3 id4 main_data4 d1 d2 d3 data7 4 id5 main_data5 e1 e2 e3 NaN
You do not need to use **kw for keyword arguments. I just used it to do everything horizontally.
I allow read_table and merge define column names. If you assign column names yourself, you will get better output.