Combining 2 csv datasets with Python with a common identifier column - one csv has multiple records for a unique identifier

I am very new to Python. Any support is much appreciated.

I have two csv files that I am trying to combine using the Student_ID column and create a new csv file.

csv1: each entry has a unique Student_ID

    Student_ID  Age  Course  startYear
    119         24   Bsc     2014

csv2: has several entries per Student_ID, because it has a new entry for each subject that the student takes

    Student_ID  sub_name  marks  Sub_year_level
    119         Botany1   60     2
    119         Anatomy   70     2
    119         cell bio  75     3
    129         Physics1  78     2
    129         Math1     60     1

I want to combine the two csv files so that I keep all the records and columns from csv1 and add new columns holding the average mark (which has to be calculated from csv2) for each Sub_year_level of each student. The final csv file will therefore have one row per unique Student_ID.

I want my new csv output file to look like this:

    Student_ID  Age  Course  startYear  level1_avg_mark  level2_avg_mark  level3_avg_mark
    119         24   Bsc     2014       60               65               70
2 answers

You can use pivot_table with join:

Note: the fill_value parameter replaces NaN with 0; remove it if you do not need that. The default aggregation function is mean.

    df2 = df2.pivot_table(index='Student_ID',
                          columns='Sub_year_level',
                          values='marks',
                          fill_value=0) \
             .rename(columns='level{}_avg_mark'.format)
    print (df2)
    Sub_year_level  level1_avg_mark  level2_avg_mark  level3_avg_mark
    Student_ID
    119                           0               65               75
    129                          60               78                0

    df = df1.join(df2, on='Student_ID')
    print (df)
       Student_ID  Age Course  startYear  level1_avg_mark  level2_avg_mark  \
    0         119   24    Bsc       2014                0               65

       level3_avg_mark
    0               75
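The answer above starts from df1 and df2 already loaded, while the question asks for csv in and csv out. Below is a minimal end-to-end sketch under some assumptions: the files are comma-separated, io.StringIO objects stand in for the real files, and the names avg, result and csv_text are chosen here for illustration. fill_value is left out, so levels with no marks stay NaN.

```python
import io

import pandas as pd

# Stand-ins for the two csv files (assumed comma-separated);
# with real files, pass their paths to pd.read_csv instead.
csv1 = io.StringIO(
    "Student_ID,Age,Course,startYear\n"
    "119,24,Bsc,2014\n"
)
csv2 = io.StringIO(
    "Student_ID,sub_name,marks,Sub_year_level\n"
    "119,Botany1,60,2\n"
    "119,Anatomy,70,2\n"
    "119,cell bio,75,3\n"
    "129,Physics1,78,2\n"
    "129,Math1,60,1\n"
)
df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

# One row per student, one column per year level; mean is the default aggfunc.
avg = df2.pivot_table(index="Student_ID", columns="Sub_year_level", values="marks")
avg = avg.rename(columns="level{}_avg_mark".format)

# join is a left join by default, so every row of df1 is kept.
result = df1.join(avg, on="Student_ID")

# Write the combined table; replace the buffer with an output path of your choice.
out = io.StringIO()
result.to_csv(out, index=False)
csv_text = out.getvalue()
```

Because the join is a left join on df1, student 129 (present only in the second file) does not appear in the result.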

EDIT:

This needs a custom aggregation function (here, one that ignores zero marks when averaging):

    print (df2)
       Student_ID  sub_name  marks  Sub_year_level
    0         119   Botany1      0               2
    1         119   Botany1      0               2
    2         119   Anatomy     72               2
    3         119  cell bio     75               3
    4         129  Physics1     78               2
    5         129     Math1     60               1

    # average only the non-zero marks in each group
    f = lambda x: x[x != 0].mean()
    df2 = (df2.pivot_table(index='Student_ID', columns='Sub_year_level',
                           values='marks', aggfunc=f)
              .rename(columns='level{}_avg_mark'.format)
              .reset_index())
    print (df2)
    Sub_year_level  Student_ID  level1_avg_mark  level2_avg_mark  level3_avg_mark
    0                      119              NaN             72.0             75.0
    1                      129             60.0             78.0              NaN
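As a possible alternative to the lambda, the zero marks can be turned into NaN before pivoting, since mean skips NaN by default. This is only a sketch of that variant, not the answer's own method; it rebuilds the sample data from the EDIT, and the names nonzero and avg are chosen here for illustration.

```python
import pandas as pd

# Sample data copied from the EDIT above.
df2 = pd.DataFrame({
    "Student_ID": [119, 119, 119, 119, 129, 129],
    "sub_name": ["Botany1", "Botany1", "Anatomy", "cell bio", "Physics1", "Math1"],
    "marks": [0, 0, 72, 75, 78, 60],
    "Sub_year_level": [2, 2, 2, 3, 2, 1],
})

# Replace zero marks with NaN; the default mean aggregation ignores NaN.
nonzero = df2.assign(marks=df2["marks"].mask(df2["marks"] == 0))

avg = (
    nonzero.pivot_table(index="Student_ID", columns="Sub_year_level", values="marks")
    .rename(columns="level{}_avg_mark".format)
    .reset_index()
)
```

This treats a mark of 0 as "missing", which matches what the lambda does; if 0 can be a genuine mark in your data, neither version is appropriate.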

You can use groupby to calculate the average marks, then unstack to get all levels on one row per student, and finally rename the columns.

Once this is done, groupby + unstack have conveniently left Student_ID in the index, which makes it easy to join on. All that remains is to do the join and specify the on parameter.

    d1.join(
        d2.groupby(
            ['Student_ID', 'Sub_year_level']
        ).marks.mean().unstack().rename(columns='level{}_avg_mark'.format),
        on='Student_ID'
    )
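For anyone who wants to run the groupby + unstack approach step by step, here is a self-contained sketch. It assumes d1 and d2 hold the two tables from the question (built inline here rather than read from files); the names avg and combined are chosen for illustration.

```python
import pandas as pd

# The two tables from the question, built inline for the example.
d1 = pd.DataFrame({
    "Student_ID": [119],
    "Age": [24],
    "Course": ["Bsc"],
    "startYear": [2014],
})
d2 = pd.DataFrame({
    "Student_ID": [119, 119, 119, 129, 129],
    "sub_name": ["Botany1", "Anatomy", "cell bio", "Physics1", "Math1"],
    "marks": [60, 70, 75, 78, 60],
    "Sub_year_level": [2, 2, 3, 2, 1],
})

# Mean mark per (student, level), then move the level into the columns.
avg = (
    d2.groupby(["Student_ID", "Sub_year_level"])["marks"]
    .mean()
    .unstack()
    .rename(columns="level{}_avg_mark".format)
)

# avg is indexed by Student_ID, so it can be joined on d1's Student_ID column.
combined = d1.join(avg, on="Student_ID")
```

Levels for which a student has no marks come out as NaN; chaining .fillna(0) on avg would give zeros instead, if that is what you want.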


Source: https://habr.com/ru/post/1265530/

