Take the following toy DataFrame:

    import numpy as np
    import pandas as pd

    data = np.arange(35, dtype=np.float32).reshape(7, 5)
    data = pd.concat((
        pd.DataFrame(list('abcdefg'), columns=['field1']),
        pd.DataFrame(data, columns=['field2', '2014', '2015', '2016', '2017'])),
        axis=1)
    data.iloc[1:4, 4:] = np.nan  # rows b-d: no 2016/2017 observations
    data.iloc[4, 3:] = np.nan    # row e: nothing after 2014
    print(data)

      field1  field2  2014  2015  2016  2017
    0      a     0.0   1.0   2.0   3.0   4.0
    1      b     5.0   6.0   7.0   NaN   NaN
    2      c    10.0  11.0  12.0   NaN   NaN
    3      d    15.0  16.0  17.0   NaN   NaN
    4      e    20.0  21.0   NaN   NaN   NaN
    5      f    25.0  26.0  27.0  28.0  29.0
    6      g    30.0  31.0  32.0  33.0  34.0

I would like to replace the year columns (2014-2017) with two fields: the most recent non-empty observation and the year of that observation. Assume field1 is a unique key. (I do not want to do any group operations, just one row per record.) That is:

      field1  field2   obs  date
    0      a     0.0   4.0  2017
    1      b     5.0   7.0  2015
    2      c    10.0  12.0  2015
    3      d    15.0  17.0  2015
    4      e    20.0  21.0  2014
    5      f    25.0  29.0  2017
    6      g    30.0  34.0  2017

I got this far:

    pd.melt(data, id_vars=['field1', 'field2'],
            value_vars=['2014', '2015', '2016', '2017'])\
        .dropna(subset=['value'])

      field1  field2 variable  value
    0      a     0.0     2014    1.0
    1      b     5.0     2014    6.0
    2      c    10.0     2014   11.0
    3      d    15.0     2014   16.0
    4      e    20.0     2014   21.0
    5      f    25.0     2014   26.0
    6      g    30.0     2014   31.0
    # ...

But I'm struggling with how to get the desired format back.
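For reference, here is a minimal sketch of one way the melted frame might be collapsed back to one row per record. It is not necessarily the best approach. It assumes field1 is unique (as stated above) and that the year labels are four-digit strings, so sorting them as text matches chronological order. The names `year_cols`, `long_form`, and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the toy frame from the question.
values = np.arange(35, dtype=np.float32).reshape(7, 5)
data = pd.concat(
    (pd.DataFrame(list('abcdefg'), columns=['field1']),
     pd.DataFrame(values, columns=['field2', '2014', '2015', '2016', '2017'])),
    axis=1)
data.iloc[1:4, 4:] = np.nan
data.iloc[4, 3:] = np.nan

year_cols = ['2014', '2015', '2016', '2017']  # illustrative name

# Melt to long form and drop the empty observations.
long_form = (pd.melt(data, id_vars=['field1', 'field2'], value_vars=year_cols,
                     var_name='date', value_name='obs')
               .dropna(subset=['obs']))

# Sort by year, then keep only the last (latest) row for each key.
# Assumption: field1 is a unique key, so no groupby is needed.
result = (long_form.sort_values('date')
                   .drop_duplicates(subset='field1', keep='last')
                   .sort_values('field1')
                   .reset_index(drop=True)
                   [['field1', 'field2', 'obs', 'date']])
print(result)
```

Note that a record with no observation in any year would vanish at the `dropna` step, so this sketch only fits data where every row has at least one non-empty year.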