Retrieving the latest observation and dates from multiple columns

Take the following toy DataFrame:

    import numpy as np
    import pandas as pd

    data = np.arange(35, dtype=np.float32).reshape(7, 5)
    data = pd.concat((
        pd.DataFrame(list('abcdefg'), columns=['field1']),
        pd.DataFrame(data, columns=['field2', '2014', '2015', '2016', '2017'])), axis=1)
    data.iloc[1:4, 4:] = np.nan   # blank out 2016-2017 for rows b, c, d
    data.iloc[4, 3:] = np.nan     # blank out 2015-2017 for row e
    print(data)

      field1  field2  2014  2015  2016  2017
    0      a     0.0   1.0   2.0   3.0   4.0
    1      b     5.0   6.0   7.0   NaN   NaN
    2      c    10.0  11.0  12.0   NaN   NaN
    3      d    15.0  16.0  17.0   NaN   NaN
    4      e    20.0  21.0   NaN   NaN   NaN
    5      f    25.0  26.0  27.0  28.0  29.0
    6      g    30.0  31.0  32.0  33.0  34.0

I would like to replace the year columns (2014-2017) with two fields: the most recent non-empty observation and the corresponding year of that observation. Suppose field1 is a unique key. (I do not want to do any group operations; just one line per record.) I.e.:

      field1  field2   obs  date
    0      a     0.0   4.0  2017
    1      b     5.0   7.0  2015
    2      c    10.0  12.0  2015
    3      d    15.0  17.0  2015
    4      e    20.0  21.0  2014
    5      f    25.0  29.0  2017
    6      g    30.0  34.0  2017

I got this far:

    pd.melt(data, id_vars=['field1', 'field2'],
            value_vars=['2014', '2015', '2016', '2017'])\
      .dropna(subset=['value'])

       field1  field2 variable  value
    0       a     0.0     2014    1.0
    1       b     5.0     2014    6.0
    2       c    10.0     2014   11.0
    3       d    15.0     2014   16.0
    4       e    20.0     2014   21.0
    5       f    25.0     2014   26.0
    6       g    30.0     2014   31.0
    # ...

But I'm struggling with how to get from here back to the desired format.

4 answers

Maybe:

    d2 = data.melt(id_vars=["field1", "field2"], var_name="date", value_name="obs")\
             .dropna(subset=["obs"])
    d2["date"] = d2["date"].astype(int)
    df = d2.loc[d2.groupby(["field1", "field2"])["date"].idxmax()]

which gives me

       field1  field2  date   obs
    21      a     0.0  2017   4.0
    8       b     5.0  2015   7.0
    9       c    10.0  2015  12.0
    10      d    15.0  2015  17.0
    4       e    20.0  2014  21.0
    26      f    25.0  2017  29.0
    27      g    30.0  2017  34.0
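The result keeps the melted row index and puts date before obs; if you want the asker's exact layout back, a small follow-up sketch (nothing beyond standard pandas) tidies it up:

    df = df.sort_values('field1')\
           .reset_index(drop=True)[['field1', 'field2', 'obs', 'date']]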

What about the following approach (note that df below differs from the question's data in row a, where 2017 is -10.0, to show that the latest observation is picked rather than the largest):

    In [160]: df
    Out[160]:
      field1  field2  2014  2015  2016  2017
    0      a     0.0   1.0   2.0   3.0 -10.0
    1      b     5.0   6.0   7.0   NaN   NaN
    2      c    10.0  11.0  12.0   NaN   NaN
    3      d    15.0  16.0  17.0   NaN   NaN
    4      e    20.0  21.0   NaN   NaN   NaN
    5      f    25.0  26.0  27.0  28.0  29.0
    6      g    30.0  31.0  32.0  33.0  34.0

    In [180]: df.groupby(lambda x: 'obs' if x.isdigit() else x, axis=1) \
         ...:   .last() \
         ...:   .assign(date=df.filter(regex=r'^\d{4}').loc[:, ::-1].notnull().idxmax(1))
    Out[180]:
      field1  field2   obs  date
    0      a     0.0 -10.0  2017
    1      b     5.0   7.0  2015
    2      c    10.0  12.0  2015
    3      d    15.0  17.0  2015
    4      e    20.0  21.0  2014
    5      f    25.0  29.0  2017
    6      g    30.0  34.0  2017
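As a side note, column-wise groupby (the axis=1 argument) has been deprecated in newer pandas releases. Here is a minimal sketch of an equivalent without it, applied to the question's data; it assumes the year columns are exactly the four-digit ones and that every row has at least one observation:

    years = data.filter(regex=r'^\d{4}$')                # just the year columns
    result = data[['field1', 'field2']].assign(
        obs=years.ffill(axis=1).iloc[:, -1],             # carry values right, keep last column
        date=years.loc[:, ::-1].notna().idxmax(axis=1),  # first non-null year from the right
    )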

last_valid_index + agg('last')

    A = data.iloc[:, 2:].apply(lambda x: x.last_valid_index(), 1)  # label of the last non-null year
    B = data.groupby(['value'] * data.shape[1], 1).agg('last')     # last non-null value in each row
    data['date'] = A
    data['obs'] = B
    data
    Out[1326]:
      field1  field2  2014  2015  2016  2017  date   obs
    0      a     0.0   1.0   2.0   3.0   4.0  2017   4.0
    1      b     5.0   6.0   7.0   NaN   NaN  2015   7.0
    2      c    10.0  11.0  12.0   NaN   NaN  2015  12.0
    3      d    15.0  16.0  17.0   NaN   NaN  2015  17.0
    4      e    20.0  21.0   NaN   NaN   NaN  2014  21.0
    5      f    25.0  26.0  27.0  28.0  29.0  2017  29.0
    6      g    30.0  31.0  32.0  33.0  34.0  2017  34.0

Using assign, we can add both columns in one line:

    data.assign(date=data.iloc[:, 2:].apply(lambda x: x.last_valid_index(), 1),
                obs=data.groupby(['value'] * data.shape[1], 1).agg('last'))
    Out[1340]:
      field1  field2  2014  2015  2016  2017  date   obs
    0      a     0.0   1.0   2.0   3.0   4.0  2017   4.0
    1      b     5.0   6.0   7.0   NaN   NaN  2015   7.0
    2      c    10.0  11.0  12.0   NaN   NaN  2015  12.0
    3      d    15.0  16.0  17.0   NaN   NaN  2015  17.0
    4      e    20.0  21.0   NaN   NaN   NaN  2014  21.0
    5      f    25.0  26.0  27.0  28.0  29.0  2017  29.0
    6      g    30.0  31.0  32.0  33.0  34.0  2017  34.0
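In both variants the year columns are still in the frame; to reach the exact shape the question asks for, a last step (a straightforward sketch, assuming 2014-2017 are the columns to discard) drops them:

    result = data.drop(columns=['2014', '2015', '2016', '2017'])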

Another possibility, using sort_values and drop_duplicates:

    data.melt(id_vars=["field1", "field2"], var_name="date", value_name="obs")\
        .dropna(subset=['obs'])\
        .sort_values(['field1', 'date'], ascending=[True, False])\
        .drop_duplicates('field1', keep='first')

which gives you

       field1  field2  date   obs
    21      a     0.0  2017   4.0
    8       b     5.0  2015   7.0
    9       c    10.0  2015  12.0
    10      d    15.0  2015  17.0
    4       e    20.0  2014  21.0
    26      f    25.0  2017  29.0
    27      g    30.0  2017  34.0
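Note that date is still the string column label here, so the descending sort is lexicographic; that coincides with numeric order only because all the years are four-digit strings. A slightly more defensive variant of the same pipeline (a sketch that casts date to int first) would be:

    (data.melt(id_vars=["field1", "field2"], var_name="date", value_name="obs")
         .dropna(subset=['obs'])
         .astype({'date': int})
         .sort_values(['field1', 'date'], ascending=[True, False])
         .drop_duplicates('field1', keep='first'))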
