I have a dataframe A that has three columns: 'id', 'value' and 'date'. I did a group by id and noticed something strange: if I do two consecutive sortings for rows with a given identifier, first by value and then by date, the order of operations affects the order of the lines. Pay attention to the order of the lines with indices 42915 and 42916:
A.sort_values('value').sort_values('date')[A.sort_values('value').sort_values('date')['id'] == '0001249666']
id value date
42913 0001249666 113845.0 20130408
42914 0001249666 114597.0 20130430
42916 0001249666 125972.0 20140414
42915 0001249666 125971.0 20140414
42917 0001249666 136154.0 20150410
42918 0001249666 145551.0 20160407
42919 0001249666 152911.0 20170413
A[A['id'] == '0001249666'].sort_values('value').sort_values('date')
id value date
42913 0001249666 113845.0 20130408
42914 0001249666 114597.0 20130430
42915 0001249666 125971.0 20140414
42916 0001249666 125972.0 20140414
42917 0001249666 136154.0 20150410
42918 0001249666 145551.0 20160407
42919 0001249666 152911.0 20170413
If I, however, use .sort_values with two arguments, the order does not matter.
A.sort_values(['date','value'])[A.sort_values(['date','value'])['id'] == '0001249666']
id value date
42913 0001249666 113845.0 20130408
42914 0001249666 114597.0 20130430
42915 0001249666 125971.0 20140414
42916 0001249666 125972.0 20140414
42917 0001249666 136154.0 20150410
42918 0001249666 145551.0 20160407
42919 0001249666 152911.0 20170413
a[a['id'] == '0001249666'].sort_values(['date','value'])
id value date
42913 0001249666 113845.0 20130408
42914 0001249666 114597.0 20130430
42915 0001249666 125971.0 20140414
42916 0001249666 125972.0 20140414
42917 0001249666 136154.0 20150410
42918 0001249666 145551.0 20160407
42919 0001249666 152911.0 20170413
Now I know that what I'm doing is not the smartest way to do what I want, but I'm really interested in understanding what this behavior can explain? What do I mean when I do not understand this behavior.
source
share