Pandas data frame - lambda calculus and the minimum value for each series

I have a csv that contains 3 columns, count_id, AMV and time.

I use pandas and read this as a data frame.

results= pd.read_csv('./output.csv') 

First, I sort the data frame first for count_id, and then for AMV.

 results_sorted = results.sort_index(by=['count_id','AMV'], ascending=[True, True]) 

This gives

  count_id AMV Hour 0 16012E 4004 14 1 16012E 4026 12 2 16012E 4099 15 3 16012E 4167 11 4 16012E 4239 10 5 16012E 4324 13 6 16012E 4941 16 7 16012E 5088 17 8 16012E 5283 9 9 16012E 5620 8 10 16012E 5946 18 11 16012E 6146 7 12 16012W 3622 10 13 16012W 3904 12 14 16012W 3979 11 15 16012W 4076 9 16 16012W 4189 13 17 16012W 4870 14 18 16012W 4899 18 19 16012W 5107 15 20 16012W 5659 8 21 16012W 6325 7 22 16012W 6460 17 23 16012W 6500 16 

Now I want to do some normalization of the data so that I can ultimately build it on the same plot. I want to find the minimum value for AMV for each series (count_id), and then subtract this minimum value from the given AMV. This will give me a new AMV_norm column.

What it looks like:

  count_id AMV Hour AMV_norm 0 16012E 4004 14 0 1 16012E 4026 12 22 2 16012E 4099 15 95 3 16012E 4167 11 163 4 16012E 4239 10 235 5 16012E 4324 13 320 6 16012E 4941 16 937 7 16012E 5088 17 1084 8 16012E 5283 9 1279 9 16012E 5620 8 1616 10 16012E 5946 18 1942 11 16012E 6146 7 2142 12 16012W 3622 10 0 13 16012W 3904 12 282 14 16012W 3979 11 357 15 16012W 4076 9 454 16 16012W 4189 13 567 17 16012W 4870 14 1248 18 16012W 4899 18 1277 19 16012W 5107 15 1485 20 16012W 5659 8 2037 21 16012W 6325 7 2703 22 16012W 6460 17 2838 23 16012W 6500 16 2878 

How to define a function that finds the minimum AMV for each series, rather than the minimum AMV in general? It will look something like this:

 def minimum_series_value(AMV): return AMV.argmin() 

Then I need to create a new column, and with the help of the lambda function this row will be filled. I know this will look something like this:

 results_sorted['AMV_norm'] = results_sorted.apply(lambda row:results_sorted(row['AMV'])) 
+5
source share
2 answers

Subtract the AMV column from the min transform:

 In [11]: df.groupby('count_id')["AMV"].transform('min') Out[11]: 0 4004 1 4004 2 4004 3 4004 4 4004 ... 21 3622 22 3622 23 3622 dtype: int64 In [12]: df["AMV"] - df.groupby('count_id')["AMV"].transform('min') Out[12]: 0 0 1 22 2 95 3 163 4 235 ... 21 2703 22 2838 23 2878 dtype: int64 In [13]: df["AMV_norm"] = df["AMV"] - df.groupby('count_id')["AMV"].transform('min') 
+3
source

I believe that you want to group by count_id , and then calculate the difference between the current values ​​and the minimum value for this group.

 df['AMV_norm'] = (df.groupby('count_id').AMV .transform(lambda group_series: group_series - np.min(group_series))) >>> df count_id AMV Hour AMV_norm 0 16012E 4004 14 0 1 16012E 4026 12 22 2 16012E 4099 15 95 3 16012E 4167 11 163 4 16012E 4239 10 235 5 16012E 4324 13 320 6 16012E 4941 16 937 7 16012E 5088 17 1084 8 16012E 5283 9 1279 9 16012E 5620 8 1616 10 16012E 5946 18 1942 11 16012E 6146 7 2142 12 16012W 3622 10 0 13 16012W 3904 12 282 14 16012W 3979 11 357 15 16012W 4076 9 454 16 16012W 4189 13 567 17 16012W 4870 14 1248 18 16012W 4899 18 1277 19 16012W 5107 15 1485 20 16012W 5659 8 2037 21 16012W 6325 7 2703 22 16012W 6460 17 2838 23 16012W 6500 16 2878 

EDIT : @AndyHayden's approach is a bit faster:

 %timeit df["AMV"] - df.groupby('count_id')["AMV"].transform('min') 1000 loops, best of 3: 736 Β΅s per loop %timeit df.groupby('count_id').AMV.transform(lambda x: x - np.min(x)) 1000 loops, best of 3: 804 Β΅s per loop %timeit df.groupby('count_id').AMV.apply(lambda x: x - np.min(x)) 1000 loops, best of 3: 1.32 ms per loop 
+1
source

Source: https://habr.com/ru/post/1234307/


All Articles