Setting a value on a copy of a slice from a DataFrame

I created the following example, which is similar to my situation and data:

Let's say I have the following DataFrame:

    import pandas as pd

    df = pd.DataFrame({'ID': [1, 2, 3, 4],
                       'price': [25, 30, 34, 40],
                       'Category': ['small', 'medium', 'medium', 'small']})

which looks like this:

      Category  ID  price
    0    small   1     25
    1   medium   2     30
    2   medium   3     34
    3    small   4     40

Now I have the following function that returns the discount amount based on the following logic:

    def mapper(price, category):
        # 1% discount for 'small' items, 2% for everything else
        if category == 'small':
            discount = 0.01 * price
        else:
            discount = 0.02 * price
        return discount

Now I want to get this DataFrame:

      Category  ID  price  Discount
    0    small   1     25      0.25
    1   medium   2     30      0.60
    2   medium   3     34      0.68
    3    small   4     40      0.40

So I decided to call Series.map on the price column, because I do not want to use apply. I am working with a large DataFrame, and map is much faster than apply.

I tried to do this:

    for c in df.Category.unique():
        df[df['Category'] == c]['Discount'] = df[df['Category'] == c]['price'].map(lambda x: mapper(x, c))

This did not work as I expected, because I am trying to set a value on a copy of a slice from a DataFrame.

My question is: is there a way to do this without using df.apply()?

3 answers

One approach uses np.where:

    mask = df.Category.values == 'small'
    df['Discount'] = np.where(mask, df.price * 0.01, df.price * 0.02)
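
For readers who want to run this end to end, here is a minimal self-contained sketch of the np.where approach on the question's sample data. It assumes the 1% / 2% rates implied by the expected output, and it builds the mask with `.to_numpy()`, which is a substitution on my part for the `.values` spelling used in the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'price': [25, 30, 34, 40],
    'Category': ['small', 'medium', 'medium', 'small'],
})

# Boolean array: True where the row belongs to the 'small' category
mask = (df['Category'] == 'small').to_numpy()
price = df['price'].to_numpy()

# np.where picks from the first array where mask is True, else from the second
df['Discount'] = np.where(mask, price * 0.01, price * 0.02)

print(df)
```

Both candidate arrays are computed for every row before np.where selects between them, which is cheap for simple arithmetic like this.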

Another way to express the same thing:

    df['Discount'] = df.price * 0.01
    df['Discount'][df.Category.values != 'small'] *= 2

For performance, it may help to work with the underlying array data, so we could use df.price.values in place of df.price.
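
Note that `df['Discount'][...] *= 2` indexes the frame twice, which is a form of chained indexing that the pandas documentation advises against. A hedged sketch of the same two-step idea with a single `.loc` call follows; the name `not_small` is only illustrative, and the 1% / 2% rates are assumed from the expected output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'price': [25, 30, 34, 40],
    'Category': ['small', 'medium', 'medium', 'small'],
})

# Start everyone at the 'small' rate (1%)
df['Discount'] = df['price'] * 0.01

# Double the discount for every non-'small' row, selecting rows and
# column in one .loc call so the write goes to df itself
not_small = df['Category'] != 'small'
df.loc[not_small, 'Discount'] *= 2

print(df)
```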

Benchmarking

Approaches -

    def app1(df):  # Proposed app#1 here
        mask = df.Category.values == 'small'
        df_price = df.price.values
        df['Discount'] = np.where(mask, df_price * 0.01, df_price * 0.02)
        return df

    def app2(df):  # Proposed app#2 here
        df['Discount'] = df.price.values * 0.01
        df['Discount'][df.Category.values != 'small'] *= 2
        return df

    def app3(df):  # @piRSquared soln
        # assign returns a new DataFrame, so return its result
        return df.assign(
            Discount=((1 - (df.Category.values == 'small')) + 1) / 100 * df.price.values)

    def app4(df):  # @MaxU soln
        # assign returns a new DataFrame, so return its result
        return df.assign(Discount=df.price * df.Category.map({'small': 0.01}).fillna(0.02))

Timings -

1) Large dataset:

    In [122]: df
    Out[122]:
      Category  ID  price  Discount
    0    small   1     25      0.25
    1   medium   2     30      0.60
    2   medium   3     34      0.68
    3    small   4     40      0.40

    In [123]: df1 = pd.concat([df]*1000,axis=0)
         ...: df2 = pd.concat([df]*1000,axis=0)
         ...: df3 = pd.concat([df]*1000,axis=0)
         ...: df4 = pd.concat([df]*1000,axis=0)
         ...:

    In [124]: %timeit app1(df1)
         ...: %timeit app2(df2)
         ...: %timeit app3(df3)
         ...: %timeit app4(df4)
         ...:
    1000 loops, best of 3: 209 µs per loop
    10 loops, best of 3: 63.2 ms per loop
    1000 loops, best of 3: 351 µs per loop
    1000 loops, best of 3: 720 µs per loop

2) Very large dataset:

    In [125]: df1 = pd.concat([df]*10000,axis=0)
         ...: df2 = pd.concat([df]*10000,axis=0)
         ...: df3 = pd.concat([df]*10000,axis=0)
         ...: df4 = pd.concat([df]*10000,axis=0)
         ...:

    In [126]: %timeit app1(df1)
         ...: %timeit app2(df2)
         ...: %timeit app3(df3)
         ...: %timeit app4(df4)
         ...:
    1000 loops, best of 3: 758 µs per loop
    1 loops, best of 3: 2.78 s per loop
    1000 loops, best of 3: 1.37 ms per loop
    100 loops, best of 3: 2.57 ms per loop

Further improvement by reusing data -

    def app1_modified(df):
        mask = df.Category.values == 'small'
        df_price = df.price.values * 0.01  # compute the 1% array once and reuse it
        df['Discount'] = np.where(mask, df_price, df_price * 2)
        return df

Timings -

    In [133]: df1 = pd.concat([df]*10000,axis=0)
         ...: df2 = pd.concat([df]*10000,axis=0)
         ...: df3 = pd.concat([df]*10000,axis=0)
         ...: df4 = pd.concat([df]*10000,axis=0)
         ...:

    In [134]: %timeit app1(df1)
    1000 loops, best of 3: 699 µs per loop

    In [135]: %timeit app1_modified(df1)
    1000 loops, best of 3: 655 µs per loop
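
`%timeit` is an IPython magic, so the sessions above will not run in a plain Python script. As a rough sketch, a comparable measurement can be made with the standard library's `timeit` module. The function names `with_where` and `with_map`, the 1000x replication, and the repeat count are my own choices for illustration, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'price': [25, 30, 34, 40],
    'Category': ['small', 'medium', 'medium', 'small'],
})
# Replicate the four sample rows to get a larger frame (4000 rows)
big = pd.concat([base] * 1000, ignore_index=True)


def with_where(df):
    # np.where on raw arrays
    mask = (df['Category'] == 'small').to_numpy()
    price = df['price'].to_numpy()
    return np.where(mask, price * 0.01, price * 0.02)


def with_map(df):
    # dict lookup via Series.map, default rate via fillna
    rate = df['Category'].map({'small': 0.01}).fillna(0.02)
    return (df['price'] * rate).to_numpy()


runs = 100
for func in (with_where, with_map):
    seconds = timeit.timeit(lambda: func(big), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.0f} µs per call")
```

Neither function mutates its input, so the same frame can be reused across runs without one measurement affecting the next.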

Also using some numpy

    df.assign(
        Discount=((1 - (df.Category.values == 'small')) + 1) / 100 * df.price.values)

      Category  ID  price  Discount
    0    small   1     25      0.25
    1   medium   2     30      0.60
    2   medium   3     34      0.68
    3    small   4     40      0.40

The operative part is

    ((1 - (df.Category.values == 'small')) + 1) / 100 * df.price.values

This generates a single boolean array and performs simple arithmetic on it to get .01 and .02.
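
To make the arithmetic concrete, here is a small step-by-step sketch on plain NumPy arrays holding the question's sample values. The intermediate names (`is_small`, `multiplier`, `rate`) are introduced only for illustration, and Python 3 division is assumed.

```python
import numpy as np

category = np.array(['small', 'medium', 'medium', 'small'])
price = np.array([25, 30, 34, 40])

is_small = category == 'small'   # True for 'small' rows, False otherwise
multiplier = (1 - is_small) + 1  # True -> (1 - 1) + 1 = 1, False -> (1 - 0) + 1 = 2
rate = multiplier / 100          # 1 -> 0.01, 2 -> 0.02
discount = rate * price

print(multiplier)
print(discount)
```

Booleans behave as 0 and 1 in this arithmetic, which is what lets one comparison stand in for an if/else.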


[Figure: naive timing test on the given data]


Thanks to @Divakar for pointing this out: for those using Python 2.x, you need to force float division like this:

    df.assign(
        Discount=((1 - (df.Category.values == 'small')) + 1) / 100. * df.price.values)

Here is another Pandas approach:

    In [67]: df.assign(Discount=df.price * df.Category.map({'small':0.01}).fillna(0.02))
    Out[67]:
      Category  ID  price  Discount
    0    small   1     25      0.25
    1   medium   2     30      0.60
    2   medium   3     34      0.68
    3    small   4     40      0.40
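
The one-liner relies on `Series.map` returning NaN for any category missing from the dictionary, which `fillna` then replaces with the default rate. A minimal sketch that splits it into steps, with the intermediate names `looked_up`, `rate`, and `result` chosen only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'price': [25, 30, 34, 40],
    'Category': ['small', 'medium', 'medium', 'small'],
})

# Categories missing from the dict ('medium' here) map to NaN
looked_up = df['Category'].map({'small': 0.01})

# Replace those NaN values with the default rate of 2%
rate = looked_up.fillna(0.02)

# assign returns a new DataFrame and leaves df itself untouched
result = df.assign(Discount=df['price'] * rate)

print(result)
```

If more categories need their own rates, they can be added as further keys in the dictionary while `fillna` keeps covering the rest.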

Source: https://habr.com/ru/post/1265717/

