How to perform column calculation based on column name for large table

I have a table with more than 200 columns. The columns come in pairs that share a suffix (for example, the two Benz columns below). For each pair I want to compute the difference as a new column (the last column shows the result for Audi). I thought of splitting the table into two tables (A and B) by the first letter and sorting the columns, but is there a more efficient way in Pandas? Thanks!

A_Benz  B_Benz  A_Audi  B_Audi  A_Honda  B_Honda  dif_Audi
1       0       1       1       0        0         0
1       0       0       1       0        0        -1
1       0       0       1       0        0        -1
1       0       1       1       1        1         0
1       0       0       1       0        0        -1
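For reference, the sample above can be reconstructed like this (values are taken from the question; the 1-based index is an assumption based on the outputs shown below):

```python
import pandas as pd

# Rebuild the question's example table
df = pd.DataFrame({
    'A_Benz':  [1, 1, 1, 1, 1],
    'B_Benz':  [0, 0, 0, 0, 0],
    'A_Audi':  [1, 0, 0, 1, 0],
    'B_Audi':  [1, 1, 1, 1, 1],
    'A_Honda': [0, 0, 0, 1, 0],
    'B_Honda': [0, 0, 0, 1, 0],
}, index=range(1, 6))

# The desired new column for one pair, e.g. Audi, is A minus B
df['dif_Audi'] = df['A_Audi'] - df['B_Audi']
print(df['dif_Audi'].tolist())  # [0, -1, -1, 0, -1]
```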
3 answers

Assuming this is your starting point -

df
   A_Benz  B_Benz  A_Audi  B_Audi  A_Honda  B_Honda
1       1       0       1       1        0        0
2       1       0       0       1        0        0
3       1       0       0       1        0        0
4       1       0       1       1        1        1
5       1       0       0       1        0        0

Option 1
This would make a good use case for filter:

i = df.filter(regex='^A_')
j = df.filter(regex='^B_')

i.columns = i.columns.str.split('_', n=1).str[-1]
j.columns = j.columns.str.split('_', n=1).str[-1]

(i - j).add_prefix('diff_')

   diff_Benz  diff_Audi  diff_Honda
1          1          0           0
2          1         -1           0
3          1         -1           0
4          1          0           0
5          1         -1           0

To attach the new columns to the original frame, use concat:

df = pd.concat([df, (i - j).add_prefix('diff_')], axis=1)
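The same idea scales to the full 200-column table. Here is a sketch wrapping it in a helper (the function name and defaults are illustrative, not part of the answer above):

```python
import pandas as pd

def pairwise_diff(df, left='A_', right='B_', prefix='diff_'):
    """Subtract right-prefixed columns from left-prefixed ones.

    Illustrative helper; works for any number of column pairs
    that share a suffix after the first underscore.
    """
    i = df.filter(regex=f'^{left}')
    j = df.filter(regex=f'^{right}')
    # Align the two halves on the common suffix
    i.columns = i.columns.str.split('_', n=1).str[-1]
    j.columns = j.columns.str.split('_', n=1).str[-1]
    return pd.concat([df, (i - j).add_prefix(prefix)], axis=1)

df = pd.DataFrame({'A_Benz': [1, 1], 'B_Benz': [0, 0],
                   'A_Audi': [1, 0], 'B_Audi': [1, 1]})
out = pairwise_diff(df)
print(out[['diff_Benz', 'diff_Audi']].values.tolist())  # [[1, 0], [1, -1]]
```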

Option 2
This uses diff to take consecutive column differences:

import re

# if needed, order the columns correctly
df = df[sorted(df.columns, key=lambda x: x.split('_', 1)[1])]
# compute consecutive column differences
df.diff(-1, axis=1).iloc[:, ::2].rename(columns=lambda x: re.sub('A_', 'diff_', x))

   diff_Benz  diff_Audi  diff_Honda
1        1.0        0.0         0.0
2        1.0       -1.0         0.0
3        1.0       -1.0         0.0
4        1.0        0.0         0.0
5        1.0       -1.0         0.0
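Note that diff(-1, axis=1) subtracts each following column, and iloc[:, ::2] keeps every other result, so this only works when the sorted columns strictly alternate A_*/B_*. A small sanity check can guard that assumption (a sketch, assuming the one-character prefixes from the question):

```python
import pandas as pd

df = pd.DataFrame({'A_Benz': [1], 'B_Benz': [0],
                   'A_Audi': [1], 'B_Audi': [1]})

# Sort so each A_* column is immediately followed by its B_* partner
cols = sorted(df.columns, key=lambda x: x.split('_', 1)[1])
df = df[cols]

# Before trusting diff(-1), verify that the suffixes really pair up
pairs = list(zip(cols[::2], cols[1::2]))
assert all(a.split('_', 1)[1] == b.split('_', 1)[1] for a, b in pairs), \
    "columns are not strictly A/B interleaved"
print(pairs)  # [('A_Audi', 'B_Audi'), ('A_Benz', 'B_Benz')]
```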

If performance matters, here is a NumPy-based variant (suggested by @jpp) -

c = sorted(df.columns, key=lambda x: x.split('_', 1)[1])
df = df[c]

pd.DataFrame(
    df.iloc[:, ::2].values - df.iloc[:, 1::2].values, columns=c[::2]
)

   A_Audi  A_Benz  A_Honda
0       0       1        0
1      -1       1        0
2      -1       1        0
3       0       1        0
4      -1       1        0

IIUC

s=pd.Series(df.columns).str.split('_',expand=True)[1]
df.groupby(s.values,axis=1).diff().dropna(axis=1)
Out[1252]: 
   B_Benz  B_Audi  B_Honda
1    -1.0     0.0      0.0
2    -1.0     1.0      0.0
3    -1.0     1.0      0.0
4    -1.0     0.0      0.0
5    -1.0     1.0      0.0
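Note that diff() within each group computes the B_* column minus the A_* column, the opposite sign of the question's dif_Audi. A sketch that flips the sign, and uses a transpose since groupby(axis=1) is deprecated in recent pandas:

```python
import pandas as pd

df = pd.DataFrame({'A_Benz': [1, 1], 'B_Benz': [0, 0],
                   'A_Audi': [1, 0], 'B_Audi': [1, 1]})

# Group column labels by their suffix (Benz, Audi, ...)
s = [c.split('_', 1)[1] for c in df.columns]

# Transposing sidesteps the deprecated axis=1 groupby; within each
# group diff() yields B - A, so negate to match the question's A - B
res = -(df.T.groupby(s).diff().dropna().T)
res.columns = ['dif_' + c.split('_', 1)[1] for c in res.columns]
print(res.values.tolist())  # [[1.0, 0.0], [1.0, -1.0]]
```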

This is a NumPy-based solution.

While performance may not be a concern here, it should be more efficient than the Pandas-based methods.

df = df[sorted(df, key=lambda x: x.split('_')[::-1])]

A = df.values
cars = [x[2:] for x in df.columns[::2]]

res = df.join(pd.DataFrame(A[:, ::2] - A[:, 1::2], columns=cars).add_prefix('Diff_'))

Result

   A_Audi  B_Audi  A_Benz  B_Benz  A_Honda  B_Honda  Diff_Audi  Diff_Benz  \
0       1       1       1       0        0        0          0          1   
1       0       1       1       0        0        0         -1          1   
2       0       1       1       0        0        0         -1          1   
3       1       1       1       0        1        1          0          1   
4       0       1       1       0        0        0         -1          1   

   Diff_Honda  
0           0  
1           0  
2           0  
3           0  
4           0  

Explanation

  • Sort the columns by make of car, then by prefix.
  • Extract the car names with a list comprehension.
  • Use NumPy array slicing to build the frame of differences and join it to the original.
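As a quick correctness check, the slice-based result can be compared pair by pair against plain column subtraction (a sketch on a small frame; names mirror the code above):

```python
import pandas as pd

df = pd.DataFrame({'A_Benz': [1, 0], 'B_Benz': [0, 0],
                   'A_Audi': [1, 0], 'B_Audi': [1, 1]})

# Sort by make of car, then prefix, as in the answer above
df = df[sorted(df, key=lambda x: x.split('_')[::-1])]

A = df.values
cars = [x[2:] for x in df.columns[::2]]
res = pd.DataFrame(A[:, ::2] - A[:, 1::2], columns=cars).add_prefix('Diff_')

# Cross-check each pair against a plain column subtraction
for car in cars:
    expected = df['A_' + car] - df['B_' + car]
    assert (res['Diff_' + car].values == expected.values).all()
print(res.columns.tolist())  # ['Diff_Audi', 'Diff_Benz']
```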

Source: https://habr.com/ru/post/1695437/

