Calculating subtractions of column pairs in pandas DataFrame

I work with fairly large DataFrames (48K rows, up to dozens of columns). At some point while manipulating them I need to do pairwise subtractions of column values, and I was wondering whether there is a more efficient way to do this than my current approach (see below).

My current code is:

    import itertools
    import pandas

    # matrix is the pandas DataFrame containing all the data;
    # group1 and group2 hold the column names to compare
    comparison_df = pandas.DataFrame(index=matrix.index)
    combinations = itertools.product(group1, group2)

    for observed, reference in combinations:
        observed_data = matrix[observed]
        reference_data = matrix[reference]
        comparison = observed_data - reference_data
        name = observed + "_" + reference
        comparison_df[name] = comparison

Since the data can be large (I also use this piece of code during a permutation test), I would like to know whether it can be optimized.

EDIT: As requested, here is a sample of a typical dataset:

    ID                     A1     A2     A3     B1     B2     B3
    Ku8QhfS0n_hIOABXuE  6.343  6.304  6.410  6.287  6.403  6.279
    fqPEquJRRlSVSfL.8A  6.752  6.681  6.680  6.677  6.525  6.739
    ckiehnugOno9d7vf1Q  6.297  6.248  6.524  6.382  6.316  6.453
    x57Vw5B5Fbt5JUnQkI  6.268  6.451  6.379  6.371  6.458  6.333

A typical result, with group "A" as group1 and group "B" as group2, would have one column for each pair generated above (for example A1_B1, A2_B1, A3_B1, ...), containing the subtraction for every row ID.
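To make the expected column names concrete, here is a small sketch of the pairs that `itertools.product` yields for the sample data. The contents of `group1` and `group2` are assumed from the sample header, they are not given explicitly in the question.

```python
import itertools

# Assumed groups, read off the sample header above
group1 = ["A1", "A2", "A3"]
group2 = ["B1", "B2", "B3"]

# One result column per (observed, reference) pair
names = [
    observed + "_" + reference
    for observed, reference in itertools.product(group1, group2)
]
print(names)
```

With three columns per group this gives nine result columns, starting with A1_B1.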

1 answer

Using itertools.combinations() on DataFrame columns

You can generate the column pairs with itertools.combinations() and compute each subtraction, naming the new column after the pair:

    import itertools
    from io import StringIO

    import pandas as pd

    matrix = pd.read_csv(StringIO('''ID,A1,A2,A3,B1,B2,B3
    Ku8QhfS0n_hIOABXuE,6.343,6.304,6.410,6.287,6.403,6.279
    fqPEquJRRlSVSfL.8A,6.752,6.681,6.680,6.677,6.525,6.739
    ckiehnugOno9d7vf1Q,6.297,6.248,6.524,6.382,6.316,6.453
    x57Vw5B5Fbt5JUnQkI,6.268,6.451,6.379,6.371,6.458,6.333''')).set_index('ID')

    print('Original DataFrame:')
    print(matrix)
    print()

    # Create DataFrame to fill with combinations
    comparison_df = pd.DataFrame(index=matrix.index)

    # Create combinations of columns
    for a, b in itertools.combinations(matrix.columns, 2):
        # Subtract column combinations
        comparison_df['{}_{}'.format(a, b)] = matrix[a] - matrix[b]

    print('Combination DataFrame:')
    print(comparison_df)

    Original DataFrame:
                           A1     A2     A3     B1     B2     B3
    ID
    Ku8QhfS0n_hIOABXuE  6.343  6.304  6.410  6.287  6.403  6.279
    fqPEquJRRlSVSfL.8A  6.752  6.681  6.680  6.677  6.525  6.739
    ckiehnugOno9d7vf1Q  6.297  6.248  6.524  6.382  6.316  6.453
    x57Vw5B5Fbt5JUnQkI  6.268  6.451  6.379  6.371  6.458  6.333

    Combination DataFrame:
                          A1_A2   A1_A3   A1_B1   A1_B2   A1_B3   A2_A3   A2_B1   A2_B2   A2_B3   A3_B1   A3_B2   A3_B3   B1_B2   B1_B3   B2_B3
    ID
    Ku8QhfS0n_hIOABXuE    0.039  -0.067   0.056  -0.060   0.064  -0.106   0.017  -0.099   0.025   0.123   0.007   0.131  -0.116   0.008   0.124
    fqPEquJRRlSVSfL.8A    0.071   0.072   0.075   0.227   0.013   0.001   0.004   0.156  -0.058   0.003   0.155  -0.059   0.152  -0.062  -0.214
    ckiehnugOno9d7vf1Q    0.049  -0.227  -0.085  -0.019  -0.156  -0.276  -0.134  -0.068  -0.205   0.142   0.208   0.071   0.066  -0.071  -0.137
    x57Vw5B5Fbt5JUnQkI   -0.183  -0.111  -0.103  -0.190  -0.065   0.072   0.080  -0.007   0.118   0.008  -0.079   0.046  -0.087   0.038   0.125
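Note that combinations() over all columns also produces within-group pairs such as A1_A2. If only the cross-group pairs from the question are wanted, one possible alternative is to compute them in a single NumPy broadcasting step instead of a Python loop. The following is a minimal sketch, not a measured optimization: the helper name `pairwise_differences` is hypothetical, `group1` and `group2` are assumed to be lists of column names, and whether it is actually faster should be timed on the real data.

```python
import numpy as np
import pandas as pd

def pairwise_differences(matrix, group1, group2):
    """Hypothetical helper: one column per (observed, reference) pair."""
    observed = matrix[group1].to_numpy()    # shape (n_rows, len(group1))
    reference = matrix[group2].to_numpy()   # shape (n_rows, len(group2))
    # Broadcasting: (n, g1, 1) - (n, 1, g2) -> (n, g1, g2)
    diffs = observed[:, :, None] - reference[:, None, :]
    # Row-major flattening of the last two axes gives the same order
    # as itertools.product(group1, group2)
    names = [a + "_" + b for a in group1 for b in group2]
    return pd.DataFrame(
        diffs.reshape(len(matrix), -1), index=matrix.index, columns=names
    )

# Small check on the first two sample rows (subset of the columns)
matrix = pd.DataFrame(
    {
        "A1": [6.343, 6.752],
        "A2": [6.304, 6.681],
        "B1": [6.287, 6.677],
        "B2": [6.403, 6.525],
    },
    index=pd.Index(["Ku8QhfS0n_hIOABXuE", "fqPEquJRRlSVSfL.8A"], name="ID"),
)
result = pairwise_differences(matrix, ["A1", "A2"], ["B1", "B2"])
print(result.round(3))
```

The design choice here is to build the whole result array at once and wrap it in a DataFrame a single time, rather than inserting columns one by one.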

Source: https://habr.com/ru/post/1443034/

