Pandas outer product of two dataframes with the same index

Consider the following dataframes d1 and d2

import numpy as np
import pandas as pd

d1 = pd.DataFrame([
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5]
], columns=list('ABC'))

# dtype=int keeps the 0/1 values shown below (pandas 2.0+ returns booleans by default)
d2 = pd.get_dummies(list('XYZZXY'), dtype=int)

d1

   A  B  C
0  1  2  3
1  2  3  4
2  3  4  5
3  1  2  3
4  2  3  4
5  3  4  5

d2

   X  Y  Z
0  1  0  0
1  0  1  0
2  0  0  1
3  0  0  1
4  1  0  0
5  0  1  0

I need a new dataframe with MultiIndex columns that holds the product of every combination of columns from d1 and d2.


This is what I have so far:

from itertools import product
pd.concat({(x, y): d1[x] * d2[y] for x, y in product(d1, d2)}, axis=1)

   A        B        C      
   X  Y  Z  X  Y  Z  X  Y  Z
0  1  0  0  2  0  0  3  0  0
1  0  2  0  0  3  0  0  4  0
2  0  0  3  0  0  4  0  0  5
3  0  0  1  0  0  2  0  0  3
4  2  0  0  3  0  0  4  0  0
5  0  3  0  0  4  0  0  5  0

There is nothing wrong with this method, but I am looking for alternatives to evaluate.


Inspired by Yakim Pirozhenko

m, n = len(d1.columns), len(d2.columns)
lvl0 = np.repeat(np.arange(m), n)
lvl1 = np.tile(np.arange(n), m)
v1, v2 = d1.values, d2.values

pd.DataFrame(
    v1[:, lvl0] * v2[:, lvl1],
    d1.index,
    pd.MultiIndex.from_tuples(list(zip(d1.columns[lvl0], d2.columns[lvl1])))
)

However, this is a clumsier implementation of NumPy broadcasting, which is covered better in Divakar's answer.
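For intuition about the np.repeat / np.tile pairing used above, here is a minimal sketch. The sizes m = n = 3 match the example frames, and the name pairs is only illustrative. Repeating the first set of positions and tiling the second lines them up so that every (first, second) combination appears exactly once, in the same order as itertools.product.

```python
import numpy as np

# Illustrative sizes: m columns in the first frame, n in the second.
m, n = 3, 3
lvl0 = np.repeat(np.arange(m), n)  # each first-frame position repeated n times
lvl1 = np.tile(np.arange(n), m)    # second-frame positions cycled m times

print(lvl0)  # [0 0 0 1 1 1 2 2 2]
print(lvl1)  # [0 1 2 0 1 2 0 1 2]

# Zipping the two arrays enumerates every column pairing once.
pairs = list(zip(lvl0.tolist(), lvl1.tolist()))
print(pairs[:4])  # [(0, 0), (0, 1), (0, 2), (1, 0)]
```

Indexing the value arrays with lvl0 and lvl1 therefore materialises two (rows, m*n) arrays whose elementwise product is the desired result, which is the extra copying that the broadcasting version avoids.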

Timing
All of the answers were good and demonstrated different aspects of pandas and NumPy. Please consider upvoting them if you found them useful and informative.

%%timeit
m, n = len(d1.columns), len(d2.columns)
lvl0 = np.repeat(np.arange(m), n)
lvl1 = np.tile(np.arange(n), m)
v1, v2 = d1.values, d2.values

pd.DataFrame(
    v1[:, lvl0] * v2[:, lvl1],
    d1.index,
    pd.MultiIndex.from_tuples(list(zip(d1.columns[lvl0], d2.columns[lvl1])))
)

%%timeit 
vals = (d2.values[:,None,:] * d1.values[:,:,None]).reshape(d1.shape[0],-1)
cols = pd.MultiIndex.from_product([d1.columns, d2.columns])
pd.DataFrame(vals, columns=cols, index=d1.index)

%timeit d1.apply(lambda x: d2.mul(x, axis=0).stack()).unstack()
%timeit pd.concat({x : d2.mul(d1[x], axis=0) for x in d1.columns}, axis=1)
%timeit pd.concat({(x, y): d1[x] * d2[y] for x, y in product(d1, d2)}, axis=1)

1000 loops, best of 3: 663 µs per loop
1000 loops, best of 3: 624 µs per loop
100 loops, best of 3: 3.38 ms per loop
1000 loops, best of 3: 860 µs per loop
100 loops, best of 3: 2.01 ms per loop
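
The timings are only comparable if the approaches return the same frame. Below is a minimal sketch of that check. It rebuilds d1 and d2 from the question so that it runs on its own. The variable names and the dtype=int argument are my additions, not part of the original answers.

```python
from itertools import product

import numpy as np
import pandas as pd

d1 = pd.DataFrame(
    [[1, 2, 3], [2, 3, 4], [3, 4, 5], [1, 2, 3], [2, 3, 4], [3, 4, 5]],
    columns=list('ABC'),
)
# Assumption: dtype=int so the dummies are 0/1 integers on every pandas version.
d2 = pd.get_dummies(list('XYZZXY'), dtype=int)

# Three of the timed approaches, each stored under its own name.
by_pair = pd.concat({(x, y): d1[x] * d2[y] for x, y in product(d1, d2)}, axis=1)
by_column = pd.concat({x: d2.mul(d1[x], axis=0) for x in d1.columns}, axis=1)
vals = (d2.values[:, None, :] * d1.values[:, :, None]).reshape(d1.shape[0], -1)
by_broadcast = pd.DataFrame(
    vals, index=d1.index,
    columns=pd.MultiIndex.from_product([d1.columns, d2.columns]),
)

# Compare the raw values and the column labels separately.
same_values = (np.array_equal(by_pair.values, by_column.values)
               and np.array_equal(by_pair.values, by_broadcast.values))
same_columns = (list(by_pair.columns) == list(by_column.columns)
                == list(by_broadcast.columns))
print(same_values, same_columns)
```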

With NumPy broadcasting:

vals = (d2.values[:,None,:] * d1.values[:,:,None]).reshape(d1.shape[0],-1)
cols = pd.MultiIndex.from_product([d1.columns, d2.columns])
df_out = pd.DataFrame(vals, columns=cols, index=d1.index)

Sample run:

In [92]: d1
Out[92]: 
   A  B  C
0  1  2  3
1  2  3  4
2  3  4  5
3  1  2  3
4  2  3  4
5  3  4  5

In [93]: d2
Out[93]: 
   X  Y  Z
0  1  0  0
1  0  1  0
2  0  0  1
3  0  0  1
4  1  0  0
5  0  1  0

In [110]: vals = (d2.values[:,None,:] * d1.values[:,:,None]).reshape(d1.shape[0],-1)
     ...: cols = pd.MultiIndex.from_product([d1.columns, d2.columns])
     ...: df_out = pd.DataFrame(vals, columns=cols, index=d1.index)
     ...: 

In [111]: df_out
Out[111]: 
   A        B        C      
   X  Y  Z  X  Y  Z  X  Y  Z
0  1  0  0  2  0  0  3  0  0
1  0  2  0  0  3  0  0  4  0
2  0  0  3  0  0  4  0  0  5
3  0  0  1  0  0  2  0  0  3
4  2  0  0  3  0  0  4  0  0
5  0  3  0  0  4  0  0  5  0
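
To make the shape bookkeeping in the broadcasting step explicit, here is a small sketch on two hypothetical 2x3 arrays standing in for d1.values and d2.values. The names a, b, left, right, cube and flat are illustrative only. The np.einsum call is included purely as a cross-check of what the broadcast product computes.

```python
import numpy as np

# Hypothetical stand-ins for d1.values and d2.values (2 rows, 3 columns each).
a = np.array([[1, 2, 3],
              [2, 3, 4]])
b = np.array([[1, 0, 0],
              [0, 1, 0]])

left = a[:, :, None]    # shape (2, 3, 1): a's columns along axis 1
right = b[:, None, :]   # shape (2, 1, 3): b's columns along axis 2
cube = right * left     # shape (2, 3, 3): cube[r, i, j] == a[r, i] * b[r, j]

# Flattening the last two axes puts b's columns innermost, which matches the
# ordering of MultiIndex.from_product([d1.columns, d2.columns]).
flat = cube.reshape(a.shape[0], -1)
print(flat)
# [[1 0 0 2 0 0 3 0 0]
#  [0 2 0 0 3 0 0 4 0]]

# The same per-row outer product written with einsum.
assert np.array_equal(cube, np.einsum('ri,rj->rij', a, b))
```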

Another option is to multiply d2 by each column of d1 and concatenate the results:

In [846]: pd.concat({x : d2.mul(d1[x], axis=0) for x in d1.columns}, axis=1)
Out[846]:
   A        B        C
   X  Y  Z  X  Y  Z  X  Y  Z
0  1  0  0  2  0  0  3  0  0
1  0  2  0  0  3  0  0  4  0
2  0  0  3  0  0  4  0  0  5
3  0  0  1  0  0  2  0  0  3
4  2  0  0  3  0  0  4  0  0
5  0  3  0  0  4  0  0  5  0

Here is a one-liner that uses pandas stack and unstack. The "trick" is to use stack so that the result of each computation inside apply is a Series. Then use unstack to get the MultiIndex form.

d1.apply(lambda x: d2.mul(x, axis=0).stack()).unstack()

Which gives:

     A              B              C          
     X    Y    Z    X    Y    Z    X    Y    Z
0  1.0  0.0  0.0  2.0  0.0  0.0  3.0  0.0  0.0
1  0.0  2.0  0.0  0.0  3.0  0.0  0.0  4.0  0.0
2  0.0  0.0  3.0  0.0  0.0  4.0  0.0  0.0  5.0
3  0.0  0.0  1.0  0.0  0.0  2.0  0.0  0.0  3.0
4  2.0  0.0  0.0  3.0  0.0  0.0  4.0  0.0  0.0
5  0.0  3.0  0.0  0.0  4.0  0.0  0.0  5.0  0.0
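
The one-liner is easier to follow when split into its intermediate steps. Here is a sketch under the same d1 and d2 as the question. The names one_column, stacked and result, and the dtype=int argument, are my own additions.

```python
import pandas as pd

d1 = pd.DataFrame(
    [[1, 2, 3], [2, 3, 4], [3, 4, 5], [1, 2, 3], [2, 3, 4], [3, 4, 5]],
    columns=list('ABC'),
)
# Assumption: dtype=int so the dummies are 0/1 integers on every pandas version.
d2 = pd.get_dummies(list('XYZZXY'), dtype=int)

# For a single column of d1: scale every column of d2 row-wise, then stack so
# the (row label, d2 column) pairs become a two-level index on a Series.
one_column = d2.mul(d1['A'], axis=0).stack()

# apply repeats that for each column of d1, giving one stacked Series per column.
stacked = d1.apply(lambda x: d2.mul(x, axis=0).stack())

# unstack moves the inner index level (X, Y, Z) up into the columns.
result = stacked.unstack()

print(one_column.shape, stacked.shape, result.shape)  # (18,) (18, 3) (6, 9)
```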

You can first build the MultiIndex, use it to select columns so that both frames have the same shape, and then multiply them directly.

cols = pd.MultiIndex.from_tuples(
        [(c1, c2) for c1 in d1.columns for c2 in d2.columns])

a = d1.loc[:,cols.get_level_values(0)]
b = d2.loc[:,cols.get_level_values(1)]
a.columns = b.columns = cols

res = a * b
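
As a side note, pd.MultiIndex.from_product should build the same column index as the list comprehension above. Here is a minimal sketch that checks this and reruns the multiplication. The names cols_tuples, cols_product and same_index, and the dtype=int argument, are assumptions of mine rather than part of the answer.

```python
import pandas as pd

d1 = pd.DataFrame(
    [[1, 2, 3], [2, 3, 4], [3, 4, 5], [1, 2, 3], [2, 3, 4], [3, 4, 5]],
    columns=list('ABC'),
)
# Assumption: dtype=int so the dummies are 0/1 integers on every pandas version.
d2 = pd.get_dummies(list('XYZZXY'), dtype=int)

# Two ways to build the (d1 column, d2 column) pairs.
cols_tuples = pd.MultiIndex.from_tuples(
    [(c1, c2) for c1 in d1.columns for c2 in d2.columns])
cols_product = pd.MultiIndex.from_product([d1.columns, d2.columns])
same_index = list(cols_tuples) == list(cols_product)

# Select (with repetition) the columns each level asks for, relabel both
# frames with the shared MultiIndex, then multiply elementwise.
a = d1.loc[:, cols_product.get_level_values(0)]
b = d2.loc[:, cols_product.get_level_values(1)]
a.columns = b.columns = cols_product
res = a * b

print(same_index)
print(res[('C', 'Z')].tolist())  # [0, 0, 5, 3, 0, 0]
```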

Source: https://habr.com/ru/post/1682135/

