Comparing rows of two pandas dataframes?

This is a follow-up to my earlier question, "The fastest way to compare rows of two pandas dataframes".

I have two dataframes, A and B.

A has 1000 rows and 500 columns filled with binary values indicating either presence or absence.

For a concise example:

    A   B   C   D   E  
0   0   0   0   1   0  
1   1   1   1   1   0  
2   1   0   0   1   1  
3   0   1   1   1   0  

B is 1024 rows x 10 columns and is a complete iteration from 0 to 1023 in binary form.

Example (shown with 3 columns rather than 10):

     0   1   2  
0    0   0   0  
1    0   0   1   
2    0   1   0  
3    0   1   1  
4    1   0   0
5    1   0   1
6    1   1   0 
7    1   1   1

I am trying to find which rows of A, restricted to a specific set of 10 columns, correspond to each row of B.

Each row of A[My_Columns_List] is guaranteed to appear somewhere in B, but not every row of B will match a row of A[My_Columns_List].

For example, for the columns [B,D,E] of A, I want to show that

rows [1,3] of A match row [6] of B,

row [0] of A matches row [2] of B,

row [2] of A matches row [3] of B.

I have tried using:

pd.merge(B.reset_index(), A.reset_index(),
         left_on=B.columns.tolist(),
         right_on=A.columns[My_Columns_List].tolist(),
         suffixes=('_B', '_A'))
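
A minimal sketch of this merge on the small example above, assuming the 3-column version of B and taking My_Columns_List to be the positions of columns B, D and E in A. The toy frames and the names `merged` and `pairs` are illustrative only.

```python
import pandas as pd

# Toy A from the example above (4 rows, columns A..E).
A = pd.DataFrame(
    [[0, 0, 0, 1, 0],
     [1, 1, 1, 1, 0],
     [1, 0, 0, 1, 1],
     [0, 1, 1, 1, 0]],
    columns=list("ABCDE"),
)
# Toy B: 0..7 in binary, most significant bit in column 0.
B = pd.DataFrame([[(i >> shift) & 1 for shift in (2, 1, 0)] for i in range(8)])

My_Columns_List = [1, 3, 4]  # assumed: positions of columns B, D, E in A

merged = pd.merge(
    B.reset_index(), A.reset_index(),
    left_on=B.columns.tolist(),
    right_on=A.columns[My_Columns_List].tolist(),
    suffixes=("_B", "_A"),
)

# (row of A, matching row of B), sorted by row of A
pairs = sorted((int(a), int(b)) for a, b in zip(merged["index_A"], merged["index_B"]))
print(pairs)  # [(0, 2), (1, 6), (2, 3), (3, 6)]
```

The pairs agree with the mapping listed above: rows 1 and 3 of A go to row 6 of B, row 0 to row 2, and row 2 to row 3.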

I have also tried the following:

S = 2**np.arange(10)
A_ID = np.dot(A[My_Columns_List],S)
B_ID = np.dot(B,S)
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]

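However, out_row_idx then contains every index of A, which tells me nothing: it is just an array from 0 to 999. Any help would be appreciated!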

Also, credit to @jezrael and @Divakar for these approaches.
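
One possible explanation for the np.in1d result, sketched under the assumption that B really contains every 10-bit pattern: every A_ID then occurs somewhere in B_ID, so the membership test is true for all rows and np.where returns every index. To get the matching row of B instead, one option is a lookup table from ID to row position. The random data and the names `lookup` and `B_row_for_A` are illustrative; np.isin is used as the current spelling of np.in1d.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits = 10

# Stand-in for A[My_Columns_List]: 1000 rows of 10 random bits.
A_bits = rng.integers(0, 2, size=(1000, n_bits))
# Stand-in for B: every 10-bit pattern, most significant bit first.
B_bits = (np.arange(2 ** n_bits)[:, None] >> np.arange(n_bits)[::-1]) & 1

S = 2 ** np.arange(n_bits)
A_ID = A_bits @ S
B_ID = B_bits @ S

# Membership is true for every row, because B covers all patterns.
all_present = np.isin(A_ID, B_ID)

# lookup[code] is the row of B whose ID equals code.
lookup = np.empty(2 ** n_bits, dtype=np.intp)
lookup[B_ID] = np.arange(len(B_ID))
B_row_for_A = lookup[A_ID]
```

Because the table is built from B_ID itself, this does not depend on B being sorted or on which column holds the lowest bit.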



I am assuming 2 pandas dataframes as the inputs. This can be done in numpy, but I will start with dataframes.

First, set up a 1000 x 500 dataframe of random binary values.

import numpy as np
import pandas as pd
A_init = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
# Integer division: range() needs an int under Python 3
A_init.columns = pd.MultiIndex.from_product([range(A_init.shape[1] // 10), range(10)])
A = A_init

Note that A now has a MultiIndex on its columns, grouping them into blocks of 10.

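As @Divakar pointed out, each group of 10 ones and zeros can be read as a single 10-bit number. To get that number, multiply each position by the corresponding power of 2 and add the results.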

twos = 2 ** np.arange(10)

Then, for every group of 10 columns, I take the dot product with these powers:

AtB = A.stack(0).dot(twos).unstack()

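I stack so that the 50 groups of 10 columns become rows, take the dot product with the powers of 2, and then unstack to turn the groups back into columns.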

The result is a 1000 x 50 dataframe of numbers from 0 to 1023.
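
The stack, dot, unstack sequence can be traced on a tiny frame. This is only a sketch: it uses 2 groups of 3 bits instead of 50 groups of 10, and `A_small`, `stacked` and `codes` are illustrative names.

```python
import numpy as np
import pandas as pd

# 2 rows, 6 binary columns, read as 2 groups of 3 bits.
A_small = pd.DataFrame([[1, 0, 1, 0, 1, 1],
                        [0, 0, 0, 1, 1, 1]])
A_small.columns = pd.MultiIndex.from_product([range(2), range(3)])

twos = 2 ** np.arange(3)  # [1, 2, 4]: the first column of a group is the lowest bit

stacked = A_small.stack(0)            # index: (row, group); columns: bit position
codes = stacked.dot(twos).unstack()   # back to one row per original row, one column per group
print(codes.to_numpy().tolist())      # [[5, 6], [0, 7]]
```

For row 0, the first group [1, 0, 1] gives 1 + 4 = 5 and the second group [0, 1, 1] gives 2 + 4 = 6.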

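This assumes B is in order, so that its 1024 rows run from 0 to 1023 by index. If B is not sorted, sort it with B = B.sort_values(by=B.columns.tolist()).reset_index(drop=True).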


AtB.loc[:2, :2]


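Looking at position (0, 0), the value 951 means that the first group of 10 columns in row 0 of A matches the row of B with index 951. In other words, the dataframe already holds the row indices of B, because B runs from 0 to 1023 in order. No separate lookup in B is needed.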
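
Whether a computed value equals the row index of B depends on the bit order, which the text does not state. A small check, assuming B is laid out as in the question's example with the highest bit in the first column; `B_small`, `low_first` and `high_first` are illustrative names.

```python
import numpy as np

# 3-bit B as in the question's example: row i is i in binary, highest bit first.
B_small = np.array([[(i >> shift) & 1 for shift in (2, 1, 0)] for i in range(8)])

low_first = 2 ** np.arange(3)         # [1, 2, 4], the same ordering as `twos`
high_first = 2 ** np.arange(3)[::-1]  # [4, 2, 1]

print((B_small @ low_first).tolist())   # [0, 4, 2, 6, 1, 5, 3, 7]
print((B_small @ high_first).tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Under that layout, the reversed powers reproduce the row index directly, while the low-bit-first powers give a permutation of it. Either way the codes identify rows uniquely, so the two sides only need to be encoded with the same weights.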

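So here is a function that takes dataframes A and B and returns, for each group of 10 columns in A, the matching row of B. Note that B is not actually needed.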

def FindAinB(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    # // keeps the number of groups an int under Python 3
    A.columns = pd.MultiIndex.from_product([range(A.shape[1] // 10), range(10)])

    twos = 2 ** np.arange(10)

    return A.stack(0).dot(twos).unstack()

def FindAinB2(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    # // keeps the number of groups an int under Python 3
    A.columns = pd.MultiIndex.from_product([range(A.shape[1] // 10), range(10)])
    # use clever bit shifting instead of dot product with powers
    # questionable improvement
    return (A.stack(0) << np.arange(10)).sum(1).unstack()

A numpy version in the spirit of @Divakar's approach (thanks, Divakar):

def FindAinB3(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    a = A.values.reshape(-1, 10)
    a = np.einsum('ij->i', a << np.arange(10))
    return pd.DataFrame(a.reshape(A.shape[0], -1), A.index)

f = lambda A: pd.DataFrame(np.einsum('ij->i', A.values.reshape(-1, 10) << np.arange(10)).reshape(A.shape[0], -1), A.index)

f(A)

Timing

FindAinB3
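
The original timing figures are not reproduced here. A rough way to measure this yourself is sketched below, assuming random 1000 x 500 input; `via_stack` and `via_reshape` are hypothetical stand-ins for the pandas and numpy variants above, and the numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
A = pd.DataFrame(rng.integers(0, 2, size=(1000, 500)))

def via_stack(A, width=10):
    """Hypothetical stand-in for the pandas stack/dot/unstack variant."""
    A = A.copy()  # leave the caller's column labels untouched
    A.columns = pd.MultiIndex.from_product([range(A.shape[1] // width), range(width)])
    return A.stack(0).dot(2 ** np.arange(width)).unstack()

def via_reshape(A, width=10):
    """Hypothetical stand-in for the numpy reshape/shift/einsum variant."""
    a = A.to_numpy().reshape(-1, width)
    codes = np.einsum('ij->i', a << np.arange(width))
    return pd.DataFrame(codes.reshape(A.shape[0], -1), index=A.index)

# Both variants should produce the same 1000 x 50 table of codes.
same = np.array_equal(via_stack(A).to_numpy(), via_reshape(A).to_numpy())

t_stack = timeit.timeit(lambda: via_stack(A), number=5)
t_reshape = timeit.timeit(lambda: via_reshape(A), number=5)
print(f"stack: {t_stack:.4f}s  reshape: {t_reshape:.4f}s  same result: {same}")
```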



Source: https://habr.com/ru/post/1648593/

