Define the shuffled indices of two lists / arrays

As a problem, I set myself this problem:

Given 2 lists, A and B, where B is the shuffled version of A, the idea is to define shuffled indexes.

For instance:

A = [10, 40, 30, 2] B = [30, 2, 10, 40] result = [2, 3, 0, 1] A[2] A[3] A[0] A[1] || || || || 30 2 10 40 

Please note that associations for identical elements can be resolved arbitrarily.

I came up with a solution that uses a dictionary to store indexes. What other possible solutions does this problem have? A solution using the library also works. Numpy, pandas, it's alright.

+5
source share
6 answers

As an improvement over your current solution, you can use collections.defaultdict and avoid dict.setdefault :

 from collections import defaultdict A = [10, 40, 30, 2] B = [30, 2, 10, 40] idx = defaultdict(list) for i, l in enumerate(A): idx[l].append(i) res = [idx[l].pop() for l in B] print(res) 

Below are the timings for the two methods using the input you entered:

Script used for testing

 from timeit import timeit setup = """ from collections import defaultdict; idx1 = defaultdict(list); idx2 = {} A = [10, 40, 30, 2] B = [30, 2, 10, 40] """ me = """ for i, l in enumerate(A): idx1[l].append(i) res = [idx1[l].pop() for l in B] """ coldspeed = """ for i, l in enumerate(A): idx2.setdefault(l, []).append(i) res = [idx2[l].pop() for l in B] """ print(timeit(setup=setup, stmt=me)) print(timeit(setup=setup, stmt=coldspeed)) 

results

 original: 2.601998388010543 modified: 2.0607256239745766 

Thus, it seems that using defaultdict actually gives a slight increase in speed. This actually happens because, since defaultdict is implemented in C, not Python. Not to mention that finding the attribute of the original solution - idx.setdefault1 - is expensive.

+3
source

We can use np.searchsorted with its optional sorter argument -

 sidx = np.argsort(B) out = sidx[np.searchsorted(B,A, sorter=sidx)] 

Run Example -

 In [19]: A = [10, 40, 30, 2, 40] ...: B = [30, 2, 10, 40] ...: In [20]: sidx = np.argsort(B) In [21]: sidx[np.searchsorted(B,A, sorter=sidx)] Out[21]: array([2, 3, 0, 1, 3]) 
+8
source

Lol

 pd.Series(A).reset_index().set_index(0).ix[B].T.values[0] #array([2, 3, 0, 1]) 
+3
source

As mentioned in my question, I was able to solve this using a dictionary. I store indexes in a dict and then use list comprehension to print them:

 A = [10, 40, 30, 2] B = [30, 2, 10, 40] idx = {} for i, l in enumerate(A): idx.setdefault(l, []).append(i) res = [idx[l].pop() for l in B] print(res) 

Output:

 [2, 3, 0, 1] 

This is better than the obvious [A.index(x) for x in B] because it

  • linear
  • competently processes duplicates.
+2
source

The numpy_indexed package has an effective and general solution for this:

 import numpy_indexed as npi result = npi.indices(A, B) 

Note that it has kwarg to set the mode to handle missing values; and it works with nd arrays of any type in exactly the same way as with 1d integer arrays.

+2
source

Since several very good solutions were posted, I allowed myself to collect some raw timings to compare each method.

Script used for testing

 from timeit import timeit setup = """ from collections import defaultdict import pandas as pd import numpy as np idx1 = defaultdict(list); idx2 = {} A = [10, 40, 30, 2] B = [30, 2, 10, 40] """ me = """ for i, l in enumerate(A): idx1[l].append(i) res = [idx1[l].pop() for l in B] """ coldspeed = """ for i, l in enumerate(A): idx2.setdefault(l, []).append(i) res = [idx2[l].pop() for l in B] """ divakar = """ sidx = np.argsort(B) res = sidx[np.searchsorted(B,A, sorter=sidx)] """ dyz = """ res = pd.Series(A).reset_index().set_index(0).ix[B].T.values[0] """ print('mine:', timeit(setup=setup, stmt=me, number=1000)) print('coldspeed:', timeit(setup=setup, stmt=coldspeed, number=1000)) print('divakar:', timeit(setup=setup, stmt=divakar, number=1000)) print('dyz:', timeit(setup=setup, stmt=dyz, number=1000)) 

Result / Output (runs on the Jupyter laptop server, 1000 cycles)

 mine: 0.0026700650341808796 coldspeed: 0.0029303128831088543 divakar: 0.02583012101240456 dyz: 2.208147854078561 

Below are some points in time when size A is 100,000 random numbers. And B is its shuffled equivalent. The program had too much time and memory. I also had to reduce the number of cycles to 100. Otherwise, everything will be the same as above:

 mine: 17.663535300991498 coldspeed: 17.11006522300886 divakar: 8.73397267702967 dyz: 44.61878849985078 
+1
source

Source: https://habr.com/ru/post/1271133/


All Articles