Since several very good solutions were posted, I took the liberty of collecting some raw timings to compare the methods.
Script used for testing
    from timeit import timeit

    setup = """
    from collections import defaultdict
    import pandas as pd
    import numpy as np
    idx1 = defaultdict(list); idx2 = {}
    A = [10, 40, 30, 2]
    B = [30, 2, 10, 40]
    """

    # Build a value -> positions map with a defaultdict, then pop positions off
    me = """
    for i, l in enumerate(A):
        idx1[l].append(i)
    res = [idx1[l].pop() for l in B]
    """

    # Same idea with a plain dict and setdefault
    coldspeed = """
    for i, l in enumerate(A):
        idx2.setdefault(l, []).append(i)
    res = [idx2[l].pop() for l in B]
    """

    # Vectorized NumPy: argsort + searchsorted
    divakar = """
    sidx = np.argsort(B)
    res = sidx[np.searchsorted(B, A, sorter=sidx)]
    """

    # pandas label lookup (.ix was removed in modern pandas;
    # .loc is the equivalent label-based lookup here)
    dyz = """
    res = pd.Series(A).reset_index().set_index(0).loc[B].T.values[0]
    """

    print('mine:', timeit(setup=setup, stmt=me, number=1000))
    print('coldspeed:', timeit(setup=setup, stmt=coldspeed, number=1000))
    print('divakar:', timeit(setup=setup, stmt=divakar, number=1000))
    print('dyz:', timeit(setup=setup, stmt=dyz, number=1000))
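As a quick sanity check (my addition, not part of the original benchmark), all four snippets produce the same result for these inputs: the position in A of each element of B.

    A = [10, 40, 30, 2]
    B = [30, 2, 10, 40]

    # Each element of B looked up by its position in A:
    # 30 -> index 2, 2 -> index 3, 10 -> index 0, 40 -> index 1
    assert [A.index(x) for x in B] == [2, 3, 0, 1]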
Result / output (run on a laptop Jupyter server, 1,000 iterations):
    mine: 0.0026700650341808796
    coldspeed: 0.0029303128831088543
    divakar: 0.02583012101240456
    dyz: 2.208147854078561
Below are the timings when A contains 100,000 random numbers and B is a shuffled copy of it. At this size the test took far more time and memory, so I also had to reduce the number of iterations to 100; everything else is the same as above (a sketch of how such inputs can be generated follows the results):
    mine: 17.663535300991498
    coldspeed: 17.11006522300886
    divakar: 8.73397267702967
    dyz: 44.61878849985078
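The large-input generation was not shown in the original script; a minimal sketch of how A and B could be built (the value range is an assumption):

    import random

    N = 100_000
    A = [random.randrange(N) for _ in range(N)]  # 100,000 random numbers (range is an assumption)
    B = A[:]
    random.shuffle(B)                            # B is a shuffled copy of A

Note how the ranking changes at this scale: the NumPy searchsorted approach, which carried noticeable overhead on four elements, is roughly twice as fast as the dict-based methods on 100,000 elements, while the pandas lookup remains the slowest.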