Python dictionary lookup speed with NumPy datatypes

BACKGROUND

I have a lot of numeric message codes in a NumPy array, and I need to convert them to strings quickly. I ran into performance problems and would like to understand why they occur and how to make the conversion fast.

SOME BENCHMARKS

I - The trivial approach

import numpy as np

# dictionary to use as the lookup dictionary
lookupdict = {
     1: "val1",
     2: "val2",
    27: "val3",
    35: "val4",
    59: "val5" }

# some test data
arr = np.random.choice(list(lookupdict), 1000000)  # list(): keys() is a view on Python 3

# create a list of words looked up
res = [ lookupdict[k] for k in arr ]

The dictionary lookup takes the better part of my coffee break: 758 ms. (I also tried res = map(lookupdict.get, arr), but that is even worse.)

II - Without NumPy

import random

# dictionary to use as the lookup dictionary
lookupdict = {
     1: "val1",
     2: "val2",
    27: "val3",
    35: "val4",
    59: "val5" }

# some test data
keys = list(lookupdict)  # random.choice needs a sequence, not a keys() view
arr = [ random.choice(keys) for _ in range(1000000) ]

# create a list of words looked up
res = [ lookupdict[k] for k in arr ]

The timing changes considerably, down to 76 ms!

Note that I am interested in timing the lookup only. The random generation just creates some test data, and how long it takes does not matter. All benchmark results given here are for the one million lookups alone.

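III - Converting the NumPy array to a list

My first guess was that this is a list-versus-array issue. However, changing the NumPy version to iterate over a list:

res = [ lookupdict[k] for k in list(arr) ]

takes 778 ms, of which about 110 ms is spent on the list conversion and 570 ms on the lookups. So this is no faster, and the conversion as such does not help.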

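IV - Type conversion from np.int32 to int

Since the only other evident difference is the data type (np.int32 vs. int), I tried converting the types "on the fly". This seems a bit silly, because the dict presumably has to do something similar anyway:

res = [ lookupdict[int(k)] for k in arr ]

However, this does seem to do something interesting, as the time drops to 266 ms. It looks as if almost-but-not-quite-identical data types play nasty tricks with dictionary lookups, and the dict code is not very efficient at handling the mismatch.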

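V - Converting the dictionary keys to np.int32

To test this, I modified the NumPy version to use exactly the same data type for the dict keys and for the lookup: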

import numpy as np

# dictionary to use as the lookup dictionary
lookupdict = {
     np.int32(1): "val1",
     np.int32(2): "val2",
    np.int32(27): "val3",
    np.int32(35): "val4",
    np.int32(59): "val5" }

# some test data
arr = np.random.choice(list(lookupdict), 1000000)  # list(): keys() is a view on Python 3

# create a list of words looked up
res = [ lookupdict[k] for k in arr ]

This improves the time to 177 ms. Not an insignificant improvement, but still a far cry from 76 ms.

VI - Converting the array to hold int objects

import random

import numpy as np

# dictionary to use as the lookup dictionary
lookupdict = {
     1: "val1",
     2: "val2",
    27: "val3",
    35: "val4",
    59: "val5" }

# some test data: a NumPy object array holding plain Python int objects
keys = list(lookupdict)
arr = np.array([ random.choice(keys) for _ in range(1000000) ],
               dtype='object')

# create a list of words looked up
res = [ lookupdict[k] for k in arr ]

This gives 86 ms, which is very close to the pure-Python 76 ms.

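Result summary

  • dict keys int, indexing with int (pure Python): 76 ms
  • dict keys int, indexing with int objects (NumPy): 86 ms
  • dict keys np.int32, indexing with np.int32: 177 ms
  • dict keys int, indexing with np.int32: 758 ms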
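
The four cases in the summary can be re-run on a current Python 3 / NumPy installation with a sketch along these lines. This is not the original benchmark script: the array is scaled down to 100,000 elements, and the seed and the names cases and timings are assumptions made for illustration. Absolute numbers will differ from the ones quoted above.

```python
import timeit

import numpy as np

lookupdict = {1: "val1", 2: "val2", 27: "val3", 35: "val4", 59: "val5"}
# Same mapping, but with np.int32 keys (case V in the question).
lookupdict_np = {np.int32(k): v for k, v in lookupdict.items()}

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
arr = rng.choice(list(lookupdict), 100_000).astype(np.int32)
arr_list = arr.tolist()                      # list of Python int
arr_obj = np.array(arr_list, dtype=object)   # NumPy array of Python int objects

cases = {
    "dict int, index int (list)": lambda: [lookupdict[k] for k in arr_list],
    "dict int, index int objects (NumPy)": lambda: [lookupdict[k] for k in arr_obj],
    "dict np.int32, index np.int32": lambda: [lookupdict_np[k] for k in arr],
    "dict int, index np.int32": lambda: [lookupdict[k] for k in arr],
}

timings = {}
for name, fn in cases.items():
    # Best of three single runs, in seconds.
    timings[name] = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{name}: {timings[name] * 1e3:.1f} ms")
```

All four variants produce the same list of strings; only the key and index types differ.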

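QUESTION(S)

Why? And what can I do to make the dictionary lookups as fast as possible? My input data is a NumPy array, so the best option so far (fast but ugly) is to convert the dict keys to np.int32. (Unfortunately, the dict keys may be spread over a large range of numbers, so direct array indexing is not a viable alternative. It would be fast, though: 10 ms.)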

+4
5

In my timings, your II - Without NumPy is quite a bit slower than I:

In [11]: timeit [lookupdict[k] for k in np.random.choice(lookupdict.keys(),1000000)]
1 loops, best of 3: 658 ms per loop

In [12]: timeit [lookupdict[k] for k in [np.random.choice(lookupdict.keys()) for _ in range(1000000)]]
1 loops, best of 3: 8.04 s per loop

But if the lookup is skipped by making the choice directly on the values, the time is much lower:

In [34]: timeit np.random.choice(lookupdict.values(),1000000)
10 loops, best of 3: 85.3 ms per loop

OK, let's focus on the lookup itself:

In [26]: arr =np.random.choice(lookupdict.keys(),1000000)

In [27]: arrlist=arr.tolist()

In [28]: timeit res = [lookupdict[k] for k in arr]
1 loops, best of 3: 583 ms per loop

In [29]: timeit res = [lookupdict[k] for k in arrlist]
10 loops, best of 3: 120 ms per loop

In [30]: timeit res = [lookupdict[k] for k in list(arr)]
1 loops, best of 3: 675 ms per loop

In [31]: timeit res = [lookupdict[k] for k in arr.tolist()]
10 loops, best of 3: 156 ms per loop

In [32]: timeit res = [k for k in arr]
1 loops, best of 3: 215 ms per loop

In [33]: timeit res = [k for k in arrlist]
10 loops, best of 3: 51.4 ms per loop

In [42]: timeit arr.tolist()
10 loops, best of 3: 33.6 ms per loop

In [43]: timeit list(arr)
1 loops, best of 3: 264 ms per loop

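First observation: iterating over an np.array is slower than iterating over the equivalent list.

Second: list(arr) is slower than arr.tolist(). list() appears to have 2 problems. It is slower by itself, and the items it produces are np.int32.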
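
The difference in element types is easy to check directly. A minimal sketch, using a small hand-written array rather than the test data from the session above:

```python
import numpy as np

arr = np.array([1, 2, 27, 35, 59], dtype=np.int32)

via_list = list(arr)        # iterates the array, yielding NumPy scalar objects
via_tolist = arr.tolist()   # converts the elements to built-in Python objects

print(type(via_list[0]))    # a NumPy integer scalar type
print(type(via_tolist[0]))  # the built-in int
```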

+2

As you suspected, int32.__hash__ is at fault here: it is about 11x slower than int.__hash__:

%timeit hash(5)
10000000 loops, best of 3: 39.2 ns per loop
%timeit hash(np.int32(5))
1000000 loops, best of 3: 444 ns per loop

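(The int32 type is implemented in C. If you are really curious, you can dig into the source code and find out what it does there that takes so long.)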


EDIT:

A second thing that slows the lookup down is the implicit == comparison performed during the dict lookup:

a = np.int32(5)
b = np.int32(5)
%timeit a == b  # comparing two int32's
10000000 loops, best of 3: 61.9 ns per loop
%timeit a == 5  # comparing int32 against int -- much slower
100000 loops, best of 3: 2.62 us per loop

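This explains why your V is so much faster than I and IV. Of course, sticking with an all-int solution would be faster still.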
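
To see both effects on a single lookup, something like the following can be used. It is only a sketch: the variable names and the repetition count n are arbitrary, and the printed times depend on the machine and the NumPy version.

```python
import timeit

import numpy as np

lookupdict = {1: "val1", 2: "val2", 27: "val3", 35: "val4", 59: "val5"}
key = np.int32(27)

# An np.int32 key hashes to the same value as the equal int and compares equal
# to it, which is why the mixed-type lookup finds the entry at all.
print(hash(key) == hash(27), key == 27, lookupdict[key])

n = 100_000
t_int = timeit.timeit("d[k]", globals={"d": lookupdict, "k": 27}, number=n)
t_np = timeit.timeit("d[k]", globals={"d": lookupdict, "k": key}, number=n)
t_conv = timeit.timeit("d[int(k)]", globals={"d": lookupdict, "k": key}, number=n)

print(f"int key:           {t_int / n * 1e9:.0f} ns per lookup")
print(f"np.int32 key:      {t_np / n * 1e9:.0f} ns per lookup")
print(f"int(np.int32) key: {t_conv / n * 1e9:.0f} ns per lookup")
```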


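So, as I see it, you have two options:

  • stick with the pure int type, or convert to int before the dict lookup
  • if the largest code value is not too big, and/or memory is not an issue, you can trade dict lookups for list indexing, which does not require hashing.

E.g.: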

lookuplist = [None] * (max(lookupdict.keys()) + 1)
for k,v in lookupdict.items():
    lookuplist[k] = v

res = [ lookuplist[k] for k in arr ] # using list indexing

(EDIT: you might also want to experiment with np.choose here.)

+5

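This is interesting: I may have found an answer to my own question.

Alternative III was to convert the array to a list. It seems this gives very good results if done the right way. This:

res = [ lookupdict[k] for k in list(arr) ]

takes 778 ms.

But this:

res = [ lookupdict[k] for k in arr.tolist() ]

takes 86 ms.

The technical explanation is that arr.tolist converts the array elements to int objects, whereas list(arr) creates a list of np.int32 objects.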

+3

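It seems nobody has mentioned this yet, so here is another option: store lookupdict itself as a NumPy array and index it with the data. The table needs one slot per possible key value, so this only suits keys that are not too large.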

import numpy as np

# dictionary to use as the lookup dictionary
lookupdict = {
     1: "val1",
     2: "val2",
    27: "val3",
    35: "val4",
    59: "val5" }

# some test data
arr = np.random.choice(list(lookupdict), 1000000)  # list(): keys() is a view on Python 3

# 'U4' holds 4-character text strings; 'S4' would give bytes on Python 3
table = np.empty(max(lookupdict.keys()) + 1, dtype='U4')
for key, value in lookupdict.items():
    table[key] = value

res = table[arr]
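
When the keys are spread over a wide range (the concern raised in the question), a dense table like the one above may be too large. One possible workaround, which does not come from the answers in this thread, is to let np.unique reduce the data to its distinct codes first, so that the dict is consulted once per distinct code only. A sketch, with one deliberately huge key as an assumed example:

```python
import numpy as np

# Hypothetical lookup table with one very large key.
lookupdict = {1: "val1", 2: "val2", 27: "val3", 35: "val4", 1_000_000_007: "val5"}

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
arr = rng.choice(list(lookupdict), 1000)

# uniq holds each distinct code once (sorted); inverse gives, for every
# element of arr, the position of its code in uniq.
uniq, inverse = np.unique(arr, return_inverse=True)

# Only the distinct codes go through the dict, as plain Python ints.
strings = np.array([lookupdict[k] for k in uniq.tolist()])

# Fancy indexing expands the few looked-up strings back to full length.
res = strings[inverse]
```

The cost of the dict lookups then depends on the number of distinct codes rather than on the length of the array, at the price of the sort that np.unique performs.
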
0

Here is a solution using Pandas:

import numpy as np
import pandas as pd

# dictionary to use as the lookup dictionary
lookupdict = {
 1: "val1",
 2: "val2",
27: "val3",
35: "val4",
59: "val5" }

# some test data
arr = np.random.choice(list(lookupdict), 1000000)  # list(): keys() is a view on Python 3

# create a list of words looked up
%timeit res = [ lookupdict[k] for k in arr ]
%timeit res_pd = pd.Series(lookupdict).reindex(arr).values
print(all(res == res_pd))

10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 35.3 ms per loop
True

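That is 35 ms, compared with 192 ms for the plain Python loop. A Pandas Series behaves much like an OrderedDict, but it is not built on a Python dict. reindex performs the lookup; its internals are not examined here, but the work is presumably done in C or Cython rather than in a Python loop. Finally, values extracts the underlying array from the Series.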

EDIT: here is a NumPy-only solution that is nearly as fast as the Pandas one:

# list() is required on Python 3, where keys()/values() return views
keys = np.array(list(lookupdict.keys()))
strings = np.array(list(lookupdict.values()))
%timeit res_np = strings[(np.atleast_2d(arr).T == keys).argmax(axis=1)]
10 loops, best of 3: 44.6 ms per loop

print(all(res == res_np))
True
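
The broadcast comparison above builds a boolean matrix with one row per element and one column per key, which grows quickly as the dict gets larger. A possible variant, not part of the original answer, is to sort the keys once and use np.searchsorted. The sketch below assumes every code in arr should be present in the dict and raises KeyError otherwise; the names idx and found are illustrative.

```python
import numpy as np

lookupdict = {1: "val1", 2: "val2", 27: "val3", 35: "val4", 59: "val5"}

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
arr = rng.choice(list(lookupdict), 1000)

# Sorted keys and the strings in matching order.
keys = np.array(sorted(lookupdict))
strings = np.array([lookupdict[k] for k in keys.tolist()])

# Binary search: position of each code within the sorted keys.
idx = np.searchsorted(keys, arr)
idx = np.clip(idx, 0, len(keys) - 1)  # keep out-of-range codes indexable

# searchsorted also returns a position for codes that are absent,
# so verify that each position really holds the requested code.
found = keys[idx] == arr
if not found.all():
    raise KeyError(f"unknown codes: {np.unique(arr[~found]).tolist()}")

res_sorted = strings[idx]
```
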
0

Source: https://habr.com/ru/post/1548126/

