Dictionary lookup and insert

I see serious slowdowns when working with dictionaries once the dictionary grows to several thousand keys.

I am processing a file with ~1,000,000 rows of data and building a graph-like data structure using dictionaries.

Here is my bottleneck function:

def create_edge(node_a, node_b, graph):
    if node_a not in graph.keys():
        graph[node_a] = {node_b: 1}
    elif node_b in graph[node_a].keys():
        graph[node_a][node_b] += 1
    else:
        graph[node_a][node_b] = 1

create_edge creates an edge from node_a to node_b, or adds 1 to the weight of an existing edge between them.

Since my nodes are identified by a unique string id, I use dictionaries for storage, assuming that checking whether a key exists and inserting both take O(1) on average.

If I comment out create_edge, I can process about 20,000 records per second; with create_edge in my pipeline, it drops to about 20 records per second.
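To see where throughput drops as the dictionary grows, one option is to time fixed-size batches of records. The sketch below is only an illustration: time_batches, add_edge, and the sample data are hypothetical names and values, not part of the original pipeline.

```python
import time

def time_batches(records, handle_record, batch_size=100):
    """Call handle_record on every record and time each full batch."""
    timings = []  # list of (records_processed_so_far, seconds_for_this_batch)
    start = time.perf_counter()
    for count, record in enumerate(records, start=1):
        handle_record(record)
        if count % batch_size == 0:
            now = time.perf_counter()
            timings.append((count, now - start))
            start = now
    return timings

# Hypothetical usage: each record is an (a, b) pair that becomes one edge.
graph = {}

def add_edge(record):
    a, b = record
    neighbors = graph.setdefault(a, {})      # create the inner dict on first use
    neighbors[b] = neighbors.get(b, 0) + 1   # increment the edge weight

sample = [(str(i % 50), str(i % 7)) for i in range(1000)]
timings = time_batches(sample, add_edge)
for count, seconds in timings:
    print(count, "records,", round(seconds * 1000, 3), "ms for the last batch")
```

If the per-batch time keeps climbing with the record count, the per-record work is not constant-time.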

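The first 100 records take about 500 ms to process. When the dictionary grows to about 10,000 keys, processing 100 records takes about 15,000 ms. Each record calls create_edge about 4 times on average, so 400 calls to create_edge take 15 seconds at a dictionary size of 10,000.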

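First, do these run times make sense? They seem too long to me, but correct me if I am wrong.

Second, any suggestions for using the dictionary more efficiently would be appreciated.

I expect the dictionary to reach around 100,000 keys by the time all 1,000,000 records are processed.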


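Edit, conclusions:

You were right; I made two noob mistakes here. :)

Calling keys() increases the complexity dramatically: every edge insertion goes from constant time to time that grows with the size of the dictionary. Replacing if node in graph.keys() with if node in graph gives a constant processing time of ~300 ms per 100 records.

The second mistake was my virtualenv config: I thought I was running python3, but it was actually python2.

python3 does make the keys() membership test a constant-time lookup, so the slowdown only shows up under python2.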


Run-time comparison after removing the keys() call:

# graph = {}
python version: 3.6.3
start time 11:44:56
Number of records: 1029493
graph created, 1231630 nodes
end time 11:50:35
total ~05:39

# graph = defaultdict(lambda : defaultdict(int))
python version: 3.6.3
start time 11:54:52
Number of records: 1029493
graph created, 1231630 nodes
end time  12:00:34
total ~05:42

# graph = {}
python version: 2.7.10
start time 12:03:25
Number of records: 1029493
graph created, 1231630 nodes
end time 12:09:40
total ~06:15

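To test whether a key is in a dict, use key in d, not key in d.keys(). The membership test works directly on the dict.

So your function becomes: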

def create_edge(node_a, node_b, graph):
    if node_a not in graph:
        graph[node_a] = {node_b: 1}
    elif node_b in graph[node_a]:
        graph[node_a][node_b] += 1
    else:
        graph[node_a][node_b] = 1

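In other words, the keys() calls are unnecessary here.

Note that keys() is far more expensive in Python 2 than in Python 3, because in Python 2 keys() builds a list of all the keys, and a membership test on that list is a linear scan. In Python 3 it returns a view and the test is a hash lookup, but even in Python 3 there is no reason to call keys() for a membership test.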
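As a rough illustration of why the keys() call hurts under Python 2, the sketch below runs on Python 3 and emulates Python 2's list-returning keys() with list(d). The dictionary size, the missing key, and the repeat count are arbitrary values chosen for the example, and absolute timings will vary by machine.

```python
from collections.abc import KeysView
from timeit import timeit

d = {str(i): i for i in range(10_000)}
missing = "not-a-key"

# In Python 3, keys() returns a view object, not a list.
print(isinstance(d.keys(), KeysView))

# Membership on the dict itself: a single hash lookup.
t_dict = timeit(lambda: missing in d, number=200)

# Membership on the Python 3 keys view: also a hash lookup.
t_view = timeit(lambda: missing in d.keys(), number=200)

# Emulation of Python 2's keys(): build a list of all keys, then scan it.
t_list = timeit(lambda: missing in list(d), number=200)

print("in d:        ", t_dict)
print("in d.keys(): ", t_view)
print("in list(d):  ", t_list)
```

The list-based test does work proportional to the number of keys on every call, which is what makes each create_edge call slower as the graph grows.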


You can use a defaultdict of defaultdict(int); see the related question: Python: defaultdict of defaultdict?

from collections import defaultdict

graph = defaultdict(lambda : defaultdict(int))

graph['a']['b'] += 1
graph['a']['b'] += 1
graph['a']['c'] += 1

graph

Output:

defaultdict(<function __main__.<lambda>>,
            {'a': defaultdict(int, {'b': 2, 'c': 1})})
# equal to: {'a': {'b': 2, 'c': 1}}
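With this structure, create_edge can shrink to a single line. The sketch below is one possible way to wire it up; make_graph and the conversion to plain dicts are illustrative additions, not part of the answer. Judging by the asker's timings above, this is mainly a readability gain rather than a speed-up.

```python
from collections import defaultdict

def make_graph():
    # Outer dict maps node -> inner dict; inner dict maps neighbor -> weight.
    return defaultdict(lambda: defaultdict(int))

def create_edge(node_a, node_b, graph):
    # Missing nodes and missing edges are created on first access with weight 0.
    graph[node_a][node_b] += 1

graph = make_graph()
for a, b in [("a", "b"), ("a", "b"), ("a", "c")]:
    create_edge(a, b, graph)

# Caveat: indexing a defaultdict with a missing key inserts that key,
# so use `in` (which does not insert) for pure existence checks.
has_z = "z" in graph

# Convert to plain dicts once building is done, e.g. for printing or pickling.
plain = {node: dict(neighbors) for node, neighbors in graph.items()}
print(plain)
```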

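Counting the duplicate edges first and then building the graph can be faster than updating the weights one edge at a time. See also @Stefan Pochmann's test: ideone.com/ckF0X5

Tested with Python 3.6.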

from timeit import timeit
from collections import defaultdict, Counter
from random import shuffle
from itertools import product

def f():   # OP method modified with Tom Karzes' answer above.
    d = {}
    for i, j in edges:
        if i not in d:
            d[i] = {j: 1}
        elif j in d[i]:
            d[i][j] += 1
        else:
            d[i][j] = 1

def count_first(): 
    d = dict()
    for (v, w), c in Counter(edges).items():
        if v not in d:
            d[v] = {w: c}
        else:
            d[v][w] = c
    # Alternatively (thanks to Jean-François Fabre for pointing this out):
    # d = defaultdict(lambda : defaultdict(int)) 
    # for (v, w), c in Counter(edges).items(): 
    #     d[v][w] = c

edges = list(product(range(300), repeat=2)) * 10
shuffle(edges)

# %timeit f()
270 ms ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# %timeit count_first()
180 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

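Disclaimer: count_first() may not come out ahead everywhere. On ideone.com, or on the OP's actual data, its timing relative to f() may differ from the results above.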
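Before comparing speed, it is worth checking that both approaches build the same graph. The sketch below restates them as functions that return their result. The names build_incremental and build_count_first are hypothetical, and the second function uses defaultdict(dict), a small variation on the commented-out alternative above.

```python
from collections import Counter, defaultdict
from itertools import product
from random import shuffle

def build_incremental(edges):
    # Update the weight one edge at a time, as in f().
    graph = {}
    for v, w in edges:
        if v not in graph:
            graph[v] = {w: 1}
        elif w in graph[v]:
            graph[v][w] += 1
        else:
            graph[v][w] = 1
    return graph

def build_count_first(edges):
    # Count identical (v, w) pairs first, then store each weight once.
    graph = defaultdict(dict)
    for (v, w), count in Counter(edges).items():
        graph[v][w] = count
    return dict(graph)  # plain outer dict; inner values are already plain dicts

# Small test input: every pair over 5 nodes, repeated 3 times, shuffled.
edges = list(product(range(5), repeat=2)) * 3
shuffle(edges)

same = build_incremental(edges) == build_count_first(edges)
print(same)
```

Note that count_first needs all edges available up front, while the incremental version also works when edges arrive one at a time from a stream.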

The timings above were measured on Python 3; Python 2 may behave differently.


Source: https://habr.com/ru/post/1691572/

