I hit serious slowdowns when working with dictionaries once a dictionary grows to several thousand keys.
I am processing a file with ~1,000,000 rows of data and building a graph-like data structure out of dictionaries.
here is my bottleneck function:
def create_edge(node_a, node_b, graph):
    if node_a not in graph.keys():
        graph[node_a] = {node_b: 1}
    elif node_b in graph[node_a].keys():
        graph[node_a][node_b] += 1
    else:
        graph[node_a][node_b] = 1
create_edge will create an edge from node_a to node_b, or add 1 to the weight of an existing edge between them.
Since my nodes are identified by unique string identifiers, I use dictionaries for storage, assuming that checking whether a key exists and inserting a key both take O(1) on average.
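To illustrate the intended layout (a minimal sketch using the create_edge above; the node names are made up): the graph is a dict of dicts, mapping node id -> neighbour id -> edge weight.

    graph = {}
    create_edge("a", "b", graph)  # creates edge a -> b with weight 1
    create_edge("a", "b", graph)  # bumps the weight of a -> b to 2
    create_edge("a", "c", graph)  # creates edge a -> c with weight 1
    print(graph)  # {'a': {'b': 2, 'c': 1}}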
If I comment out create_edge, I can process about 20,000 records per second; with create_edge in the pipeline, it drops to about 20 records per second.
Processing the first 100 records takes about 500 ms. The slowdown becomes dramatic as the dictionary approaches 10,000 keys: processing 100 records then takes about 15,000 ms. create_edge is called roughly 4 times per record, so 400 create_edge calls take about 15 seconds once the dictionary holds 10,000 keys.
First of all: why is this happening, and what am I doing wrong? Secondly, is there a data structure better suited to this task? I need to process between 100,000 and 1,000,000 records.
UPDATE:
Thanks, this turned out to be a noob mistake. :)
The culprit was keys(). After replacing (in both places) if node in graph.keys() with if node in graph, processing 100 records takes a steady ~300 ms no matter how large the dictionary gets.
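For completeness, the corrected function is the same logic with the membership tests done directly against the dicts:

    def create_edge(node_a, node_b, graph):
        # "in" on a dict is an O(1) hash lookup on both python2 and python3
        if node_a not in graph:
            graph[node_a] = {node_b: 1}
        elif node_b in graph[node_a]:
            graph[node_a][node_b] += 1
        else:
            graph[node_a][node_b] = 1

(A collections.defaultdict-based variant could drop the membership tests entirely, but the two-line change above was the whole fix.)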
It also turned out that, because of my virtualenv config, I thought I was running python3 when I was actually on python2. In python3, keys() returns a view object, so a membership test against it is still an O(1) hash lookup and the extra call barely matters. In python2, keys() builds a new list on every call, so if node in graph.keys() degrades to an O(n) scan.
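A rough micro-benchmark of the difference, runnable on python3 (a sketch: the 10,000-key dict is an arbitrary size, and list(d) stands in for the list that python2's keys() built on every call):

    import timeit

    d = {str(i): i for i in range(10000)}

    # O(1): hash lookup against the dict itself; python3's keys() view
    # behaves the same way.
    print(timeit.timeit("'9999' in d", globals=globals(), number=10000))

    # O(n): membership test against a freshly built list, which is what
    # python2's d.keys() handed back on every call.
    print(timeit.timeit("'9999' in list(d)", globals=globals(), number=10000))

For comparison, here are full pipeline runs (two on python3, one on python2):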
python version: 3.6.3
start time 11:44:56
Number of records: 1029493
graph created, 1231630 nodes
end time 11:50:35
total ~05:39

python version: 3.6.3
start time 11:54:52
Number of records: 1029493
graph created, 1231630 nodes
end time 12:00:34
total ~05:42

python version: 2.7.10
start time 12:03:25
Number of records: 1029493
graph created, 1231630 nodes
end time 12:09:40
total ~06:15