I need to process a dictionary with 7 million keys (the number of keys may eventually grow to about 50 million). Since I have enough RAM, I decided to keep it in memory.
My dictionary is as follows:
dictionary={(int1,int2):int3,...}
At first I tried to save it in an SQLite database using sqlite3. The time required to store it is perfectly acceptable (about 70 seconds). Using timeit:
>>> import sqlite3
>>> conn = sqlite3.connect('test_sqlite.sqlite')
>>> c = conn.cursor()
>>> c.execute('create table test (int1 int, int2 int, int3 int)')
>>> conn.commit()
>>> conn.close()
>>> import timeit
>>> timeit.timeit('c.executemany("insert into test values (?,?,?)",((key[0],key[1],dictionary[key]) for key in dictionary.iterkeys()))', setup='import sqlite3;conn=sqlite3.connect("test_sqlite.sqlite");c=conn.cursor();dictionary={(i,i+1):i+2 for i in xrange(7000000)}', number=1)
70.7033872604
But I then need to use this saved dictionary to retrieve specific values, and each SELECT takes about 1.5 seconds. Since I have to access about one million values, this is discouraging:
>>> timeit.timeit('c.execute("select int1 from test where int2=={}".format(value)).fetchone()[0]', setup='import sqlite3;conn=sqlite3.connect("test_sqlite.sqlite");c=conn.cursor();value=5555', number=1)
1.5300869941711426
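I suspect the SELECT is slow because there is no index on the column being searched, so every lookup scans the whole table. Here is a small, self-contained sketch of what I mean (the index name, the in-memory database, and the row count are placeholders, not my real setup):

```python
import sqlite3

# In-memory database for the sketch; the real one lives on disk.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("create table test (int1 int, int2 int, int3 int)")
c.executemany("insert into test values (?,?,?)",
              ((i, i + 1, i + 2) for i in range(10000)))
# Without an index every SELECT scans all rows; with one, each lookup
# becomes a B-tree search.
c.execute("create index idx_int2 on test (int2)")
conn.commit()

# Parameterized query: also avoids rebuilding the SQL string per lookup.
row = c.execute("select int1 from test where int2=?", (5555,)).fetchone()
print(row[0])  # prints 5554, since rows were inserted as (i, i+1, i+2)
```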
Then I tried storing the dictionary with shelve instead. Now the time it takes to fetch a value from the shelved dictionary is quite good:
>>> timeit.timeit('a=f[key]', setup='import shelve;f=shelve.open("test_timeit","r");key="1000"', number=10000)
0.320019006729126
So, even though I will make several million requests like this, the total time should be around one hundred seconds.
But a new problem arose: the time required to store my dictionary with shelve is not satisfactory.
>>> timeit.timeit('f.update(dictio)', setup='import shelve;f=shelve.open("test_timeit","c");dictio={"({},{})".format(i,i+1):i+2 for i in xrange(7000000)}', number=1)
504.728841782
To this you must add the extra time needed to convert the old keys (which are tuples) into strings, since shelve keys must be strings. Using repr:
>>> timeit.timeit('repr.repr((1,2))', setup='import repr', number=7000000)
61.6035461426
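If the repr overhead matters, formatting the tuple directly might be an alternative; a sketch (the `"%d,%d"` encoding is an arbitrary choice of mine, any reversible encoding would do):

```python
# Build the string key directly instead of going through repr.repr.
key = (1, 2)
s = "%d,%d" % key                           # "1,2"
back = tuple(int(x) for x in s.split(","))  # recover the original tuple key
print(s, back == key)                       # prints: 1,2 True
```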
That makes a total of 566.332387924 seconds to store the dictionary with shelve...
I do not want to pickle the dictionary, as that would force me to load the entire dictionary back into memory whenever I want to use it later.
Is there a way to improve either of these two methods to get better access times and load times?
Thank you for your help!