Storing a 7 million key Python dictionary in a database

I need to process a dictionary with 7 million keys (the number of keys may ultimately grow to 50 million). Since I do not have enough RAM to keep it in memory, I decided to store it on disk.

My dictionary is as follows:

dictionary={(int1,int2):int3,...} 

At first I tried to save it in an sqlite database using sqlite3. The time required to store it is perfectly acceptable (about 70 seconds). Measured with timeit:

    >>> import sqlite3
    >>> conn = sqlite3.connect('test_sqlite.sqlite')
    >>> c = conn.cursor()
    >>> c.execute('create table test (int1 int, int2 int, int3 int)')
    >>> conn.commit()
    >>> conn.close()
    >>> import timeit
    >>> timeit.timeit('c.executemany("insert into test values (?,?,?)",((key[0],key[1],dictionary[key]) for key in dictionary.iterkeys()))',setup='import sqlite3;conn=sqlite3.connect("test_sqlite.sqlite");c=conn.cursor();dictionary={(i,i+1):i+2 for i in xrange(7000000)}',number=1)
    70.7033872604
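Spelled out as a plain script, the insert step looks like this (a sketch using the same synthetic 7-million-entry dictionary as the setup string above, with iteritems() instead of iterkeys() for brevity):

    import sqlite3

    conn = sqlite3.connect('test_sqlite.sqlite')
    c = conn.cursor()
    c.execute('create table test (int1 int, int2 int, int3 int)')

    # Same synthetic data as in the timeit setup string.
    dictionary = {(i, i + 1): i + 2 for i in xrange(7000000)}

    # executemany() consumes the generator row by row, so the 7 million
    # rows are streamed into the table without building an extra list.
    c.executemany('insert into test values (?,?,?)',
                  ((k[0], k[1], v) for k, v in dictionary.iteritems()))
    conn.commit()
    conn.close()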

But then I need to use this saved table to retrieve specific values, and each SELECT seems to take about 1.5 seconds. Since I need to access around one million values, this is discouraging:

    >>> timeit.timeit('c.execute("select int1 from test where int2=={}".format(value)).fetchone()[0]',setup='import sqlite3;conn=sqlite3.connect("test_sqlite.sqlite");c=conn.cursor();value=5555',number=1)
    1.5300869941711426
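As an aside, the lookup can use a bound parameter instead of str.format() (a sketch continuing the session above; this is the idiomatic form, though the real speed problem is the missing index, as the answer below points out):

    value = 5555
    # '?' binds the value instead of splicing it into the SQL string.
    # Without an index on int2 this still scans the whole table.
    row = c.execute('select int1 from test where int2=?', (value,)).fetchone()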

Then I tried storing my dictionary with shelve instead. The time it takes to fetch a value from the shelved dictionary is quite good:

    >>> timeit.timeit('a=f[key]',setup='import shelve;f=shelve.open("test_timeit","r");key="1000"',number=10000)
    0.320019006729126
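At roughly 32 microseconds per lookup (0.32 s for 10,000), a batch of a million lookups would look something like this (a sketch, assuming the queried keys follow the same "(i,i+1)" string pattern used when storing):

    import shelve

    f = shelve.open('test_timeit', 'r')
    # About 32 microseconds per lookup at the measured rate, so a million
    # lookups comes in near the hundred-second estimate below.
    results = [f['({},{})'.format(i, i + 1)] for i in xrange(1000000)]
    f.close()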

So even though I will be making several million lookups like this, the total time should be around one hundred seconds.

But a new problem arises: the time required to store my dictionary with shelve is not satisfactory.

    >>> timeit.timeit('f.update(dictio)',setup='import shelve;f=shelve.open("test_timeit","c");dictio={"({},{})".format(i,i+1):i+2 for i in xrange(7000000)}',number=1)
    504.728841782
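One knob that might reduce this (untested here, so treat it as an assumption to verify): shelve defaults to pickle protocol 0, and passing a binary protocol can make pickling the stored values cheaper:

    import shelve

    # protocol=2 selects a binary pickle protocol for the stored values;
    # how much it helps for plain ints is an assumption worth timing.
    f = shelve.open('test_timeit', 'c', protocol=2)
    f.update(dictio)  # dictio: the string-keyed dictionary from the setup above
    f.close()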

To that you must add the extra time needed to convert the original keys (which are tuples) into strings, since shelve keys must be strings. Using repr:

    >>> timeit.timeit('repr.repr((1,2))',setup='import repr',number=7000000)
    61.6035461426
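Put together, the conversion plus store step looks like this (a sketch; shelve in Python 2 only accepts string keys, hence the rendering step, and `dictionary` is the original tuple-keyed dict):

    import shelve

    # Each (int1, int2) tuple key is rendered to a string once before
    # storing; this rendering is the ~60 s measured above.
    dictio = {'({},{})'.format(k[0], k[1]): v
              for k, v in dictionary.iteritems()}

    f = shelve.open('test_timeit', 'c')
    f.update(dictio)  # the ~500 s measured above
    f.close()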

That is 566.332387924 seconds in total to store the dictionary with shelve...

I do not want to pickle the dictionary, as that implies loading the entire dictionary into memory whenever I want to use it later.

Is there a way to improve one of these two methods in order to have better access time / load time?

Thank you for your help!

1 answer

For queries on large tables like this to return quickly, you need to index the relevant columns. In your case, I would add them as a composite primary key:

    create table test (
        int1 integer,
        int2 integer,
        int3 integer,
        primary key (int1, int2)
    )
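If the table has already been created without the primary key, an equivalent composite index can be added after the fact (a sketch; with it, sqlite answers a point query on (int1, int2) from the index instead of scanning all 7 million rows):

    import sqlite3

    conn = sqlite3.connect('test_sqlite.sqlite')
    c = conn.cursor()
    # A unique composite index gives the same lookup behaviour as the
    # primary key above, without recreating and reloading the table.
    c.execute('create unique index if not exists idx_test on test (int1, int2)')
    conn.commit()

    # Point query on the indexed columns; this should now come back via
    # the index rather than a full table scan.
    print c.execute('select int3 from test where int1=? and int2=?',
                    (5555, 5556)).fetchone()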

Source: https://habr.com/ru/post/1494510/

