A memory-efficient string-to-string map in Python (or C)

I need a memory-efficient data structure for storing about a million key-value pairs, where the keys are strings of about 80 bytes and the values are strings of about 200 bytes, so the total key and value size is about 280 MB. I also need fast lookup of a value by its key, preferably via a hash map. The memory overhead should be as small as possible: for 280 MB of useful data, the data structure should use no more than 300 MB of virtual memory (including malloc() overhead and everything else). The usage pattern is: start with an empty data structure and fill it gradually, never changing keys and never changing the length of values. As a plus, the data structure may support changing the length of a value, at the cost of 100% value overhead (meaning that for a value of x bytes, x bytes may be wasted temporarily in unused buffer space).

I need a pure Python module, a built-in Python module, or a C implementation, preferably with (C)Python bindings. Ideally it should be possible to serialize the entire data structure to disk and read it back quickly.

To prove that such small overhead is possible, I created a simple design with open addressing: a hash table of 1.25 million slots holding 4-byte pointers into 1 MB data blocks, the blocks containing the key and value lengths as base-128 varints. This design has an important limitation: it does not allow deleting or changing pairs without wasting their memory region. According to my calculations, with 1 million key-value pairs of 280 bytes each, the overhead is less than 3.6% (10,080,000 bytes). The limits above are more generous: they allow 20 million bytes of overhead.
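To make the block layout concrete, the base-128 varint encoding mentioned above can be sketched like this (my own Python illustration, not code from the project):

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint:
    7 bits per byte, least-significant group first, high bit set
    on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos=0):
    """Decode one varint from buf starting at pos.
    Returns (value, position_after_varint)."""
    buf = bytearray(buf)  # so indexing yields ints on Python 2 and 3
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7
```

Lengths up to 127 bytes (such as the 80-byte keys) fit in one varint byte, and the 200-byte values need two, which is where the small per-pair length overhead in the calculation above comes from.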

I just found http://www.pytables.org/ , which promises fast access and memory-efficient packing of data. I'll have to examine it more closely to check whether it suits my needs.

+3
10 answers

Since I could not find any existing solution that packs memory tightly enough, I decided to implement it in C myself. See the open-addressing design described in the question.

0

Martijn mentioned this in a comment (not as an answer), but I agree with him: use SQLite. You should give it a try and see whether it meets your needs.

+6

If you don't plan on doing a lot of deletes, then this isn't that hard. Deletes lead to fragmentation.

You also need to commit to a fixed-length key. You mentioned 80 bytes. Are your keys allowed to duplicate? If not, it's even easier.

So, here is what you do.

You create an array of:

struct {
    char value[80];
    char *data;
} key;

And you keep this array sorted.

If your keys can duplicate, then you need:

struct link {
    char *data;
    struct link *next;
};

struct {
    char value[80];
    struct link *data;
} key;

(My C is rusty, but that's the gist of it.) The latter has each key pointing to a linked list of values.

Then a lookup is a simple binary search. The "pain" is in maintaining the array when inserting/deleting keys. It is not as painful as it sounds, and it saves a lot of memory, especially on 64-bit systems.

What you want to reduce is the number of pointers. Pointers are expensive when you have lots of structures full of pointers. On a 64-bit system, a pointer is 8 bytes, so for every pointer per entry, there go 8 MB of your memory budget.

So the expense is in growing the array, copying and compacting memory (if you "know" you will have a million rows and can commit to that, then malloc(1000000 * sizeof(key)) right away; it will save you some copying during expansion).

But don't be afraid: once it is up and running, performance is quite good. Modern cpus are actually pretty good at copying 100M blocks of memory around.
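A rough Python sketch of the same sorted-array idea (my own illustration; the C version would of course store the fixed-size structs above directly):

```python
import bisect

class SortedArrayMap(object):
    """Keys kept sorted in one list, values in a parallel list.
    Lookup is a binary search; insert shifts elements, just as the
    C array would with memmove."""

    def __init__(self):
        self.keys = []
        self.values = []

    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value          # replace existing entry
        else:
            self.keys.insert(i, key)        # O(n) shift on insert
            self.values.insert(i, value)

    def lookup(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        raise KeyError(key)
```

Two flat arrays and no per-entry nodes keep the pointer count down, which is the whole point of the layout described above.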

Just as an aside, I recently did something much like this in Java. On a 64-bit JVM, a Map with 25 million entries takes about 2 GB of RAM; my solution (using techniques similar to the above) came in at around 600 MB. Java uses more pointers than C, but the premise is the same.

+6

For what it's worth, I tried the straightforward python approach.

I created a plain python dict with 1 million key-value pairs, keys 80 characters long and values 200 characters long. On my machine it took 360,844 KB, which is over your 300 MB limit, though perhaps not by enough to disqualify it.

You also asked about a C API. If a dict turns out to be too heavy, you could implement the store in C and expose it to Python through C bindings, but that is considerably more work than using what Python already ships with.

For getting it to and from disk, look at cPickle. It is fast, mature, and part of the standard library. Saving the dict is a one-liner:

cPickle.dump(mydict, open("myfile.pkl", "wb"))

and loading it back is another:

mydict = cPickle.load(open("myfile.pkl", "rb"))

Another option is shelve, also part of the standard python library. It gives you a persistent, dict-like object backed by a file on disk (rather than memory). Its performance is not great, though.
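A minimal shelve sketch (the file name and location are arbitrary; tempfile is used only to keep the example self-contained):

```python
import os
import shelve
import tempfile

# shelve gives a persistent, dict-like object backed by a file on disk
path = os.path.join(tempfile.mkdtemp(), "pairs.db")

db = shelve.open(path)
db["some-80-byte-key"] = "some-200-byte-value"   # keys must be str
db.close()                                       # flushes to disk

# reopening the shelf sees the same data
db = shelve.open(path)
restored = db["some-80-byte-key"]
db.close()
```

Because entries live on disk rather than in RAM, shelve sidesteps the memory budget entirely, at the price of slower access per lookup.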

+6

Have you tried just using a plain dict? Most of your data is in the strings themselves, so the overhead might fit within your requirements.

+5

You could use the sha1 of the key instead of the key itself: a 20-byte digest replaces an 80-byte key. If the keys are unique, the sha1 digests of the keys almost certainly are too. That saving may squeeze you in just under your limit.

from random import choice
from string import letters
from hashlib import sha1

def keygen(length):
    return "".join(choice(letters) for _ in xrange(length))

def gentestdata(n=1000*1000):
    # return dict((sha1(keygen(80)).digest(), keygen(200)) for _ in xrange(n))
    d = {}
    for _ in xrange(n):
        key = sha1(keygen(80)).digest()
        assert key not in d
        value = keygen(200)
        d[key] = value
    return d

if __name__ == '__main__':
    d = gentestdata()

On my ubuntu box, this tops out at about 304 MB of memory:

2010-10-26 14:26:02 hbrown@hbrown-ubuntu-wks:~$ ps aux | grep python
[...]
hbrown   12082 78.2  7.5 307420 303128 pts/1   S+   14:20   4:47 python

Close enough? The hashing is done in C, but hashlib is part of the standard python library.


Clarification: if the values are compressible, you can squeeze in more data by storing them gzip-compressed. It costs CPU time on each access, of course.
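For example (my own sketch, combining the sha1-digest keys with compressed values; zlib, which underlies the gzip format, is in the standard library):

```python
import zlib
from hashlib import sha1

store = {}

def put(key, value):
    # store the 20-byte sha1 digest instead of the 80-byte key,
    # and a zlib-compressed copy of the value
    store[sha1(key).digest()] = zlib.compress(value, 9)

def get(key):
    return zlib.decompress(store[sha1(key).digest()])
```

How much this buys depends entirely on how compressible the real values are; random 200-byte strings like the test data above will barely shrink.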

+2

Using SQLite is a good idea. A quick implementation will tell you, with very little effort, whether it is fast and small enough.


If you determine that you have to roll your own, I'd recommend the following:

How well can you predict the number of pairs, or an upper limit for that?
How well can you predict the total data size, or an upper limit for that?

Use an arena allocator for the strings: hand them out sequentially from large blocks. (Yes, you give up freeing individual strings, but you also shed nearly all per-allocation overhead.)

For the index, prefer an open-addressing hash table over a node-based one, so there are no per-entry link pointers.

If key cmp/copy operations turn out to dominate, it can pay to store alongside each table slot:

  • the (cached) hash value of each key, so most probes are rejected at near-zero cost
  • a short "prefix" of the key, so mismatches are discarded without following the data pointer.

Watch the alignment, too. If your platform allows unaligned access, or you can pack the structs, you save padding bytes. The "roll-your-own" route means doing all of this in C; do you really want that? Is it worth the effort? Measure before you decide.


Since you never change keys and never change value lengths, both can be stored as length-prefixed byte strings packed back to back in the arena. (If you later do need resizable values, the 100% overhead you allow means each value buffer can simply be allocated at twice its size and grown in place; the slack sits unused in the buffer, exactly as you describe.)

In any case, do the arithmetic before writing code. With a million pairs, every byte of per-pair bookkeeping costs you a megabyte, so you can predict the total overhead of a layout like this up front and check it against your 20 MB budget.
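The arena idea is easy to prototype even in Python before committing to C. A hypothetical sketch: one bytearray holds all values back to back, and a dict maps each key to an (offset, length) pair:

```python
class ArenaStore(object):
    """Values packed back to back in one bytearray (the "arena");
    the index maps key -> (offset, length).  There is no per-value
    allocation overhead, but deleted values are never reclaimed,
    the same limitation as the design in the question."""

    def __init__(self):
        self.arena = bytearray()
        self.index = {}

    def put(self, key, value):
        self.index[key] = (len(self.arena), len(value))
        self.arena += value

    def get(self, key):
        offset, length = self.index[key]
        return bytes(self.arena[offset:offset + length])
```

The dict itself still carries Python's per-entry overhead, so this prototype only demonstrates the layout; the memory win comes when the index, too, is replaced by a C open-addressing table.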


(By "strings" I mean plain byte strings here, not UTF-8/Unicode strings.)

+2

Apache Portable Runtime (aka APR) has a C-based hash table: http://apr.apache.org/docs/apr/0.9/group_apr_hash.html

With apr_hash_t, keys and values are stored as void*, so you keep full control over how the data itself is laid out; you could, for example, point the entries at records packed into large blocks. I have not measured its per-entry overhead at your scale, though.

+1

Source: https://habr.com/ru/post/1771551/

