Loading a pickled Python object uses a huge amount of memory

I have a Python object pickled to a 180 MB file. When I unpickle it, memory usage explodes to 2 or 3 GB. Have you had a similar experience? Is this normal?

The object is a trie: a tree of dictionaries where each edge is a letter and each node is a potential word. So to store a word you need as many edges as the word has letters. The first level has at most 26 nodes, the second at most 26^2, the third at most 26^3, and so on. For each node that is a word, I have an attribute pointing to information about the word (verb, noun, definition, etc.).

Words are at most 40 characters long, and I have about half a million entries. Everything goes well until I pickle the tree (using a simple cPickle dump): it yields a 180 MB file. I am on Mac OS X, and when I unpickle those 180 MB, the OS allocates 2 or 3 GB of memory / virtual memory to the Python process :(
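For reference, the dump and load are nothing fancy (a minimal sketch of what I run; the variable and file names are placeholders):

import cPickle as pickle  # Python 2: cPickle is the fast C implementation

# Dump: -1 selects the highest (binary) pickle protocol, for a smaller file.
with open('trie.pik', 'wb') as f:
    pickle.dump(trie, f, -1)

# Load: this is the step where memory explodes to 2-3 GB.
with open('trie.pik', 'rb') as f:
    trie = pickle.load(f)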

I don't see any recursive references in this tree: edges lead to nodes, which hold arrays of arrays. There are no cycles.

I'm a bit stuck: loading these 180 MB takes about 20 seconds (not to mention the memory issue). My processor is admittedly not very fast: a 1.3 GHz Core i5. But my drive is an SSD, and I have only 4 GB of RAM.

To add these 500,000 words to my tree, I read about 7,000 files, each containing about 100 words. While reading them, the memory allocated by Mac OS climbs up to 15 GB, mostly virtual memory :( I used the with statement to make sure every file gets closed, but it didn't really help. Reading a file takes about 0.2 sec per 40 KB, which seems rather long to me, since adding the words to the tree is much faster (0.002 sec).
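The reading loop looks roughly like this (a sketch; the directory name and the line format are placeholders):

import os

root = Trie()
for name in os.listdir('word_files'):                  # ~7,000 files
    with open(os.path.join('word_files', name)) as f:  # closed on exit
        for line in f:                                 # ~100 words each
            word, info = line.rstrip('\n').split('\t', 1)  # assumed format
            root.add(word, info)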

In the end I wanted to build a database of objects, but maybe Python is just not suited to this. Maybe I'll move to MongoDB :(

class Trie():
    """ Class to store known entities / words / verbs... """
    longest_word = -1   # class-wide: length of the longest word seen
    nb_entree = 0       # class-wide: number of entries stored

    def __init__(self):
        self.children = {}    # one dict per node: letter -> child Trie
        self.isWord = False
        self.infos = []

    def add(self, orthographe, entree):
        """ Store a string with the given type and definition
        in the Trie structure. """
        if len(orthographe) > Trie.longest_word:
            Trie.longest_word = len(orthographe)
        if len(orthographe) == 0:
            self.isWord = True
            self.infos.append(entree)
            Trie.nb_entree += 1
            return True
        car = orthographe[0]
        if car not in self.children:   # no need for .keys() here
            self.children[car] = Trie()
        self.children[car].add(orthographe[1:], entree)
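For example, adding two words (the entree payloads here are made up for illustration):

trie = Trie()
trie.add(u'chat', ('nom', 'petit felin domestique'))   # made-up entry
trie.add(u'chats', ('nom', 'pluriel de chat'))
print Trie.nb_entree      # 2
print Trie.longest_word   # 5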
2 answers

Python objects, especially on a 64-bit machine, are very large. When pickled, an object gets a very compact representation that is suitable for storage in a file on disk. Here is an example of a disassembled pickle:

>>> import pickle, pickletools
>>> pickle.dumps({'x':'y','z':{'x':'y'}}, -1)
'\x80\x02}q\x00(U\x01xq\x01U\x01yq\x02U\x01zq\x03}q\x04h\x01h\x02su.'
>>> pickletools.dis(_)
    0: \x80 PROTO      2
    2: }    EMPTY_DICT
    3: q    BINPUT     0
    5: (    MARK
    6: U        SHORT_BINSTRING 'x'
    9: q        BINPUT          1
   11: U        SHORT_BINSTRING 'y'
   14: q        BINPUT          2
   16: U        SHORT_BINSTRING 'z'
   19: q        BINPUT          3
   21: }        EMPTY_DICT
   22: q        BINPUT          4
   24: h        BINGET          1
   26: h        BINGET          2
   28: s        SETITEM
   29: u        SETITEMS   (MARK at 5)
   30: .    STOP

As you can see, it is very compact; nothing is repeated when repetition can be avoided.

However, in memory an object consists of a fairly significant number of pointers. Let's ask Python how big an empty dictionary is (on a 64-bit machine):

>>> {}.__sizeof__()
248

Wow! 248 bytes for an empty dictionary! Note that a dictionary comes pre-allocated with space for up to eight elements, so you pay the same memory cost even if the dictionary holds just one element.

A class instance holds one dictionary for its instance variables, and your trie nodes each hold an additional dictionary for the children. So each instance costs about 500 bytes. With an estimated 2-4 million Trie objects, you can easily see where your memory usage comes from.
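You can check this arithmetic yourself (a quick sketch; exact numbers vary with the Python version and build):

t = Trie()
# Per-node cost: the instance __dict__ plus the separate children dict.
print t.__dict__.__sizeof__()   # ~248 on 64-bit CPython 2.7
print t.children.__sizeof__()   # ~248 again
# ...plus the object header and the infos list, which is how you
# arrive at roughly 500 bytes per node.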


You can mitigate this somewhat by adding __slots__ to your Trie to eliminate the instance dictionary. You would probably save about 750 MB doing this (my guess). It does prevent you from adding further attributes to Trie dynamically, but that is probably not a big problem.
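A sketch of that change (note that in Python 2, __slots__ only takes effect on a new-style class, so Trie must inherit from object):

class Trie(object):                # new-style class, required for __slots__
    __slots__ = ('children', 'isWord', 'infos')
    longest_word = -1              # class attributes still live on the class
    nb_entree = 0

    def __init__(self):
        self.children = {}         # each node still pays for this dict
        self.isWord = False
        self.infos = []

Each node still needs its children dictionary, so this removes only one of the two dictionaries per node.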


Do you really need to load or dump the whole thing at once? If you don't need all of it in memory, but only selected parts at any given time, you may want to map your dictionary to a set of files on disk instead of a single file, or map the dict to a database table. So if you are looking for something that can save large dictionaries of data to disk or to a database, and can utilize pickling and encoding (codecs and hashmaps), you might want to look at klepto.

klepto provides a dictionary abstraction for writing to a database, including treating your filesystem as a database (i.e. writing the entire dictionary to a single file, or writing each entry to its own file). For big data, I often choose to represent the dictionary as a directory on my filesystem, with each entry being its own file. klepto also offers caching algorithms, so if you use a filesystem backend for the dictionary, you can avoid some speed penalty by utilizing memory caching.

>>> from klepto.archives import dir_archive
>>> d = {'a':1, 'b':2, 'c':map, 'd':None}
>>> # map a dict to a filesystem directory
>>> demo = dir_archive('demo', d, serialized=True)
>>> demo['a']
1
>>> demo['c']
<built-in function map>
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> # is set to cache to memory, so use 'dump' to dump to the filesystem
>>> demo.dump()
>>> del demo
>>>
>>> demo = dir_archive('demo', {}, serialized=True)
>>> demo
dir_archive('demo', {}, cached=True)
>>> # demo is empty, load from disk
>>> demo.load()
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> demo['c']
<built-in function map>

klepto also has other flags, such as compression and memmode, which can be used to configure how your data is stored (e.g. compression level, memory-map mode, etc.). It is equally easy (the exact same interface) to use a database (MySQL, etc.) as a backend instead of your filesystem. You can also turn off memory caching, so every read/write goes directly to the archive, simply by setting cached=False.
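For example, using the compression flag mentioned above (a sketch; the accepted values are my assumption):

 >>> from klepto.archives import dir_archive
 >>> # store entries compressed, written straight to disk
 >>> demo = dir_archive('demo_zip', {}, serialized=True, compression=9,
 ...                    cached=False)
 >>> demo['words'] = range(1000)   # goes directly to a compressed file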

klepto also provides many caching algorithms (e.g. mru, lru, lfu, etc.) to help you manage your in-memory cache, and it will use the algorithm to do the dump and load to the archive backend for you.
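For instance, one of the cache decorators can be pointed at an archive (a sketch; the cache keyword wiring is my assumption here, so check the klepto docs):

 >>> from klepto import lru_cache
 >>> from klepto.archives import dir_archive
 >>> # keep at most 100 results in memory; the LRU algorithm decides
 >>> # what gets dumped to the archive backend
 >>> @lru_cache(maxsize=100, cache=dir_archive('defs', serialized=True))
 ... def definition(word):
 ...     return heavy_lookup(word)   # hypothetical expensive function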

You can use the cached=False flag to turn memory caching off completely and read and write directly to and from disk or database. If your entries are big enough, you might choose to write to disk and put each entry in its own file. Here is an example that does both.

>>> from klepto.archives import dir_archive
>>> # does not hold entries in memory, each entry will be stored on disk
>>> demo = dir_archive('demo', {}, serialized=True, cached=False)
>>> demo['a'] = 10
>>> demo['b'] = 20
>>> demo['c'] = min
>>> demo['d'] = [1,2,3]

However, while this should greatly reduce load time, it might slow overall execution down a bit... it is usually better to specify a maximum amount to hold in the memory cache and pick a good caching algorithm. You have to play with it to find the right balance for your needs.

Get klepto here: https://github.com/uqfoundation


Source: https://habr.com/ru/post/1200331/

