Why does a dictionary use so much RAM in Python

I wrote a Python script that reads the contents of two files: the first is relatively small (~30 KB) and the second is larger (~270 MB). The contents of both files are loaded into dictionary data structures. When the second file loads, I expected the amount of RAM used to be roughly equivalent to the size of the file on disk, perhaps with some overhead, but watching the RAM usage on my PC it consistently takes ~2 GB (about 8 times the file size). The corresponding source code is below (pauses are inserted so that I can see RAM usage at each stage). The line consuming the large amount of memory is "tweets = map(json.loads, tweet_file)":

    import sys
    import json

    scores = {}

    def get_scores(term_file):
        global scores
        for line in term_file:
            term, score = line.split("\t")  # tab character
            scores[term] = int(score)

    def pause():
        tmp = raw_input('press any key to continue: ')

    def main():
        # get terms and their scores..
        print 'open word list file ...'
        term_file = open(sys.argv[1])
        pause()

        print 'create dictionary from word list file ...'
        get_scores(term_file)
        pause()

        print 'close word list file ...'
        term_file.close()
        pause()

        # get tweets from file...
        print 'open tweets file ...'
        tweet_file = open(sys.argv[2])
        pause()

        print 'create list of dictionaries from tweets file ...'
        tweets = map(json.loads, tweet_file)  # creates a list of dictionaries (one per tweet)
        pause()

        print 'close tweets file ...'
        tweet_file.close()
        pause()

    if __name__ == '__main__':
        main()

Does anyone know why this is? My concern is that I would like to extend my analysis to larger files, but will quickly run out of memory. Interestingly, memory usage does not increase significantly after merely opening the file (I think that just creates a file object).

My idea is to go through the file one line at a time, process what I can, and store only the minimum I need for later, rather than loading everything into a list of dictionaries. But I was just wondering: does the roughly 8-fold blow-up from file size to memory when creating the dictionaries match other people's experience?
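For example, a rough sketch of that line-by-line approach (just an illustration; it assumes each line of the tweets file is a single JSON object with the usual 'text' field, and that a running total is all I need to keep):

    import json

    def process_tweets(tweet_path, scores):
        """Parse one tweet per line and keep only a running result,
        so at most one parsed tweet is in memory at a time."""
        total = 0
        with open(tweet_path) as tweet_file:
            for line in tweet_file:
                tweet = json.loads(line)  # parsed, used, then discarded
                text = tweet.get('text', u'')
                total += sum(scores.get(word, 0) for word in text.split())
        return total

This keeps memory use roughly constant regardless of file size, at the cost of not being able to go back to earlier tweets.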

3 answers

I assume that you have several copies of your data stored in memory simultaneously (in different formats). As an example, the line:

 tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet) 

creates a new copy (+400 to ~1000 MB, including the dictionaries). But your original tweet_file remains in memory as well. Why such big numbers? Well, if you are working with Unicode strings, each Unicode character uses 2 or 4 bytes in memory, whereas in your file, assuming UTF-8 encoding, most characters use only 1 byte. If you are working with plain byte strings in Python 2, the size of a string in memory should be almost the same as its size on disk, so you would have to find another explanation.

EDIT: The actual number of bytes occupied by a "character" in Python 2 may vary. Here are some examples:

    >>> import sys
    >>> sys.getsizeof("")
    40
    >>> sys.getsizeof("a")
    41
    >>> sys.getsizeof("ab")
    42

As you can see, it seems that each character is encoded as one byte. But:

 >>> sys.getsizeof("à") 42 

Not so for "French" (accented) characters. And...

 >>> sys.getsizeof("世") 43 >>> sys.getsizeof("世界") 46 

For Japanese characters, we get 3 bytes per character.

The above results are platform-dependent, and are explained by the fact that my system uses UTF-8 as its default encoding. The "string size" computed above is actually the size of the byte string representing the given text.

If "json.load" uses "unicode" strings, the result is somehow different:

 >>> sys.getsizeof(u"") 52 >>> sys.getsizeof(u"a") 56 >>> sys.getsizeof(u"ab") 60 >>> sys.getsizeof(u"世") 56 >>> sys.getsizeof(u"世界") 60 

In this case, as you can see, each additional character adds 4 additional bytes.


Maybe the file object is caching some data? If you want an object to be released explicitly, try setting its reference to None:

    tweets = map(json.loads, tweet_file)  # creates a list of dictionaries (one per tweet)
    [...]
    tweet_file.close()
    tweet_file = None

When an object is no longer referenced, Python will deallocate it and thus free the corresponding memory (within the Python heap; I don't think that memory is returned to the operating system).
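As a purely illustrative sketch (not something your current code does), you can drop both references and, if you suspect reference cycles, ask the garbage collector to run explicitly:

    import gc
    import json
    import sys

    tweet_file = open(sys.argv[2])
    tweets = map(json.loads, tweet_file)  # the big list of dicts

    # ... work with tweets ...

    tweet_file.close()
    tweet_file = None  # drop the file object reference
    tweets = None      # drop the list; CPython's reference counting frees the objects now
    gc.collect()       # only matters if there are reference cycles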


I wrote a quick test script to confirm your results ...

    import sys
    import os
    import json
    import resource

    def get_rss():
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024

    def getsizeof_r(obj):
        total = 0
        if isinstance(obj, list):
            for i in obj:
                total += getsizeof_r(i)
        elif isinstance(obj, dict):
            for k, v in obj.iteritems():
                total += getsizeof_r(k) + getsizeof_r(v)
        else:
            total += sys.getsizeof(obj)
        return total

    def main():
        start_rss = get_rss()
        filename = 'foo'
        f = open(filename, 'r')
        l = map(json.loads, f)
        f.close()
        end_rss = get_rss()
        print 'File size is: %d' % os.path.getsize(filename)
        print 'Data size is: %d' % getsizeof_r(l)
        print 'RSS delta is: %d' % (end_rss - start_rss)

    if __name__ == '__main__':
        main()

... which prints ...

    File size is: 1060864
    Data size is: 4313088
    RSS delta is: 4722688

... so I get only a fourfold increase, because each Unicode char takes up four bytes of RAM.

Perhaps you could test your own input file with this script, since I can't explain why you are seeing an eightfold increase.


Have you considered the memory used by the keys? If your dictionary has a lot of small values, the storage for the keys may dominate.
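For example, a quick (illustrative and platform-dependent) way to compare the space taken by keys versus values:

    import sys

    # a dictionary with many short string keys and small integer values
    d = dict(('term%d' % i, i) for i in range(100000))

    key_bytes = sum(sys.getsizeof(k) for k in d)
    value_bytes = sum(sys.getsizeof(v) for v in d.itervalues())

    print 'keys:   %d bytes' % key_bytes
    print 'values: %d bytes' % value_bytes
    print 'dict:   %d bytes (the hash table itself)' % sys.getsizeof(d)

On a typical 64-bit CPython 2 build, each string key carries a few dozen bytes of object overhead on top of its characters, so with many small values the keys (plus the hash table) can easily dominate.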


Source: https://habr.com/ru/post/1488218/

