How much memory will a list with a million elements take up in Python?

Reddit has over a million subreddits, according to redditmetrics.com.

I wrote a script that repeatedly requests this Reddit API endpoint until all subreddits are stored in a list, all_subs:

    all_subs = []
    for sub in <repeated request here>:
        all_subs.append({"name": display_name, "subscribers": subscriber_count})

The script has been running for about ten hours and is about halfway there (it gets rate-limited every three or four requests). When it finishes, I expect a list like this:

    [
        { "name": "AskReddit", "subscribers": 16751677 },
        { "name": "news", "subscribers": 13860169 },
        { "name": "politics", "subscribers": 3350326 },
        ...  # plus about a million more entries
    ]

Approximately how much memory will this list occupy?

1 answer

It depends on your Python version and your system, but I'll show you how to work out how much memory is required. First, note that sys.getsizeof returns only the memory used by the container object itself, not by the elements inside it. From the documentation:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.

If given, default will be returned if the object does not provide means to retrieve the size. Otherwise a TypeError will be raised.

getsizeof() calls the object's __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.

See the recursive sizeof recipe for an example of using getsizeof() recursively to find the size of containers and all their contents.

So, I loaded that recipe into an interactive interpreter session.
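The recipe itself isn't reproduced in the answer; a minimal sketch along the same lines (the name total_size matches how it is used below, but the real recipe also supports custom handlers for other container types) might look like this:

    import sys

    def total_size(obj, seen=None):
        """Approximate the memory footprint of obj plus everything it references."""
        if seen is None:
            seen = set()               # ids of objects already counted, to avoid double-counting
        if id(obj) in seen:
            return 0
        seen.add(id(obj))
        size = sys.getsizeof(obj)
        if isinstance(obj, dict):
            size += sum(total_size(k, seen) + total_size(v, seen) for k, v in obj.items())
        elif isinstance(obj, (list, tuple, set, frozenset)):
            size += sum(total_size(item, seen) for item in obj)
        return size

Objects that have already been counted are remembered in seen, so shared objects are not counted twice.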

So a CPython list is really a heterogeneous, resizable array list. The underlying array holds only pointers to PyObjects, so each element costs one pointer's worth of memory. On a 64-bit system that's 64 bits, i.e. 8 bytes. So, for the container alone, a list of 1,000,000 entries will take approximately 8 million bytes, or about 8 megabytes. Building a list with 1,000,000 entries:

    In [5]: x = []

    In [6]: for i in range(1000000):
       ...:     x.append([])
       ...:

    In [7]: import sys

    In [8]: sys.getsizeof(x)
    Out[8]: 8697464

The extra memory is accounted for by the overhead of the list object itself and by the extra space the underlying array reserves at the end so that .append operations stay efficient.
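You can watch that over-allocation happen with a small illustrative snippet (not from the original answer; the exact sizes and growth pattern depend on your CPython build): the reported size jumps in chunks rather than on every append.

    import sys

    xs = []
    last = sys.getsizeof(xs)
    print(f"len={len(xs):2d}  size={last} bytes")
    for _ in range(32):
        xs.append(None)
        size = sys.getsizeof(xs)
        if size != last:               # the size only changes when the array is re-allocated
            print(f"len={len(xs):2d}  size={size} bytes")
            last = size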

Now, a dictionary in Python is pretty heavyweight. Just the empty container:

    In [10]: sys.getsizeof({})
    Out[10]: 288

So a lower bound for one million dicts is 288,000,000 bytes just for the dict containers. Adding the list's pointer array gives a rough lower bound:

    In [12]: 1000000*288 + 1000000*8
    Out[12]: 296000000

    In [13]: 296000000 * 1e-9  # gigabytes
    Out[13]: 0.29600000000000004

So you can expect about 0.3 gigabytes of memory. Using the recipe and a more realistic dict:

    In [16]: x = []
        ...: for i in range(1000000):
        ...:     x.append(dict(name="my name is what", subscribers=23456644))
        ...:

    In [17]: total_size(x)
    Out[17]: 296697669

So, about 0.3 gigs. That's not much on a modern system. But if you want to save space, you should use a tuple, or even better, a namedtuple:

    In [24]: from collections import namedtuple

    In [25]: Record = namedtuple('Record', "name subscribers")

    In [26]: x = []
        ...: for i in range(1000000):
        ...:     x.append(Record(name="my name is what", subscribers=23456644))
        ...:

    In [27]: total_size(x)
    Out[27]: 72697556

Or, in gigabytes:

    In [29]: total_size(x)*1e-9
    Out[29]: 0.07269755600000001

A namedtuple works just like a tuple, but it also lets you access the fields by name:

    In [30]: r = x[0]

    In [31]: r.name
    Out[31]: 'my name is what'

    In [32]: r.subscribers
    Out[32]: 23456644
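To get a feel for where the savings come from, here is a quick illustrative per-record comparison (not part of the original answer; sizes vary by CPython version, and getsizeof counts only the container, not the strings and ints it references):

    import sys
    from collections import namedtuple

    Record = namedtuple('Record', "name subscribers")

    as_dict = {"name": "my name is what", "subscribers": 23456644}
    as_record = Record(name="my name is what", subscribers=23456644)

    print(sys.getsizeof(as_dict))      # dict container alone
    print(sys.getsizeof(as_record))    # namedtuple container alone

A namedtuple stores its fields in a fixed-size tuple layout and keeps the field names on the class, so each instance avoids carrying its own hash table the way a dict does.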

Source: https://habr.com/ru/post/1266721/

