Reduce the memory used by a large dict

I need to create an in-memory object whose keys are 9-digit integers, each with an associated boolean. I used a dict, as in the simplified example below:

 #!/usr/bin/python
 from __future__ import print_function
 import sys

 myDict = {}
 for n in range(56000):
     myDict[n] = True
 print('Count:', len(myDict), ' Size:', sys.getsizeof(myDict))

I need to be able to look up and retrieve the boolean associated with each key. The problem is the size of the dict. Using Python 2.7 on a 64-bit Linux system, the dict is 3.1 megabytes according to sys.getsizeof() (about 56 bytes per record for storing 9 digits plus a boolean).

I need to store the logical state of approximately 55,000 entries in the dict. Each key is a 9-digit integer. I tried using both an integer and str(theInteger) as keys, with no change in the dict's size.

Is there any other data structure or methodology that I should use to save memory with such a large dataset?

5 answers

If you are looking up a boolean value by an integer key, and the range of keys starts at 0 and is contiguous, there is no reason not to use a list:

 my_list = []
 for n in range(56000):
     my_list.append(True)

or better:

 my_list = [True for n in range(56000)]

If this is not enough, try the array module and use one byte per bool:

 import array
 my_array = array.array("b", (True for n in range(56000)))

And if that is not enough, try a bit array: one bit per value.

Another idea is to use a set: say you have many more False values than True ones; then just build a set of the True keys:

 my_true_numbers = {0, 2323, 23452}  # just the True ones

and check with

 value = number in my_true_numbers 

If you have more True than False , do the opposite.
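For a rough comparison at the question's sizes, the three alternatives can be measured side by side (a sketch; exact numbers vary by Python version, and getsizeof reports only the container itself, not the boxed int keys):

```python
import array
import sys

# Build the three structures for the same 56,000 keys, all True.
keys = range(56000)

as_dict = {n: True for n in keys}               # the original approach
as_array = array.array("b", (1 for _ in keys))  # one signed byte per value
true_keys = set(keys)                           # membership test: n in true_keys

print("dict :", sys.getsizeof(as_dict))
print("array:", sys.getsizeof(as_array))
print("set  :", sys.getsizeof(true_keys))
```

The array wins by a wide margin because it stores raw bytes with implicit keys; the set beats the dict because each entry drops the value slot.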


The accepted answer to Python: Reducing memory usage of dictionary suggests that you cannot do much there, and I agree. The dict's own overhead is small; with this many key-value pairs, it is the entries themselves that dominate the memory footprint.

One thing you could do: if the keys are really contiguous, you can store a plain list of booleans, or better, use bitarray. The keys then become implicit. But if that only holds in your simplified example, there is not much to gain.
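As a sketch of the implicit-key idea, here is a stdlib stand-in for the third-party bitarray package, packing one boolean per bit into a bytearray (the helper names are mine, not from the answer):

```python
# One bit per key: the key itself is just the bit's position.
n_keys = 56000
bits = bytearray((n_keys + 7) // 8)   # ~7 KB for 56,000 booleans

def set_true(n):
    bits[n >> 3] |= 1 << (n & 7)

def set_false(n):
    bits[n >> 3] &= ~(1 << (n & 7)) & 0xFF

def get(n):
    return bool(bits[n >> 3] & (1 << (n & 7)))

set_true(12345)
print(get(12345))   # True
print(get(12346))   # False
```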


Why not use a giant bitfield? You encode each entry in two bits, since you need at least three states: true, false, and not-initialized. The total memory used will be 55,000 * 2 bits = 110,000 bits, or about 13.4 kilobytes.

(A badly drawn diagram illustrated the layout: for each entry, one "set" flag bit followed by one value bit.)

The "set" flag is there to ensure the value has actually been initialized by the user (optional), and the second bit holds the value.

Using 64-bit unsigned integers, you only need 1,719 of them (110,000 bits / 64, rounded up) to store the entire array.

Then you can access entries by bit index: say you want the value at index 123; you need bits #246 and #247 (one for the "set" flag and one for the value).

Since 246 // 64 == 3, both bits are stored in the fourth 64-bit word. To read the "set" flag for index 123:

 return (array[246 // 64] >> (246 % 64)) & 1

To read any bit n:

 return (array[n // 64] >> (n % 64)) & 1

(bit-access code not tested)

To set bit n:

 array[n // 64] |= 1 << (n % 64)

Bitwise operations are tricky (arithmetic versus logical shifts) and not easy to debug, but they can be extremely powerful.
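Pulling the pieces above together, a minimal runnable sketch of the two-bit scheme might look like this (the class and method names are mine, not from the answer; plain Python ints stand in for 64-bit unsigned words):

```python
WORD = 64  # bits per storage word

class TwoBitField:
    """55,000 entries, two bits each: a 'set' flag plus a value bit."""

    def __init__(self, n_entries):
        n_bits = n_entries * 2
        self.words = [0] * ((n_bits + WORD - 1) // WORD)

    def _locate(self, bit):
        # Word index and offset inside that word.
        return bit // WORD, bit % WORD

    def set(self, index, value):
        flag_bit = index * 2
        w, off = self._locate(flag_bit)
        self.words[w] |= 1 << off            # mark entry as initialized
        w, off = self._locate(flag_bit + 1)
        if value:
            self.words[w] |= 1 << off
        else:
            self.words[w] &= ~(1 << off)

    def get(self, index):
        """Return True/False, or None if the entry was never set."""
        flag_bit = index * 2
        w, off = self._locate(flag_bit)
        if not (self.words[w] >> off) & 1:
            return None
        w, off = self._locate(flag_bit + 1)
        return bool((self.words[w] >> off) & 1)

field = TwoBitField(55000)
field.set(123, True)     # touches bits #246 and #247, i.e. word 3
print(field.get(123))    # True
print(field.get(124))    # None: never initialized
```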


If "key not found" does not matter to you (i.e., you are fine with treating a missing key as False), you can use a set instead, storing only the keys whose value is True. This takes roughly 30% less space, because each entry holds only two 64-bit values (hash and key) instead of three (hash, key, value).

Keep in mind that sys.getsizeof(dict) only reports the size of the dict itself, not the objects it contains. Creating 56,000 int objects for the keys also carries its own cost: 24 bytes per integer (type pointer, refcount, value). That is another 1.3 MB on top of the memory occupied by the dict.

To really save space, you can use a SciPy compressed sparse matrix:

 from scipy.sparse import lil_matrix  # linked-list matrix, used to construct the CSR matrix

 # We will use 0 = no such key, 1 = True, 2 = False
 vals = lil_matrix((1, 1000000000), dtype='int8')
 for n in myIndices:
     vals[0, n] = 1
 vals = vals.tocsr()

The memory usage of vals is very small: 56 KB for the data, 224 KB for the indices, and under 1 KB for the other structures, so less than 281 KB in total (about ten times smaller than the dict), with no extra boxed integers. Looking items up and changing non-zero items is very fast (binary search in a sorted array), but inserting a new non-zero value or zeroing an existing non-zero value is expensive.


Depending on your needs, you can use a list to store your values. This will only use about 16% of the space that the dictionary uses, but some operations, such as searching and inserting, will be (possibly a lot) slower.

 values = list(range(56000)) 

If you use the bisect module and keep your values in a sorted list, your lookups will still be slower than with a dict, but much faster than the naive x in my_list.

The list should always be stored in sorted order. To check if there is a value in your list, you can use this function:

 from bisect import bisect_left

 def is_in_list(values, x):
     i = bisect_left(values, x)
     return i != len(values) and values[i] == x

It works as follows:

 >>> is_in_list([2, 4, 14, 15], 4)
 True
 >>> is_in_list([2, 4, 14, 15], 1)
 False
 >>> is_in_list([2, 4, 14, 15], 13)
 False

This method significantly reduces memory usage, but - compared to a dict or set - a lookup takes O(log n) instead of O(1), and an insert takes O(n) instead of O(1).
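The O(n) insert is still convenient to perform via bisect.insort, which keeps the list sorted for you (a small illustrative sketch; the sample 9-digit values are made up):

```python
from bisect import bisect_left, insort

values = [100000001, 100000004, 100000014]
insort(values, 100000007)   # shifts the tail, keeps the list sorted

def is_in_list(values, x):
    i = bisect_left(values, x)
    return i != len(values) and values[i] == x

print(values)                          # [100000001, 100000004, 100000007, 100000014]
print(is_in_list(values, 100000007))   # True
```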


Source: https://habr.com/ru/post/953157/

