Storing a file in lists uses 10x the file size in memory

I have an ASCII file that is essentially a grid of 16-bit integers; its size on disk is about 300 MB. I do not need to read the file into memory, but I do need to store its contents as a single container (of containers). For initial testing of memory use, I tried lists and tuples as the inner containers, with the outer container always a list built by a list comprehension:

with open(file, 'r') as f:
    for _ in range(6):
        t = next(f) # skipping some header lines
    # Method 1
    grid = [line.strip().split() for line in f] # produces a 3.3GB container
    # Method 2 (on another run)
    grid = [tuple(line.strip().split()) for line in f] # produces a 3.7GB container

After discussing how the grid will be used with the team, I need to keep it as a list of lists up to a certain point, after which I will convert it to a list of tuples for running the program.

What I'm curious about is how a 300 MB file can have its lines stored in a container of containers and end up with a total size 10 times the original file size. Does each container really take up that much space to hold one line?

1 answer

If you are concerned about storing the data in memory and do not want to use tools outside the standard library, take a look at the array module. It is designed to store numbers compactly in memory, and the array.array class accepts different type codes depending on the characteristics of the numbers you want to store. The following is a simple demonstration of how you can adapt the module for your use:

#! /usr/bin/env python3
import array
import io
import pprint
import sys

CONTENT = '''\
Header 1
Header 2
Header 3
Header 4
Header 5
Header 6
 0 1 2 3 4 -5 -6 -7 -8 -9 
 -9 -8 -7 -6 -5 4 3 2 1 0 '''


def main():
    with io.StringIO(CONTENT) as file:
        for _ in range(6):
            next(file)
        grid = tuple(array.array('h', map(int, line.split())) for line in file)
    print('Grid takes up', get_size_of_grid(grid), 'bytes of memory.')
    pprint.pprint(grid)


def get_size_of_grid(grid):
    return sys.getsizeof(grid) + sum(map(sys.getsizeof, grid))


if __name__ == '__main__':
    main()
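
To see where the extra memory in the question comes from, it may help to compare one row stored as a list of strings (what line.strip().split() produces) with the same row stored in an array.array. The sketch below is only an illustration: the row of 1000 three-digit values is made up, deep_size_of_str_row is a hypothetical helper, and the figures reported by sys.getsizeof are specific to CPython and vary between versions.

```python
import array
import sys


def deep_size_of_str_row(row):
    """Size of the list itself plus every string object it references."""
    return sys.getsizeof(row) + sum(sys.getsizeof(s) for s in row)


# Hypothetical row: 1000 three-digit integers separated by single spaces.
line = ' '.join(str(100 + n % 900) for n in range(1000))

str_row = line.split()                               # one str object per number
int_row = array.array('h', map(int, line.split()))   # packed signed 16-bit values

text_bytes = len(line)                     # ASCII text: one byte per character
str_bytes = deep_size_of_str_row(str_row)
array_bytes = sys.getsizeof(int_row)

print('Text on disk:    ', text_bytes, 'bytes')
print('List of str:     ', str_bytes, 'bytes')
print("array.array('h'):", array_bytes, 'bytes')
```

Each short string carries a fixed per-object overhead of several dozen bytes in CPython, and the list adds a pointer per element on top of that, so a row of short numeric strings can easily take ten or more times the size of its text. The array stores roughly two bytes per value plus a small fixed header.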

Source: https://habr.com/ru/post/1691981/

