Python memory usage: txt file is much smaller than the Python list built from it

I have a 543 MB txt file containing a single line of space-separated, UTF-8 encoded tokens:

aaa algeria americansamoa appliedethics accessiblecomputing ada anarchism ...

But when I load this text data into a Python list, it uses ~8 GB of memory (~900 MB for the list and ~7 GB for the tokens themselves):

with open('tokens.txt', 'r') as f:
    tokens = f.read().decode('utf-8').split()

import sys

print sys.getsizeof(tokens)
# 917450944 bytes for the list
print sum(sys.getsizeof(t) for t in tokens)
# 7067732908 bytes for the actual tokens

I expected memory usage to be approximately file size + list overhead = 1.5 GB. Why do the tokens consume so much more memory when loaded into a list?

1 answer

Two reasons:

  • Per-object overhead: everything in CPython is a full C-level object, and on 64-bit Python 2 a unicode object costs 52 bytes before it stores a single character (that is what sys.getsizeof(u'') reports). With roughly 114 million tokens, that fixed overhead alone comes to about 6 GB.

  • Character storage: Python 2's decode turns each str into a unicode object, and Python 2 stores unicode characters in 2 or 4 bytes apiece depending on the build, even when the text is pure ASCII; on a wide (4 bytes/char) build, your 543 MB of mostly one-byte-per-character UTF-8 data expands to over 2 GB of raw character data.

(None of this is specific to your data; it is just how Python objects work. Every Python object carries a fixed header, which is why sys.getsizeof(u'') reports 52 bytes on x64 even for an "empty" string; plain str pays a similar, slightly smaller cost.)
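As a quick illustration of that fixed cost, a minimal sketch (runnable on CPython 3; exact numbers vary by version and build, and will not match the Python 2 figures quoted above):

import sys

# Even "empty" objects are not free: each one carries a fixed C-level header.
# The exact sizes depend on the CPython version and build (64-bit Python 2
# reported 37 bytes for '' and 52 bytes for u'').
for obj in (u'', b'', 0, 0.0, [], ()):
    print(type(obj).__name__, sys.getsizeof(obj))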

If your data is mostly ASCII, you would be much better off on Python 3; modern Py3 (3.3+, IIRC) uses a flexible internal representation for str: a str containing only ASCII or latin-1 characters is stored at 1 byte per character (with a somewhat larger fixed overhead for latin-1 than for pure ASCII), characters inside the BMP cost 2 bytes each, and only strings containing characters outside the BMP pay 4 bytes per character. The fixed overhead of str is also slightly smaller (sys.getsizeof('') == 49, not 52), so you would expect to save roughly 350 MB of per-object overhead and about 1.5 GB of character storage (assuming the data really is almost all ASCII).
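A small sketch of that flexible storage on CPython 3.3+ (it measures the marginal cost per character, so per-version differences in header size cancel out):

import sys

def bytes_per_char(ch, n=10000):
    # Compare a long run of `ch` against a single `ch`; the fixed object
    # header cancels out, leaving the per-character storage cost.
    return (sys.getsizeof(ch * n) - sys.getsizeof(ch)) / (n - 1)

print(bytes_per_char('a'))           # ~1 byte/char: pure ASCII
print(bytes_per_char('\xe9'))        # ~1 byte/char: latin-1
print(bytes_per_char('\u20ac'))      # ~2 bytes/char: BMP character
print(bytes_per_char('\U0001F600'))  # ~4 bytes/char: outside the BMP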

On Py 3 the load is simply:

with open('tokens.txt', 'r', encoding='utf-8') as f:
    tokens = f.read().split()

import sys

print(sys.getsizeof(tokens))
print(sum(sys.getsizeof(t) for t in tokens))

That alone cuts memory use dramatically (for example, on x64 Linux, u'examplestring' occupies 104 bytes on Py2 with a wide, 4-bytes-per-character unicode build, but only 62 bytes as a Py3 str).
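You can check the Py3 figure on your own interpreter with a one-off sketch (the exact count may differ by a few bytes across CPython versions and builds):

import sys

# 13 ASCII characters; the answer quotes 62 bytes on a typical x64 CPython 3
# build, i.e. a 49-byte header plus 13 one-byte characters.
print(sys.getsizeof('examplestring'))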

If you are stuck on Python 2 and your data is overwhelmingly ASCII, you can avoid paying the unicode price for most tokens by leaving them as str; on Py2, str has a smaller fixed overhead than unicode (37 bytes rather than 52) and stores 1 byte per character. Only the tokens that actually contain non-ASCII bytes need to become unicode objects; everything else can stay as str. For example:

# Open in binary mode
with open('tokens.txt', 'rb') as f:
    # Defer decoding: keep ASCII tokens as plain str and only decode the
    # occasional token containing non-ASCII bytes, producing a list of
    # mostly ASCII str objects with a few unicode objects mixed in
    tokens = [w.decode('utf-8') if max(w) > '\x7f' else w
              for w in f.read().split()]

import sys

print sys.getsizeof(tokens)
print sum(sys.getsizeof(t) for t in tokens)

Keeping most tokens as plain str saves 15 bytes of fixed overhead per token (on the order of 1.5-1.7 GB across ~100 million tokens) plus 3 bytes per character of storage, compared with decoding everything to unicode on this Py2 build (the Py 3 equivalent of the trick would be keeping bytes instead of str).
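For completeness, a minimal sketch of that Py 3 analogue under the same assumption that only the occasional token is non-ASCII: read the file as bytes and decode lazily.

import sys

with open('tokens.txt', 'rb') as f:
    raw = f.read().split()

# bytes has a smaller fixed header than str on CPython 3, and skipping the
# decode avoids building a second object for every pure-ASCII token; only
# tokens containing a byte >= 0x80 (a UTF-8 multibyte sequence) get decoded.
tokens = [w.decode('utf-8') if max(w) > 0x7f else w for w in raw]

print(sys.getsizeof(tokens))
print(sum(sys.getsizeof(t) for t in tokens))

The trade-off is the same as on Py2: downstream code has to cope with a mixed list of bytes and str, so this only pays off when non-ASCII tokens are rare.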


Source: https://habr.com/ru/post/1682359/

