Two reasons:
Strings in CPython carry significant per-object overhead in the C layer; on a 64-bit build, a Python 2 unicode object costs 52 bytes before it stores a single character. With 1.14M unicode objects, that's close to 60 MB of pure overhead, even if every one of them were empty (u'').
When Python 2 decodes a str to unicode, it stores each character in 2 or 4 bytes (narrow vs. wide build), even when the text is pure ASCII; on a wide build that's 4 bytes/char. So, for example, 543 MB of ASCII bytes balloon past 2 GB once decoded.
(The exact figures depend on your Python build); on my x64 Python 2, sys.getsizeof(u'') reports 52, so even the empty string "" is noticeably more expensive as unicode than as str.
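A quick back-of-the-envelope check of that overhead claim (using the 1.14M token count and the 52-byte x64 Python 2 figure from above):

```python
# Per-object overhead alone for 1.14M Python 2 unicode objects,
# before a single character of actual data is stored.
n_tokens = 1_140_000
per_object_overhead = 52          # x64 Python 2 unicode, per sys.getsizeof(u'')
total = n_tokens * per_object_overhead
print(total)                      # 59280000 bytes
print(round(total / 2**20, 1))    # about 56.5 MiB of pure overhead
```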
If the data is mostly ASCII, you can cut memory dramatically by moving to Python 3; Py3 (3.3+, IIRC) uses a flexible internal representation for str: a str whose characters all fit in ASCII/latin-1 is stored at 1 byte/char (latin-1 if any ordinal is above 0x7f, plain ASCII otherwise, still 1 byte), widening to 2 bytes/char only for BMP characters and 4 bytes/char beyond the BMP. The empty str is also slightly cheaper (sys.getsizeof('') == 49, vs. 52), so the same data could occupy around 350 MB instead of 1.5 GB (assuming it's all ASCII).
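You can observe the flexible representation directly (a sketch assuming 64-bit CPython 3.3+; the base object size varies by version, but the per-character cost does not):

```python
import sys

# Each string is 10 characters; only the widest character determines
# the per-character storage width.
ascii_s = 'a' * 10            # all ASCII -> 1 byte/char
bmp_s = '\u0394' * 10         # Greek Delta, in the BMP -> 2 bytes/char
astral_s = '\U0001F600' * 10  # emoji, beyond the BMP -> 4 bytes/char

# Appending one more character grows the object by 1, 2, or 4 bytes.
print(sys.getsizeof('a' * 11) - sys.getsizeof(ascii_s))            # 1
print(sys.getsizeof('\u0394' * 11) - sys.getsizeof(bmp_s))         # 2
print(sys.getsizeof('\U0001F600' * 11) - sys.getsizeof(astral_s))  # 4
```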
On Py 3, the code would be:
with open('tokens.txt', 'r', encoding='utf-8') as f:
    tokens = f.read().split()

import sys
print(sys.getsizeof(tokens))
print(sum(sys.getsizeof(t) for t in tokens))
Either way, the per-string cost drops sharply (for example, on x64 Linux, u'examplestring' takes 104 bytes on Py2 with 4 bytes/char unicode, while the same str takes only 62 bytes on Py3).
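On Python 3 you can verify that per-character math yourself; the empty-string base differs across CPython versions, but an all-ASCII str always costs the base plus exactly one byte per character:

```python
import sys

s = 'examplestring'             # 13 ASCII characters
base = sys.getsizeof('')        # 49 on the x64 builds discussed above
print(sys.getsizeof(s) - base)  # 13: one byte per character
```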
Alternatively, you could stay on Py2 and keep tokens as str rather than unicode when they're pure ASCII; Py2 str has lower per-object overhead (37 bytes vs. 52) and stores 1 byte/char. Decode to unicode only the tokens that actually contain non-ASCII bytes. Roughly:
with open('tokens.txt', 'rb') as f:
    tokens = [w.decode('utf-8') if max(w) > '\x7f' else w
              for w in f.read().split()]

import sys
print sys.getsizeof(tokens)
print sum(sys.getsizeof(t) for t in tokens)
Mixing str and unicode this way should bring the ~1.7 GB down to roughly 1.5 GB on Py2 (the equivalent trick on Py 3 is keeping bytes instead of str).
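A sketch of that bytes-on-Py-3 variant; the tiny file written here is a hypothetical stand-in for the real tokens.txt:

```python
import sys

# Write a small stand-in for tokens.txt: two ASCII tokens, one non-ASCII.
with open('tokens.txt', 'wb') as f:
    f.write('spam eggs caf\u00e9\n'.encode('utf-8'))

# Keep pure-ASCII tokens as cheap bytes; decode only those containing a
# byte above 0x7f. (Iterating bytes on Py3 yields ints, hence 0x7f, not
# the '\x7f' character used in the Py2 version.)
with open('tokens.txt', 'rb') as f:
    tokens = [w.decode('utf-8') if max(w) > 0x7f else w
              for w in f.read().split()]

print(tokens)  # [b'spam', b'eggs', 'café'] -- only the last was decoded
print(sum(sys.getsizeof(t) for t in tokens))
```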