Python json module fails on a 2.2 GB file

I am trying to decode a large UTF-8 JSON file (2.2 GB). I load the file as follows:

    f = codecs.open('output.json', encoding='utf-8')
    data = f.read()

If I try any of json.load, json.loads or json.JSONDecoder().decode, I get this error:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-40-fc2255017b19> in <module>()
    ----> 1 j = jd.decode(data)

    /usr/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
        367         end = _w(s, end).end()
        368         if end != len(s):
    --> 369             raise ValueError(errmsg("Extra data", s, end, len(s)))
        370         return obj
        371

    ValueError: Extra data: line 1 column -2065998994 - line 1 column 2228968302 (char -2065998994 - 2228968302)


uname -m shows x86_64 and

    > python -c 'import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)'
    ('7fffffffffffffff', True)

so I should be on 64-bit, and integer size should not be a problem.

However, if I run:

    jd = json.JSONDecoder()
    len(data)
    # 2228968302
    j = jd.raw_decode(data)
    j[1]
    # 2228968302

The second value in the tuple returned by raw_decode is the index where the parsed document ends. It equals len(data), so raw_decode seems to parse the entire file, with no garbage at the end.

So, is there something I have to do differently with json? Does raw_decode decode the whole file? Why is json.load(s) not working?

1 answer

I would add this as a comment, but the formatting options in the comments are too limited.

Looking at the source code,

 raise ValueError(errmsg("Extra data", s, end, len(s))) 

calls this function:

    def errmsg(msg, doc, pos, end=None):
        ...
        fmt = '{0}: line {1} column {2} - line {3} column {4} (char {5} - {6})'
        return fmt.format(msg, lineno, colno, endlineno, endcolno, pos, end)

The format part (char {5} - {6}) is the part of the error message that you showed:

 (char -2065998994 - 2228968302) 

So, in errmsg(), pos is -2065998994 and end is 2228968302. Now look at this ;-):

    >>> pos = -2065998994
    >>> end = 2228968302
    >>> 2**32 + pos
    2228968302L
    >>> 2**32 + pos == end
    True

That is, pos and end are "really" the same. Back where errmsg() was called from, that means end and len(s) are really the same too, but end is being treated as a 32-bit signed integer. end, in turn, comes from the end() method of a regular expression match object.

So the real problem here is a 32-bit limitation/assumption in the regexp engine. I recommend that you open a bug report!
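The wraparound can be reproduced without a 2.2 GB file. This is only a sketch: it assumes the offset gets truncated to a signed 32-bit C int somewhere inside the regexp engine, and the helper name as_int32 is made up for illustration.

```python
def as_int32(n):
    """Reinterpret the low 32 bits of n as a signed 32-bit integer (hypothetical helper)."""
    n &= 0xFFFFFFFF                    # keep only the low 32 bits
    return n - 2**32 if n >= 2**31 else n

length = 2228968302                    # len(data) from the question
wrapped = as_int32(length)             # what a 32-bit signed field would report

print(wrapped)                         # -2065998994, the "pos" in the error message
print(wrapped != length)               # True, so the "end != len(s)" check fires
```

Any string longer than 2**31 - 1 characters would hit the same mismatch under this assumption, regardless of its contents.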

Later: to answer your questions, yes, raw_decode() decodes the whole file. The other methods call raw_decode() too, but add sanity checks afterwards (and those are what fail here!).
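Until the underlying bug is fixed, one possible workaround is to call raw_decode() directly and do the trailing-data check yourself. A minimal sketch, assuming the text starts with the JSON value (raw_decode() does not skip leading whitespace) and that the end index it returns is correct, as it was in the question. The function name decode_whole is made up for illustration.

```python
import json

def decode_whole(text):
    """Decode one JSON document via raw_decode() and check the tail ourselves."""
    obj, end = json.JSONDecoder().raw_decode(text)
    # Only whitespace may follow the document; anything else is real extra data.
    if text[end:].strip():
        raise ValueError("Extra data starting at char %d" % end)
    return obj

result = decode_whole('{"a": [1, 2, 3]}  ')
print(result["a"])
```

This skips the regexp-based whitespace match whose end() value is compared with len(s) in decode(), which is where the reported offset went wrong.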


Source: https://habr.com/ru/post/955378/

