I would add this as a comment, but the formatting options in the comments are too limited.
In the source code,
raise ValueError(errmsg("Extra data", s, end, len(s)))
calls this function:
    def errmsg(msg, doc, pos, end=None):
        ...
        fmt = '{0}: line {1} column {2} - line {3} column {4} (char {5} - {6})'
        return fmt.format(msg, lineno, colno, endlineno, endcolno, pos, end)
The (char {5} - {6}) part of the format string produces the part of the error message that you showed:
(char -2065998994 - 2228968302)
So, in errmsg(), pos is -2065998994 and end is 2228968302. Look at this ;-):
    >>> pos = -2065998994
    >>> end = 2228968302
    >>> 2**32 + pos
    2228968302L
    >>> 2**32 + pos == end
    True
That is, pos and end are "really" the same number. Back where errmsg() is called, this means that end and len(s) are really the same too, but end is being treated as a 32-bit signed integer. end, in turn, comes from the end() method of a regular expression match object.
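To make the wraparound concrete, here is a small sketch of what "treated as a 32-bit signed integer" does to an offset that large. The helper name as_signed_32 is my own, not something from the json or re modules, and the snippet is written for Python 3 (so no trailing L on the numbers). It only mimics the arithmetic; it is not the code the regexp engine actually runs.

```python
def as_signed_32(n):
    """Reinterpret the low 32 bits of n as a signed two's-complement integer.

    Hypothetical helper for illustration only.
    """
    n &= 0xFFFFFFFF      # keep only the low 32 bits
    if n >= 2**31:       # top bit set, so the signed value is negative
        n -= 2**32
    return n


true_end = 2228968302            # the real offset, larger than 2**31 - 1
print(as_signed_32(true_end))    # -2065998994
```

Any offset up to 2**31 - 1 comes back unchanged, which would explain why the problem only shows up with very large inputs.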
So the real problem here is a 32-bit restriction/assumption in the regexp engine. I recommend that you open a bug report!
Later: to answer your questions, yes, raw_decode() decodes the whole file. The other methods call raw_decode() too, but then add sanity checks afterwards (and those checks are what is failing!).
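As a rough illustration of that split, the sketch below calls raw_decode() directly on a tiny made-up string with trailing text, then runs the same string through json.loads(), which performs the "Extra data" check afterwards. It is written for Python 3 and only shows the difference in behaviour on a small input; I have not tried it on a file big enough to trigger the 32-bit problem.

```python
import json

decoder = json.JSONDecoder()
text = '{"a": 1} trailing'   # made-up sample: one JSON value plus extra text

# raw_decode() parses one value and reports where it stopped,
# without complaining about whatever follows.
obj, end = decoder.raw_decode(text)
print(obj, end)              # {'a': 1} 8

# json.loads() goes through decode(), which calls raw_decode() and then
# checks that nothing but whitespace remains.
try:
    json.loads(text)
    failed = False
except ValueError as exc:    # JSONDecodeError is a subclass of ValueError
    failed = True
    print("sanity check failed:", exc)
```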