How to handle a huge stream of JSON dictionaries?

I have a file that contains a stream of JSON dictionaries, such as:

{"menu": "a"}{"c": []}{"d": [3, 2]}{"e": "}"} 

It also contains nested dictionaries, and it looks like I can't rely on newlines as delimiters. I need a parser that can be used like this:

    for d in getobjects(f):
        handle_dict(d)

Ideally, iteration would happen only at the root level. Is there a Python parser that handles all the JSON quirks? I am interested in a solution that works on files that do not fit in RAM.

+6

3 answers

I think JSONDecoder.raw_decode may be what you are looking for. You may need to massage the input a little, depending on newlines and so on, but with a bit of work you can probably get something going. See this example:

    import json

    jstring = '{"menu": "a"}{"c": []}{"d": [3, 2]}{"e": "}"}'
    substr = jstring
    decoder = json.JSONDecoder()
    while len(substr) > 0:
        data, index = decoder.raw_decode(substr)
        print data
        substr = substr[index:]

Gives output:

    {u'menu': u'a'}
    {u'c': []}
    {u'd': [3, 2]}
    {u'e': u'}'}
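As an aside (not part of the original answer): the same loop in Python 3 only needs print as a function, the u'' prefixes disappear from the output, and stripping any white space between objects keeps raw_decode happy. A minimal sketch:

    import json

    jstring = '{"menu": "a"}{"c": []}{"d": [3, 2]}{"e": "}"}'
    decoder = json.JSONDecoder()
    substr = jstring
    while substr:
        data, index = decoder.raw_decode(substr)
        print(data)                       # e.g. {'menu': 'a'}
        substr = substr[index:].lstrip()  # drop white space between objects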
+5

Here you go: a tested solution based on @Brien's answer above.

It should handle input files of arbitrary size. It is a generator, so it yields dictionary objects one at a time as it parses them out of the input JSON file.

If you run it standalone, it executes three test cases (in the if __name__ == "__main__" block).

To make it read from standard input, simply pass sys.stdin as the input-file argument (see the usage sketch after the code).

    import json

    _DECODER = json.JSONDecoder()

    _DEFAULT_CHUNK_SIZE = 4096
    _MB = (1024 * 1024)
    _LARGEST_JSON_OBJECT_ACCEPTED = 16 * _MB  # default to 16 megabytes

    def json_objects_from_file(input_file,
            chunk_size=_DEFAULT_CHUNK_SIZE,
            max_size=_LARGEST_JSON_OBJECT_ACCEPTED):
        """
        Read an input file, and yield up each JSON object parsed from the file.

        Allocates minimal memory so should be suitable for large input files.
        """
        buf = ''
        while True:
            temp = input_file.read(chunk_size)
            if not temp:
                break

            # Accumulate more input to the buffer.
            #
            # The decoder is confused by leading white space before an object.
            # So, strip any leading white space if any.
            buf = (buf + temp).lstrip()
            while True:
                try:
                    # Try to decode a JSON object.
                    x, i = _DECODER.raw_decode(buf)

                    # If we got back a dict, we got a whole JSON object.  Yield it.
                    if type(x) == dict:
                        # First, chop the decoded JSON out of the buffer.
                        # Also strip any leading white space if any.
                        buf = buf[i:].lstrip()
                        yield x
                except ValueError:
                    # Either the input is garbage or we got a partial JSON object.
                    # If it's a partial, maybe appending more input will finish it,
                    # so catch the error and keep handling input lines.
                    #
                    # Note that if you feed in a huge file full of garbage, this
                    # buffer will grow very large.  Blow up before reading an
                    # excessive amount of data.
                    if len(buf) >= max_size:
                        raise ValueError("either bad input or too-large JSON object.")
                    break

        buf = buf.strip()
        if buf:
            if len(buf) > 70:
                buf = buf[:70] + '...'
            raise ValueError('leftover stuff from input: "{}"'.format(buf))

    if __name__ == "__main__":
        from StringIO import StringIO

        jstring = '{"menu":\n"a"}{"c": []\n}\n{\n"d": [3,\n 2]}{\n"e":\n "}"}'
        f = StringIO(jstring)
        correct = [{u'menu': u'a'}, {u'c': []}, {u'd': [3, 2]}, {u'e': u'}'}]
        result = list(json_objects_from_file(f, chunk_size=3))
        assert result == correct

        f = StringIO(' ' * (17 * _MB))
        correct = []
        result = list(json_objects_from_file(f, chunk_size=_MB))
        assert result == correct

        f = StringIO('x' * (17 * _MB))
        correct = "ok"
        try:
            result = list(json_objects_from_file(f, chunk_size=_MB))
        except ValueError:
            result = correct
        assert result == correct
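For example, a usage sketch reading from standard input (handle_dict is the placeholder name from the question, not part of this answer; same Python 2 style as the code above):

    import sys

    def handle_dict(d):
        # Stand-in for whatever per-object processing you need.
        print d

    for d in json_objects_from_file(sys.stdin):
        handle_dict(d)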
+2

Here is a partial solution, but it keeps slowing down as it parses more input:

    #!/usr/bin/env pypy

    import json
    import cStringIO
    import sys

    def main():
        BUFSIZE = 10240
        f = sys.stdin
        decoder = json.JSONDecoder()
        io = cStringIO.StringIO()
        do_continue = True
        while True:
            read = f.read(BUFSIZE)
            if len(read) < BUFSIZE:
                do_continue = False
            io.write(read)
            try:
                data, offset = decoder.raw_decode(io.getvalue())
                print(data)
                rest = io.getvalue()[offset:]
                if rest.startswith('\n'):
                    rest = rest[1:]
                io = cStringIO.StringIO()
                io.write(rest)
            except ValueError, e:
                #print(e)
                #print(repr(io.getvalue()))
                continue
            if not do_continue:
                break

    if __name__ == '__main__':
        main()
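A hedged guess at the slowdown: this version decodes at most one object per chunk read, so the unparsed backlog (and the full-buffer copy that io.getvalue() makes on every attempt) can grow without bound. A minimal sketch (illustrative names, same Python 2 style, along the lines of the generator answer above) that drains every complete object from a plain string buffer before reading more:

    import json
    import sys

    def iter_json(f, bufsize=10240):
        # Illustrative helper, not a library API: yields top-level JSON
        # objects from a file-like stream, trimming the buffer as it goes.
        decoder = json.JSONDecoder()
        buf = ''
        while True:
            chunk = f.read(bufsize)
            if not chunk and not buf.strip():
                return  # clean EOF: nothing left but (maybe) white space
            buf = (buf + chunk).lstrip()
            while buf:
                try:
                    obj, end = decoder.raw_decode(buf)
                except ValueError:
                    if not chunk:
                        raise  # EOF with an undecodable remainder
                    break      # partial object: read more input
                buf = buf[end:].lstrip()  # drop the consumed prefix
                yield obj

    if __name__ == '__main__':
        for d in iter_json(sys.stdin):
            print d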
0

Source: https://habr.com/ru/post/989026/

