Extracting multiple JSON objects from one file in Python

I am very new to JSON files. If I have a JSON file with multiple JSON objects, for example:

    {"ID":"12345", "Timestamp":"20140101", "Usefulness":"Yes", "Code":[{"event1":"A","result":"1"},…]}
    {"ID":"1A35B", "Timestamp":"20140102", "Usefulness":"No", "Code":[{"event1":"B","result":"1"},…]}
    {"ID":"AA356", "Timestamp":"20140103", "Usefulness":"No", "Code":[{"event1":"B","result":"0"},…]}
    …

I want to extract all the Timestamp and Usefulness values into a data frame:

       Timestamp Usefulness
    0   20140101        Yes
    1   20140102         No
    2   20140103         No
    …

Does anyone know a general way to deal with such problems? Thanks!

+29
5 answers

Use a JSON array, in the format:

    [
      {"ID":"12345", "Timestamp":"20140101", "Usefulness":"Yes", "Code":[{"event1":"A","result":"1"},…]},
      {"ID":"1A35B", "Timestamp":"20140102", "Usefulness":"No", "Code":[{"event1":"B","result":"1"},…]},
      {"ID":"AA356", "Timestamp":"20140103", "Usefulness":"No", "Code":[{"event1":"B","result":"0"},…]},
      ...
    ]

Then load it in your Python code:

    import json

    with open('file.json') as json_file:
        data = json.load(json_file)

Now data is a list whose entries are dictionaries, one per element of the array.

You can access it easily, e.g.:

 data[0]["ID"] 
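To get from the loaded list to the frame the question asks for, you can pull out just the two fields. A minimal sketch (the array below is embedded as a string only so the snippet stands alone; the pandas step assumes pandas is installed, so it is left as a comment):

```python
import json

# Stand-in for json.load(json_file) on a file containing a JSON array.
data = json.loads("""[
  {"ID": "12345", "Timestamp": "20140101", "Usefulness": "Yes"},
  {"ID": "1A35B", "Timestamp": "20140102", "Usefulness": "No"},
  {"ID": "AA356", "Timestamp": "20140103", "Usefulness": "No"}
]""")

# Keep only the two columns of interest.
rows = [{"Timestamp": d["Timestamp"], "Usefulness": d["Usefulness"]} for d in data]
print(rows[0])  # {'Timestamp': '20140101', 'Usefulness': 'Yes'}

# With pandas installed, this list of dicts becomes a DataFrame directly:
# import pandas as pd
# df = pd.DataFrame(rows)
```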
+22

You can use json.JSONDecoder.raw_decode to decode arbitrarily large strings of "stacked" JSON (as long as they fit in memory). raw_decode stops once it has a valid object and returns the last position that was not part of the parsed object. It is not documented, but you can pass this position back to raw_decode and it will start parsing again from there. Unfortunately, the Python json module does not accept strings with leading whitespace, so we need a search to find the first non-whitespace position in the document.
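The position hand-off can be seen in isolation before wrapping it in a generator. A small sketch (the two-object string is made up for illustration), skipping the whitespace manually between calls:

```python
from json import JSONDecoder

decoder = JSONDecoder()
s = '{"ID": "12345"} {"ID": "1A35B"}'

# First call: parse one object and get the index just past it.
obj1, end = decoder.raw_decode(s, 0)

# raw_decode rejects leading whitespace, so skip it before
# handing the position back for the next object.
while end < len(s) and s[end].isspace():
    end += 1

obj2, end = decoder.raw_decode(s, end)
print(obj1, obj2)  # {'ID': '12345'} {'ID': '1A35B'}
```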

    from json import JSONDecoder, JSONDecodeError
    import re

    NOT_WHITESPACE = re.compile(r'[^\s]')

    def decode_stacked(document, pos=0, decoder=JSONDecoder()):
        while True:
            match = NOT_WHITESPACE.search(document, pos)
            if not match:
                return
            pos = match.start()
            try:
                obj, pos = decoder.raw_decode(document, pos)
            except JSONDecodeError:
                # do something sensible if there is some error
                raise
            yield obj

    s = """
    {"a": 1}
    [1, 2]
    """

    for obj in decode_stacked(s):
        print(obj)

prints:

    {'a': 1}
    [1, 2]
+62

So, as mentioned in the comments, wrapping the data in an array is easier, but that solution does not scale well as the size of the data set increases. You should really only load the whole array when you want random access to an object in it; otherwise, generators are the way to go. Below I prototype a reader function that reads each JSON object individually and returns a generator.

The basic idea is to tell the reader to split on the newline character "\n" (or "\r\n" on Windows). Python can do this with the file.readline() function.

    import json

    def json_readr(file):
        for line in open(file, mode="r"):
            yield json.loads(line)

However, this method only works when the file is written that way, with each object separated by a newline character. Below is an example of a writer that takes an array of JSON objects and saves each one on its own line.

    def json_writr(file, json_objects):
        f = open(file, mode="w")
        for jsonobj in json_objects:
            jsonstr = json.dumps(jsonobj)
            f.write(jsonstr + "\n")
        f.flush()
        f.close()

You can also do the same operation with file.writelines() and a list comprehension:

    ...
    jsobjs = [json.dumps(j) + "\n" for j in json_objects]
    f.writelines(jsobjs)
    ...

And if you want to append data instead of writing a new file, just change mode="w" to mode="a".

In the end, I find this helps a lot, not only with readability when I open the files in a text editor, but also with using memory more efficiently.

On that note, if you change your mind at some point and want a list after all, Python lets you pass the generator function to list() and populate the list automatically. In other words, just write

 lst = list(json_readr(file)) 
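Putting the reader and writer together, a quick round trip might look like this. A sketch under the assumptions of this answer (both helpers are redefined here so the snippet stands alone; the file path is made up via tempfile):

```python
import json
import os
import tempfile

def json_writr(file, json_objects):
    # One JSON object per line (newline-delimited JSON).
    with open(file, mode="w") as f:
        for jsonobj in json_objects:
            f.write(json.dumps(jsonobj) + "\n")

def json_readr(file):
    # Lazily yield one object per line.
    with open(file, mode="r") as f:
        for line in f:
            yield json.loads(line)

objs = [
    {"ID": "12345", "Timestamp": "20140101", "Usefulness": "Yes"},
    {"ID": "1A35B", "Timestamp": "20140102", "Usefulness": "No"},
]

path = os.path.join(tempfile.mkdtemp(), "data.json")
json_writr(path, objs)
lst = list(json_readr(path))
print(lst == objs)  # True
```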

Hope this helps. Sorry if this was a bit verbose.

+8

When you parse the objects, you are dealing with dictionaries. You can retrieve the desired values by doing a key lookup, e.g. value = jsonDictionary['Usefulness'].

You can loop through the JSON objects with a for loop, e.g.:

    for obj in bunchOfObjs:
        value = obj['Usefulness']
        # now do something with your value, e.g. insert into pandas....
0

Added streaming support, based on @dunes' answer:

    import re
    from json import JSONDecoder, JSONDecodeError

    NOT_WHITESPACE = re.compile(r"[^\s]")

    def stream_json(file_obj, buf_size=1024, decoder=JSONDecoder()):
        buf = ""
        ex = None
        while True:
            block = file_obj.read(buf_size)
            if not block:
                break
            buf += block
            pos = 0
            while True:
                match = NOT_WHITESPACE.search(buf, pos)
                if not match:
                    break
                pos = match.start()
                try:
                    obj, pos = decoder.raw_decode(buf, pos)
                except JSONDecodeError as e:
                    ex = e
                    break
                else:
                    ex = None
                    yield obj
            buf = buf[pos:]
        if ex is not None:
            raise ex
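To exercise the streaming version without a real file, io.StringIO works as the file object. A self-contained sketch (the function is repeated so the snippet runs on its own, and the input string is made up; a tiny buf_size forces objects to straddle read boundaries):

```python
import io
import re
from json import JSONDecoder, JSONDecodeError

NOT_WHITESPACE = re.compile(r"[^\s]")

def stream_json(file_obj, buf_size=1024, decoder=JSONDecoder()):
    buf = ""
    ex = None
    while True:
        block = file_obj.read(buf_size)
        if not block:
            break
        buf += block
        pos = 0
        while True:
            match = NOT_WHITESPACE.search(buf, pos)
            if not match:
                break
            pos = match.start()
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except JSONDecodeError as e:
                ex = e  # likely an incomplete object; wait for more data
                break
            else:
                ex = None
                yield obj
        buf = buf[pos:]
    if ex is not None:
        raise ex

# Objects are split across 4-byte reads, yet parse correctly.
src = io.StringIO('{"a": 1} {"b": [2, 3]}  {"c": "x"}')
objs = list(stream_json(src, buf_size=4))
print(objs)  # [{'a': 1}, {'b': [2, 3]}, {'c': 'x'}]
```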
0

Source: https://habr.com/ru/post/980878/

