Python: Bad JSON - Keys Not Quoted

I am scraping some JSONP dictionaries from AWS (from JavaScript files). After parsing the raw data to extract only the JSON-like part, in some cases I get valid JSON and can load it in Python (json_data = json.loads(json_like_data)). However, some of Amazon's JSONP payloads do not put quotation marks around their keys (see below).

 ... {type:"storageCurrentGen",sizes: [{size:"i2.xlarge",vCPU:"4",ECU:"14",memoryGiB:"30.5",storageGB:"1 x 800 SSD",valueColumns:[{name:"linux",prices:{USD:"0.938"}}]}, {size:"i2.2xlarge",vCPU:"8",ECU:"27",memoryGiB:"61",storageGB:"2 x 800 SSD",valueColumns:[{name:"linux",prices:{USD:"1.876"}}]}, {size:"i2.4xlarge",vCPU:"16",ECU:"53",memoryGiB:"122",storageGB:"4 x 800 SSD",valueColumns:[{name:"linux",prices:{USD:"3.751"}}]}, ... 

As JSONP this still works, because it is valid JavaScript syntax. However, Python's json.loads(json_str) fails, because it is not valid JSON.
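A minimal sketch of that failure, using a cut-down, made-up payload in the same shape as the AWS data above. The helper name try_load is hypothetical and only there to keep the example short:

```python
import json

# Cut-down stand-ins for the AWS payload: same shape, one entry each.
unquoted = '{size:"i2.xlarge",vCPU:"4"}'      # JavaScript-style keys
quoted = '{"size":"i2.xlarge","vCPU":"4"}'    # strict JSON

def try_load(text):
    """Return the parsed object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

print(try_load(unquoted))  # None: the parser expects a quoted property name
print(try_load(quoted))
```

The only difference between the two strings is the quoting of the keys, which is enough for the standard library parser to reject the first one.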

Another Python module, yaml, can handle unquoted keys, but it requires a space after the colon (:).

I believe that I have two options.

  • Somehow wrap the characters between the opening curly bracket or comma ({ or ,) and the colon (:) in quotes. Then use json.loads(...).
  • Add a space after the colon (:). Then parse with yaml.load(...).

I think option 2 is better than option 1. However, I am looking for suggestions for a better solution.

Has anyone come across invalid JSON like this before and used Python to parse it?

+6
3 answers

What you have is an HJSON document, so you can use the hjson project to parse it:

    >>> import hjson
    >>> hjson.loads('{javascript_style:"Look ma, no quotes!"}')
    OrderedDict([('javascript_style', 'Look ma, no quotes!')])

HJSON is JSON without the requirement to quote object names, or even certain string values, with added support for comments and multi-line strings, as well as relaxed rules on where commas must be used (including not using commas at all).

Or you can install and use the demjson library; it supports parsing valid JavaScript (with the quotes missing):

    import demjson
    result = demjson.decode(jsonp_payload)

Only when you set the strict=True flag does demjson refuse to parse your input:

    >>> import demjson
    >>> demjson.decode('{javascript_style:"Look ma, no quotes!"}')
    {u'javascript_style': u'Look ma, no quotes!'}
    >>> demjson.decode('{javascript_style:"Look ma, no quotes!"}', strict=True)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/demjson.py", line 5701, in decode
        return_stats=(return_stats or write_stats) )
      File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/demjson.py", line 4917, in decode
        raise errors[0]
    demjson.JSONDecodeError: ('JSON does not allow identifiers to be used as strings', u'javascript_style')

Using a regular expression, you can try to turn the input into valid JSON; however, this can lead to false positives. The pattern would be:

    import re
    valid_json = re.sub(r'(?<={|,)([a-zA-Z][a-zA-Z0-9]*)(?=:)', r'"\1"', jsonp_payload)

This matches a { or , followed by a JavaScript identifier (a letter followed by more letters or digits), with a : colon immediately after it. If your quoted values contain any such patterns, you will end up with invalid JSON.
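To make the false-positive risk concrete, here is a small sketch built on the same idea. The lookbehind is written as the character class [{,], which matches the same two characters; quote_keys and both sample strings are made up for illustration:

```python
import json
import re

# Same idea as the pattern above; the lookbehind is spelled as a character class.
KEY_PATTERN = re.compile(r'(?<=[{,])([a-zA-Z][a-zA-Z0-9]*)(?=:)')

def quote_keys(text):
    """Wrap bare identifiers that directly follow { or , and precede : in quotes."""
    return KEY_PATTERN.sub(r'"\1"', text)

# Well-behaved input: every key is quoted and the result loads as JSON.
clean = '{size:"i2.xlarge",prices:{USD:"0.938"}}'
print(json.loads(quote_keys(clean)))

# False positive: the string value itself contains ",a:", which looks like a key.
tricky = '{note:"ratio,a:b"}'
print(quote_keys(tricky))  # {"note":"ratio,"a":b"}  (no longer valid JSON)
```

The regex has no notion of being inside a string literal, so it is only safe when you know the values cannot contain a comma or brace followed by an identifier and a colon.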

+18

You can also do this (in this particular case) with a simple regex:

    import re

    ll = '{type:"storageCurrentGen",sizes:\n[{size:"i2.xlarge",vCPU:"4",ECU:"14",memoryGiB:"30.5",storageGB:"1 x 800 SSD",valueColumns:[{name:"linux",prices:{USD:"0.938"}}]},\n{size:"i2.2xlarge",vCPU:"8",ECU:"27",memoryGiB:"61",storageGB:"2 x 800 SSD",valueColumns:[{name:"linux",prices:{USD:"1.876"}}]},\n{size:"i2.4xlarge",vCPU:"16",ECU:"53",memoryGiB:"122",storageGB:"4 x 800 SSD",valueColumns:[{name:"linux",prices:{USD:"3.751"}}]},'
    # Wrap any run of word characters that sits between two delimiters in quotes.
    ll_patched = re.sub(r'([{,:])(\w+)([},:])', r'\1"\2"\3', ll)

    >>> ll_patched
    '{"type":"storageCurrentGen","sizes":\n[{"size":"i2.xlarge","vCPU":"4","ECU":"14","memoryGiB":"30.5","storageGB":"1 x 800 SSD","valueColumns":[{"name":"linux","prices":{"USD":"0.938"}}]},\n{"size":"i2.2xlarge","vCPU":"8","ECU":"27","memoryGiB":"61","storageGB":"2 x 800 SSD","valueColumns":[{"name":"linux","prices":{"USD":"1.876"}}]},\n{"size":"i2.4xlarge","vCPU":"16","ECU":"53","memoryGiB":"122","storageGB":"4 x 800 SSD","valueColumns":[{"name":"linux","prices":{"USD":"3.751"}}]},'
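The "in this particular case" caveat matters, because this pattern quotes any run of word characters that sits between two of the delimiters, not only keys. A small sketch with a made-up input (patch is a hypothetical wrapper around the same re.sub call) shows a bare array element being turned into a string:

```python
import json
import re

def patch(text):
    """Apply the same substitution as above to an arbitrary string."""
    return re.sub(r'([{,:])(\w+)([},:])', r'\1"\2"\3', text)

# Made-up input with an unquoted key and a list of bare numbers.
sample = '{a:[1,2,3]}'
patched = patch(sample)

print(patched)              # {"a":[1,"2",3]}
print(json.loads(patched))  # {'a': [1, '2', 3]}
```

The result is still valid JSON, but the middle element has silently changed type. The AWS sample avoids this only because every value in it is already a quoted string.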
+5
    import demjson
    result = demjson.decode(' { key: "value" }')

Works like a charm. Enjoy.

0

Source: https://habr.com/ru/post/1240700/

