Python: parsing JSON-like JavaScript data structures (with consecutive commas)

I would like to parse JSON-like strings. Their only difference from regular JSON is the presence of contiguous commas in arrays. Two such commas in a row implicitly mean that null must be inserted between them. Example:

    JSON-like:        ["foo",,,"bar",[1,,3,4]]
    JavaScript:       ["foo",null,null,"bar",[1,null,3,4]]
    Decoded (Python): ["foo", None, None, "bar", [1, None, 3, 4]]

The native json.JSONDecoder class does not allow me to change how arrays are parsed. I can only modify the parsing of objects (dicts), ints, floats and strings (by passing functions as kwargs to JSONDecoder(); see the documentation).

So, does this mean that I have to write a JSON parser from scratch? The Python json source code is available, but it's pretty messy. I would prefer to reuse its internals instead of duplicating its code!

+1
5 answers

Since what you are trying to parse is not JSON per se, but rather a different language that is very similar to JSON, you may need your own parser.

Fortunately, it is not as difficult as it sounds. You can use a Python parser generator, for example pyparsing. JSON can be fully specified with a fairly simple context-free grammar (I found one here), so you should be able to modify it to suit your needs.

+5

A small and simple workaround for testing:

  1. Convert the JSON-like data to a string.
  2. Replace ",," with ",null,".
  3. Convert it back to whatever representation you need.
  4. Let JSONDecoder() do the heavy lifting.

Steps 1 and 3 can be omitted if you are already dealing with strings.

(And if converting to a string is not practical, update your question with this information!)
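The steps above can be sketched as follows. This is a minimal sketch, assuming the input is already a string, that the commas are directly adjacent (no whitespace between them), and that no string literal contains ",,". The function name loads_with_gaps is made up for illustration.

```python
import json

def loads_with_gaps(text):
    """Hypothetical helper: fill ',,' gaps with null, then decode as JSON."""
    # A single replace() pass turns ',,,' into ',null,,', so repeat the
    # replacement until no adjacent commas remain.
    while ',,' in text:
        text = text.replace(',,', ',null,')
    return json.loads(text)

print(loads_with_gaps('["foo",,,"bar",[1,,3,4]]'))
```

The loop matters because the replacement itself leaves a comma that can form a new pair with the next one.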

+3

You can do the comma replacement from Lattyware's and przemo_li's answers in a single pass by using a lookbehind expression, i.e. "replace every comma that is directly preceded by a comma":

    >>> import re
    >>> s = '["foo",,,"bar",[1,,3,4]]'
    >>> re.sub(r'(?<=,)\s*,', ' null,', s)
    '["foo", null, null,"bar",[1, null,3,4]]'

Note that this will work for small things, where you can assume that, for example, there are no consecutive commas in string literals. In general, regular expressions are not enough to solve this problem, and Taimon's approach using a real parser is the only completely correct solution.
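One way to relax the string-literal caveat without a full parser is to let the regular expression match string literals as well and copy them through unchanged. This is only a sketch under assumptions: strings are double-quoted with backslash escapes, and leading or trailing commas such as [,1] or [1,] are not handled. The names _TOKEN and fill_elisions are invented for this example.

```python
import json
import re

# Either a complete double-quoted string literal (group 1), or a comma
# that is followed (after optional whitespace) by another comma.
_TOKEN = re.compile(r'("(?:[^"\\]|\\.)*")|,(?=\s*,)')

def fill_elisions(text):
    """Insert null after every comma directly followed by another comma."""
    def repl(match):
        if match.group(1) is not None:
            return match.group(1)   # string literal: copy through untouched
        return ',null'              # gap between two commas
    return _TOKEN.sub(repl, text)

print(json.loads(fill_elisions('["a,,b",,,"bar",[1,,3,4]]')))
```

Because string literals are consumed as whole matches, commas inside them are never seen by the second alternative.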

+2

This is a hackish way to do it, but one solution is to simply do some string modification on the JSON-ish data to get it into shape before it is parsed.

    import re
    import json

    not_quite_json = '["foo",,,"bar",[1,,3,4]]'
    not_json = True
    while not_json:
        not_quite_json, not_json = re.subn(r',\s*,', ', null, ', not_quite_json)

Which leaves us with:

 '["foo", null, null, "bar",[1, null, 3,4]]' 

Then we can do:

 json.loads(not_quite_json) 

Giving us:

 ['foo', None, None, 'bar', [1, None, 3, 4]] 

Note that this is not as simple as a single replace, because each replacement also inserts commas that may themselves need replacing. So you have to loop until no more replacements are made (re.subn returns the number of substitutions, which is 0 on the final pass). Here I used a simple regular expression to do the job.

+1

I looked at Taimon's recommendation, pyparsing, and I successfully hacked the example provided here to suit my needs. It works well for emulating JavaScript's eval(), but it fails in one situation: trailing commas. A trailing comma should be optional (see the tests below), but I cannot find the right way to implement this.

    from pyparsing import *

    TRUE = Keyword("true").setParseAction(replaceWith(True))
    FALSE = Keyword("false").setParseAction(replaceWith(False))
    NULL = Keyword("null").setParseAction(replaceWith(None))

    jsonString = dblQuotedString.setParseAction(removeQuotes)
    jsonNumber = Combine(Optional('-') + ('0' | Word('123456789', nums)) +
                         Optional('.' + Word(nums)) +
                         Optional(Word('eE', exact=1) + Word(nums + '+-', nums)))

    jsonObject = Forward()
    jsonValue = Forward()

    # black magic begins
    commaToNull = Word(',,', exact=1).setParseAction(replaceWith(None))
    jsonElements = (ZeroOrMore(commaToNull) + Optional(jsonValue) +
                    ZeroOrMore((Suppress(',') + jsonValue) | commaToNull))
    # black magic ends

    jsonArray = Group(Suppress('[') + Optional(jsonElements) + Suppress(']'))
    jsonValue << (jsonString | jsonNumber | Group(jsonObject) | jsonArray |
                  TRUE | FALSE | NULL)

    memberDef = Group(jsonString + Suppress(':') + jsonValue)
    jsonMembers = delimitedList(memberDef)
    jsonObject << Dict(Suppress('{') + Optional(jsonMembers) + Suppress('}'))

    jsonComment = cppStyleComment
    jsonObject.ignore(jsonComment)

    def convertNumbers(s, l, toks):
        n = toks[0]
        try:
            return int(n)
        except ValueError:
            return float(n)

    jsonNumber.setParseAction(convertNumbers)

    def test():
        tests = (
            '[1,2]',       # ok
            '[,]',         # ok
            '[,,]',        # ok
            '[ , , , ]',   # ok
            '[,1]',        # ok
            '[,,1]',       # ok
            '[1,,2]',      # ok
            '[1,]',        # failure, I got [1, None], I should have [1]
            '[1,,]',       # failure, I got [1, None, None], I should have [1, None]
        )
        for test in tests:
            results = jsonArray.parseString(test)
            print(results.asList())

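For comparison, the rule the failing tests expect (a single trailing comma adds no element, while every other gap becomes null) can be sketched with the standard library alone. This is not a fix for the pyparsing grammar above. It is a separate token-rewriting sketch that turns the text into strict JSON and then hands it to json.loads. The names _TOKEN, js_array_to_json and loads_js are made up for illustration, and the sketch assumes double-quoted strings and no comments.

```python
import json
import re

# Tokens: a string literal | one structural character | a run of anything
# else (numbers, true/false/null). Whitespace between tokens is dropped.
_TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[\[\]{}:,]|[^\s"\[\]{}:,]+')

def js_array_to_json(text):
    """Rewrite array gaps as null and drop a single trailing comma."""
    out = []
    for tok in _TOKEN.findall(text):
        prev = out[-1] if out else None
        if tok == ',' and prev in ('[', ','):
            out.append('null')   # nothing before this comma: elided element
        elif tok == ']' and prev == ',':
            out.pop()            # trailing comma before ']' adds no element
        out.append(tok)
    return ''.join(out)

def loads_js(text):
    return json.loads(js_array_to_json(text))

for sample in ('[1,2]', '[,]', '[,,]', '[,1]', '[1,,2]', '[1,]', '[1,,]'):
    print(sample, '->', loads_js(sample))
```

Tracking only the previous token is enough here because a gap can only occur right after "[" or right after another comma. Anything the sketch does not understand is left for json.loads to reject.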
+1

Source: https://habr.com/ru/post/970290/

