How to split comma-separated key-value pairs when values contain quoted commas

I know there are many other questions about parsing comma-separated values, but I could not find one that splits key-value pairs and handles quoted commas.

I have lines like this:

age=12,name=bob,hobbies="games,reading",phrase="I'm cool!" 

And I want to get the following:

 { 'age': '12', 'name': 'bob', 'hobbies': 'games,reading', 'phrase': "I'm cool!", } 

I tried using shlex as follows:

    import shlex

    lexer = shlex.shlex('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"''')
    lexer.whitespace_split = True
    lexer.whitespace = ','
    props = dict(pair.split('=', 1) for pair in lexer)

The problem is that shlex splits the hobbies entry into two tokens, i.e. hobbies="games and reading". Is there a way to make it respect the double quotes? Or is there another module I can use?
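To make the failure concrete, printing the raw tokens with the same setup shows where the split goes wrong (the token list below is what I understand the non-POSIX lexer to produce):

    import shlex

    lexer = shlex.shlex('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"''')
    lexer.whitespace_split = True
    lexer.whitespace = ','
    # The quoted comma is not protected, so hobbies is split across two tokens, roughly:
    # ['age=12', 'name=bob', 'hobbies="games', 'reading"', 'phrase="I\'m cool!"']
    print(list(lexer))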

EDIT: Fixed typo for whitespace_split

EDIT 2: I am not tied to using shlex. Regex is fine too, but I did not know how to handle the matching quotes.

+5
5 answers

You just need to use the shlex lexer in POSIX mode.

Add posix=True when creating the lexer.

(See the shlex parsing rules.)

    import shlex

    lexer = shlex.shlex('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"''', posix=True)
    lexer.whitespace_split = True
    lexer.whitespace = ','
    props = dict(pair.split('=', 1) for pair in lexer)

Outputs:

 {'age': '12', 'phrase': "I'm cool!", 'hobbies': 'games,reading', 'name': 'bob'} 

PS: A regular expression will not be able to parse the key-value pairs if = or , can also appear inside the values. Even with string preprocessing it could not be done, because such input cannot be formally described as a regular language.
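For example (a made-up line, just for illustration), a quoted value containing both = and , comes through intact with the POSIX lexer:

    import shlex

    tricky = '''query="a=1,b=2",age=12'''
    lexer = shlex.shlex(tricky, posix=True)
    lexer.whitespace_split = True
    lexer.whitespace = ','
    print(dict(pair.split('=', 1) for pair in lexer))
    # {'query': 'a=1,b=2', 'age': '12'}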

+5

This can be done with a regex, and in this case it may even be the best option. I think this will work with most input, even escaped quotes such as: phrase='I\'m cool'

With the VERBOSE flag, even a fairly complex regular expression stays readable.

    import re

    text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''

    regex = re.compile(
        r'''
        (?P<key>\w+)=        # Key consists of only alphanumerics
        (?P<quote>["']?)     # Optional quote character.
        (?P<value>.*?)       # Value is a non greedy match
        (?P=quote)           # Closing quote equals the first.
        ($|,)                # Entry ends with comma or end of string
        ''',
        re.VERBOSE
    )

    d = {match.group('key'): match.group('value') for match in regex.finditer(text)}
    print(d)
    # {'name': 'bob', 'phrase': "I'm cool!", 'age': '12', 'hobbies': 'games,reading'}
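A quick check of the escaped-quote case mentioned above, continuing from the snippet (a hypothetical input; note that the backslash of the escape stays in the captured value, so it may need stripping afterwards):

    test = "phrase='I\\'m cool'"
    print({m.group('key'): m.group('value') for m in regex.finditer(test)})
    # The match succeeds, but the value still contains the backslash: I\'m cool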
+4

You can abuse the Python tokenizer to parse the key-value list:

    #!/usr/bin/env python
    from tokenize import generate_tokens, NAME, NUMBER, OP, STRING, ENDMARKER

    def parse_key_value_list(text):
        key = value = None
        for type, string, _, _, _ in generate_tokens(lambda it=iter([text]): next(it)):
            if type == NAME and key is None:
                key = string  # the first NAME token starts a new pair
            elif type in {NAME, NUMBER, STRING}:
                # Convert the token: strip quotes from strings, turn numbers into
                # ints, keep bare words as-is.
                value = {NAME: lambda x: x,
                         NUMBER: int,
                         STRING: lambda x: x[1:-1]}[type](string)
            elif ((type == OP and string == ',') or
                  (type == ENDMARKER and key is not None)):
                # A ',' or the end of input closes the current pair.
                yield key, value
                key = value = None

    text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
    print(dict(parse_key_value_list(text)))

Output

 {'phrase': "I'm cool!", 'age': 12, 'name': 'bob', 'hobbies': 'games,reading'} 

You can use a finite state machine (FSM) to implement a more robust parser. The parser uses only the current state and the next token to parse the input:

    #!/usr/bin/env python
    from tokenize import (generate_tokens, NAME, NUMBER, OP, STRING, ENDMARKER,
                          NEWLINE, NL)

    def parse_key_value_list(text):
        def check(condition):
            if not condition:
                raise ValueError((state, token))

        KEY, EQ, VALUE, SEP = range(4)
        state = KEY
        for token in generate_tokens(lambda it=iter([text]): next(it)):
            type, string = token[:2]
            if type in {NEWLINE, NL}:
                # Newer Pythons emit an implicit NEWLINE at the end of the input;
                # layout tokens carry no information here, so skip them.
                continue
            if state == KEY:
                check(type == NAME)
                key = string
                state = EQ
            elif state == EQ:
                check(type == OP and string == '=')
                state = VALUE
            elif state == VALUE:
                check(type in {NAME, NUMBER, STRING})
                value = {NAME: lambda x: x,
                         NUMBER: int,
                         STRING: lambda x: x[1:-1]}[type](string)
                state = SEP
            elif state == SEP:
                check(type == OP and string == ',' or type == ENDMARKER)
                yield key, value
                state = KEY

    text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
    print(dict(parse_key_value_list(text)))
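To illustrate the stricter checking, a made-up malformed line (missing value after age=) is rejected instead of being silently mis-parsed:

    # Hypothetical malformed input: no value between '=' and ','.
    try:
        print(dict(parse_key_value_list('age=,name=bob')))
    except ValueError as exc:
        print('rejected:', exc)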
+2

Well, I actually figured out a fairly elegant way, which is to split on both the comma and the equals sign, and then take two tokens at a time.

    import shlex

    input_str = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
    lexer = shlex.shlex(input_str)
    lexer.whitespace_split = True
    lexer.whitespace = ',='
    ret = {}
    try:
        while True:
            key = next(lexer)
            value = next(lexer)
            # Remove surrounding quotes
            if len(value) >= 2 and (value[0] == value[-1] == '"' or
                                    value[0] == value[-1] == '\''):
                value = value[1:-1]
            ret[key] = value
    except StopIteration:
        # Somehow do error checking to see if you ended up with an extra token.
        pass
    print(ret)

Then you will get:

 { 'age': '12', 'name': 'bob', 'hobbies': 'games,reading', 'phrase': "I'm cool!", } 

However, this does not verify that you don't have strange input such as age,12=name,bob, but I am fine with that in my use case.
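For example (a quick check of that caveat), such a line splits into the same four tokens as a well-formed one, so it still parses "successfully":

    import shlex

    weird = 'age,12=name,bob'
    lexer = shlex.shlex(weird)
    lexer.whitespace_split = True
    lexer.whitespace = ',='
    print(list(lexer))  # ['age', '12', 'name', 'bob'] -> parsed as {'age': '12', 'name': 'bob'}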

EDIT: handle both double quotes and single quotes.

+1

Python offers many ways to solve a problem. Here is a more C-like approach that processes the string one character at a time. It would be interesting to compare the runtimes.

    text = 'age=12,name=bob,hobbies="games,reading",phrase="I\'m cool!"'

    key = ""
    val = ""
    result = {}
    parse_string = False   # are we inside double quotes?
    parse_key = True       # are we currently reading a key (before the '=')?

    for c in text:
        if c == '"' and not parse_string:
            parse_string = True
            continue
        elif c == '"' and parse_string:
            parse_string = False
            continue
        if parse_string:
            val += c
            continue
        if c == ',':
            # terminate entry
            result[key] = val  # add to dict
            key = ""
            val = ""
            parse_key = True
            continue
        elif c == '=' and parse_key:
            parse_key = False
        elif parse_key:
            key += c
        else:
            val += c

    result[key] = val
    print(result)
    # {'phrase': "I'm cool!", 'age': '12', 'name': 'bob', 'hobbies': 'games,reading'}

demo: http://repl.it/6oC/1
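Out of curiosity about the runtimes mentioned above, here is a rough, hypothetical timing sketch comparing this character loop (wrapped in a function) with the shlex/POSIX approach from the accepted answer; the numbers will of course vary by machine and input:

    import shlex
    import timeit

    LINE = 'age=12,name=bob,hobbies="games,reading",phrase="I\'m cool!"'

    def parse_shlex(line):
        # The shlex/POSIX approach from the accepted answer.
        lexer = shlex.shlex(line, posix=True)
        lexer.whitespace_split = True
        lexer.whitespace = ','
        return dict(pair.split('=', 1) for pair in lexer)

    def parse_chars(line):
        # The character-by-character loop above, wrapped in a function.
        result, key, val = {}, "", ""
        in_string, parse_key = False, True
        for c in line:
            if c == '"':
                in_string = not in_string
            elif in_string:
                val += c
            elif c == ',':
                result[key] = val
                key, val, parse_key = "", "", True
            elif c == '=' and parse_key:
                parse_key = False
            elif parse_key:
                key += c
            else:
                val += c
        result[key] = val
        return result

    assert parse_shlex(LINE) == parse_chars(LINE)
    print('shlex:', timeit.timeit(lambda: parse_shlex(LINE), number=10000))
    print('chars:', timeit.timeit(lambda: parse_chars(LINE), number=10000))

My guess is that the plain loop wins on short lines, since each parse_shlex call builds a new lexer object, but that is only a guess until measured.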

0
