Split a string on a delimiter only if it is not wrapped in a specific pattern

Question


I'm trying to split a string into a list using a separator (say, a comma), but the separator should count as a separator only if it is not wrapped in a specific pattern, in my case <>. In other words, when a comma is nested inside <>, it is ignored as a separator and becomes just a regular character that should not be split on.

So, if I have the following string:

 "first token, <second token part 1, second token part 2>, third token"

it should be split into

    list[0] = "first token"
    list[1] = "second token part 1, second token part 2"
    list[2] = "third token"

Needless to say, I can't just do a simple split on , because that would split the second token into two tokens, second token part 1 and second token part 2, since they have a comma between them.

How should I define the pattern for this using Python regex?

+6

python regex

amphibient Nov 21 '13 at 18:03


2 answers

One way that works for your example is to translate < and > to " and then parse the string as CSV:

    import csv

    s = "first token, <second token part 1, second token part 2>, third token"
    # Turn both angle brackets into double quotes so csv treats them as quoting.
    # (Python 3; on Python 2 use string.maketrans and the print statement.)
    a = s.translate(str.maketrans('<>', '""'))
    # a == 'first token, "second token part 1, second token part 2", third token'
    print(next(csv.reader([a], skipinitialspace=True)))
    # ['first token', 'second token part 1, second token part 2', 'third token']

+5

Jon Clements Nov 21 '13 at 18:12


Tim Pietzcker · Accepted Answer · 2013-11-21T18:13:17+0000

Update: Since you mentioned that the brackets can be nested, I regret to inform you that a regex solution is not possible with Python's standard re module. The following only works if angle brackets are always balanced and never nested or escaped:

    >>> import re
    >>> s = "first token, <second token part 1, second token part 2>, third token"
    >>> regex = re.compile(",(?![^<>]*>)")
    >>> regex.split(s)
    ['first token', ' <second token part 1, second token part 2>', ' third token']
    >>> [item.strip(" <>") for item in _]
    ['first token', 'second token part 1, second token part 2', 'third token']

The regular expression ,(?![^<>]*>) splits on a comma only if the next angle bracket that follows it is not a closing bracket.
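To see which commas the negative lookahead accepts, the same pattern can be written in verbose mode and probed with re.finditer. This is only an illustrative sketch; the names SPLIT_COMMA and split_positions are made up here and are not part of the answer.

```python
import re

# Same pattern as ",(?![^<>]*>)", spelled out with re.VERBOSE.
# SPLIT_COMMA and split_positions are illustrative names only.
SPLIT_COMMA = re.compile(r"""
    ,            # a comma ...
    (?!          # ... that is NOT followed by
        [^<>]*   #     a run of non-bracket characters
        >        #     and then a closing bracket
    )
""", re.VERBOSE)


def split_positions(text):
    """Return the indices of the commas the pattern treats as separators."""
    return [m.start() for m in SPLIT_COMMA.finditer(text)]


s = "first token, <second token part 1, second token part 2>, third token"
print(split_positions(s))

# With nested brackets the lookahead is fooled: the comma after "b" is
# followed by "<" before any ">", so it is wrongly treated as a separator.
print(SPLIT_COMMA.split("a, <b, <c, d>, e>, f"))
```

For the flat example only the first and last commas are reported; the comma inside <> is skipped because the next bracket after it is a closing one.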

Nested brackets prevent this or any other regex from working in Python. You would need a language that supports recursive regexes (such as Perl or .NET), or use a parser.
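As a rough sketch of the parser route, a single pass with a depth counter copes with nested brackets. The function names split_outside_brackets and strip_outer are hypothetical, and the sketch assumes that brackets are balanced and that each item is either bare or wrapped entirely in one outer pair.

```python
def strip_outer(item, open_ch, close_ch):
    """Drop one enclosing bracket pair if the item starts and ends with one."""
    if len(item) >= 2 and item[0] == open_ch and item[-1] == close_ch:
        return item[1:-1]
    return item


def split_outside_brackets(text, sep=",", open_ch="<", close_ch=">"):
    """Split text on sep, ignoring separators nested inside brackets."""
    items, current, depth = [], [], 0
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced closing bracket")
        if ch == sep and depth == 0:
            # A top-level separator ends the current item.
            items.append("".join(current))
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unbalanced opening bracket")
    items.append("".join(current))
    # Trim surrounding whitespace, then remove the outermost bracket pair.
    return [strip_outer(item.strip(), open_ch, close_ch) for item in items]


s = "first token, <second token part 1, second token part 2>, third token"
print(split_outside_brackets(s))
print(split_outside_brackets("a, <b, <c, d>, e>, f"))
```

Tracking depth rather than matching text is what lets this handle nesting: a comma is a separator only while the counter is zero, however many brackets were opened and closed before it.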
