Context sensitive line split in python

Question


Sorry if this is a duplicate, but a fairly thorough search of the intertubes did not turn up anything useful.

I have a string from a (chemical) database where the delimiter (a comma) sometimes also appears inside the items I want to separate. Example string:

s = '2-Methyl-3-phythyl-1,4-naphthochinon,Vitamin, K1,Antihemorrhagic vitamin'

The correct split in this case would give

 splitS = ['2-Methyl-3-phythyl-1,4-naphthochinon', 'Vitamin, K1', 'Antihemorrhagic vitamin']

I believe the most accurate way to do this is to split on commas that have no space next to them and that are not surrounded by two digits. This would leave cases like "1,4" and "Vitamin, K1" intact, but split the string into the correct 3 chemical names.

I tried using regular expressions without success. I can post some of what I tried, but it is nearly useless. Any help is much appreciated.

EDIT: I should have included this initially. With some of my own hacks, and with a more elegant solution from @Borealid, I can correctly identify the places to split, but I get an ugly result, for example

    >>> s = '2-Methyl-3-phythyl-1,4-naphthochinon,Vitamin, K1,Antihemorrhagic vitamin'
    >>> pat = re.compile("([^\d\s],[^\d\s])|([^\s],[^\d\s])|([^\d\s],[^\s])")
    >>> re.split(pat, s)
    ['2-Methyl-3-phythyl-1,4-naphthochino', 'n,V', None, None, 'itamin, K', None, '1,A', None, 'ntihemorrhagic vitamin']

It seems like there should be a way to first determine which commas to split on, and then split on the comma alone so that the names are not damaged.

Thanks again

+4

python string regex

theFuriousNoob Jan 27 '12 at 0:09


3 answers

Something like ([^\d\s],[^\d\s])|([^\s],[^\d\s])|([^\d\s],[^\s]) ?

A comma with ((a character that is neither a digit nor whitespace on both sides) or (any non-whitespace character before it and a non-digit, non-whitespace character after it) or (a non-digit, non-whitespace character before it and any non-whitespace character after it)).

In all cases, there is no whitespace next to the comma, and a comma sitting between two digits is never matched.

\d is a digit. \s is whitespace. [] is a character class, and [^] is a negated character class ("matches any character that is not in the following contents").

It will not split on a comma in the very first or last position of the string, but I do not think this will be a problem.
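
As a rough sketch (not part of the original answer): the damaged pieces in the question come from two things. The character classes consume the characters on either side of the comma, and capturing groups make re.split return the captured text (or None for groups that did not take part in the match). Assuming the same three conditions are what you want, they can be moved into lookbehind and lookahead assertions so that only the comma itself is matched:

```python
import re

s = '2-Methyl-3-phythyl-1,4-naphthochinon,Vitamin, K1,Antihemorrhagic vitamin'

# Same three alternatives as above, but the neighbours are only asserted
# by lookarounds, so they are not consumed and nothing is captured.
pat = re.compile(
    r'(?<=[^\d\s]),(?=[^\d\s])'   # non-digit, non-whitespace on both sides
    r'|(?<=\S),(?=[^\d\s])'       # non-whitespace before, non-digit/non-whitespace after
    r'|(?<=[^\d\s]),(?=\S)'       # non-digit/non-whitespace before, non-whitespace after
)

parts = pat.split(s)
print(parts)
```

For the example string this should give the three intact names; other inputs from the database would need checking separately.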

0

Borealid Jan 27 '12 at 0:15


I have a solution, but it is a bit long. Ok, here we go:

 s = '2-Methyl-3-phythyl-1,4-naphthochinon,Vitamin, K1,Antihemorrhagic vitamin'

First we find the positions of all the commas in the string (in all_commas) and the positions of all the special commas, the ones we must not split on (in special_commas):

    import re

    all_commas = [match.start() for match in re.finditer(r',', s)]
    # +1 because each match starts one character before its comma
    special_commas = [match.start()+1 for match in re.finditer(r'\d,\d|.,\s', s)]

Second, we take the set difference of these positions (in split_commas). These are the positions where we are going to split:

 split_commas = set(all_commas) - set(special_commas)

Then we split s at these positions and save the pieces in splitS:

    splitS = []
    start = -1
    for end in sorted(split_commas) + [None]:
        splitS.append(s[start+1:end])
        start = end

Finally, what we get in splitS:

    >>> splitS
    ['2-Methyl-3-phythyl-1,4-naphthochinon', 'Vitamin, K1', 'Antihemorrhagic vitamin']
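
For convenience, the steps above can be sketched as one function. The name split_names is made up for this illustration; the two regular expressions are the ones used above:

```python
import re

def split_names(s):
    """Split s on commas that are neither between two digits
    nor followed by whitespace (hypothetical helper name)."""
    # Positions of commas that must be kept inside a name.
    special = {m.start() + 1 for m in re.finditer(r'\d,\d|.,\s', s)}
    # Positions of the remaining commas, in order.
    cuts = [m.start() for m in re.finditer(',', s) if m.start() not in special]

    parts, start = [], 0
    for end in cuts + [len(s)]:
        parts.append(s[start:end])
        start = end + 1  # skip over the comma itself
    return parts

s = '2-Methyl-3-phythyl-1,4-naphthochinon,Vitamin, K1,Antihemorrhagic vitamin'
print(split_names(s))
```

One caveat: re.finditer returns non-overlapping matches, so in a run like '1,2,3' only the first comma is marked as special and the second one is still split on. If names like that occur in your data, this approach needs adjusting.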

0

juliomalegria Jan 27 '12 at 0:51


Andrew Clark · Accepted Answer · 2012-01-27T00:48:36+0000

You can get this behavior by using lookarounds, so that you only match the commas that fit your description:

 (?<!\d),(?! )|(?<=\d),(?![\d ])

This appears to give the correct behavior for the example string:

    >>> re.split(r'(?<!\d),(?! )|(?<=\d),(?![\d ])', s)
    ['2-Methyl-3-phythyl-1,4-naphthochinon', 'Vitamin, K1', 'Antihemorrhagic vitamin']

Here is an explanation:

    (?<!\d),      # match a comma that is not preceded by a digit...
    (?! )         # ... as long as it is not followed by a space
    |             # OR
    (?<=\d),      # match a comma that is preceded by a digit...
    (?![\d ])     # ... as long as it is not followed by a digit or a space

After writing the explanation, I realized that the (?<=\d) part of the regex is unnecessary: it is already implied by the first alternative having failed to match. That means you can shorten the regex to the following and get the same behavior:

 (?<!\d),(?! )|,(?![\d ])
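
A quick way to convince yourself that the two versions agree is to run both over a few inputs. This is only a sanity check, not a proof, and the short strings in samples are made up for illustration:

```python
import re

s = '2-Methyl-3-phythyl-1,4-naphthochinon,Vitamin, K1,Antihemorrhagic vitamin'

long_pat = re.compile(r'(?<!\d),(?! )|(?<=\d),(?![\d ])')
short_pat = re.compile(r'(?<!\d),(?! )|,(?![\d ])')

# Hypothetical extra inputs covering each digit/letter/space combination.
samples = [s, 'a,b', '1,2', 'a,1', '1,a', 'a, b', '1, 2']

results = [(long_pat.split(x), short_pat.split(x)) for x in samples]
for long_parts, short_parts in results:
    print(long_parts, long_parts == short_parts)
```

Note that both patterns only treat a literal space as "followed by a space"; if tabs or other whitespace can follow a comma in your data, \s may be a better fit than a plain space.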
