Sorry if this is a duplicate, but a fairly deep search of the intertubes did not turn up anything relevant.
I have a row from a (chemical) database where the delimiter (a comma) sometimes also appears inside the elements I want to separate. Example string:
s = '2-Methyl-3-phythyl-1,4-naphthochinon,Vitamin, K1,Antihemorrhagic vitamin'
The correct split in this case would give:
splitS = ['2-Methyl-3-phythyl-1,4-naphthochinon', 'Vitamin, K1', 'Antihemorrhagic vitamin']
I believe the most reliable way to do this is to split on commas that have no space on either side and that are not surrounded by two digits. This would leave cases like "1,4" and "Vitamin, K1" intact, but split the string into the correct 3 chemical names.
I tried using regular expressions without success. I can post some of what I tried, but it is close to useless. Help is much appreciated.
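To make the rule concrete, here is a plain-Python sketch of it with no regex involved. The helper names `is_separator` and `split_names` are made up for illustration, and treating a comma at the very start or end of the string as a separator is my own assumption; I have only checked it against the example above.

```python
def is_separator(s, i):
    """Return True if the comma at index i should split the string."""
    # Neighbouring characters; empty string when the comma is at an edge.
    prev_ch = s[i - 1] if i > 0 else ""
    next_ch = s[i + 1] if i + 1 < len(s) else ""
    # A space on either side means the comma belongs to a name ("Vitamin, K1").
    if prev_ch.isspace() or next_ch.isspace():
        return False
    # Digits on both sides means a locant pair such as "1,4".
    if prev_ch.isdigit() and next_ch.isdigit():
        return False
    return True


def split_names(s):
    """Split s at every comma that is_separator accepts."""
    parts, start = [], 0
    for i, ch in enumerate(s):
        if ch == "," and is_separator(s, i):
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])
    return parts


s = '2-Methyl-3-phythyl-1,4-naphthochinon,Vitamin, K1,Antihemorrhagic vitamin'
print(split_names(s))
```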
EDIT: I should have included this initially. With some of my own hacks, and a more elegant solution from @Borealid, I can correctly identify the places to split, but the result is mangled, for example:
>>> import re
>>> s = '2-Methyl-3-phythyl-1,4-naphthochinon,Vitamin, K1,Antihemorrhagic vitamin'
>>> pat = re.compile("([^\d\s],[^\d\s])|([^\s],[^\d\s])|([^\d\s],[^\s])")
>>> re.split(pat, s)
['2-Methyl-3-phythyl-1,4-naphthochino', 'n,V', None, None, 'itamin, K', None, '1,A', None, 'ntihemorrhagic vitamin']
It seems like there should be a way to first identify the correct commas and then split on the comma alone, so that the neighbouring characters of the names are not consumed.
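A sketch of what I have in mind, assuming the zero-width lookaround assertions in Python's `re` module are the right tool: they test the characters next to the comma without consuming them, and with no capture groups `re.split` returns only the pieces. The name `SEPARATOR` is illustrative, and I have only checked the pattern against the example string.

```python
import re

# Match a comma that has no whitespace on either side and is not
# sitting between two digits. The two alternatives cover
# "previous char is not a digit" and "next char is not a digit".
SEPARATOR = re.compile(
    r"(?<![\s\d]),(?!\s)"    # prev: not space, not digit; next: not space
    r"|(?<!\s),(?![\s\d])"   # prev: not space; next: not space, not digit
)

s = '2-Methyl-3-phythyl-1,4-naphthochinon,Vitamin, K1,Antihemorrhagic vitamin'

# Only the comma itself is matched, so the names stay intact.
parts = SEPARATOR.split(s)
print(parts)
```

Because the pattern contains no capturing parentheses, the matched text and the `None` placeholders from the attempt above do not show up in the result.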
Thanks again