Python regex to split on certain patterns while skipping others

I want to split a Python string on certain patterns, but not on others. For example, I have the string

Joe, Dave, Professional, Ph.D. and Someone else 

I want to split on \sand\s and on , (a comma), but not on the comma in , Ph.D.

How can this be done in Python regex?

+6
3 answers

You can use:

 import re
 re.split(r'\s+and\s+|,(?!\s*Ph\.D\.)\s*', 'Joe, Dave, Professional, Ph.D. and Someone else')

Result:

 ['Joe', 'Dave', 'Professional, Ph.D.', 'Someone else'] 
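
The negative lookahead (?!\s*Ph\.D\.) is what stops the comma before Ph.D. from acting as a delimiter. As a minimal sketch, assuming there may be several such suffixes to protect, the same idea could be generalized as follows. The function name split_names and the protected parameter are hypothetical and not part of the answer above:

```python
import re

def split_names(text, protected=("Ph.D.",)):
    """Split on ' and ' and on commas, except commas followed by a protected suffix.

    Hypothetical helper; assumes `protected` contains at least one suffix.
    """
    # Escape each suffix so its dots are matched literally, then build one alternation.
    guard = "|".join(re.escape(suffix) for suffix in protected)
    pattern = re.compile(r"\s+and\s+|,(?!\s*(?:" + guard + r"))\s*")
    return pattern.split(text)

single = split_names('Joe, Dave, Professional, Ph.D. and Someone else')
several = split_names('Ann, M.D. and Bob, Ph.D., Carl', protected=("Ph.D.", "M.D."))
print(single)
print(several)
```

Only commas directly followed by one of the listed suffixes are protected; any other comma still splits.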
+3

You can do this either with regular expressions or with ordinary string operations (e.g. str.split()).

Here is an example that uses ordinary string operations:

 >>> DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'
 >>> IGNORE_THESE = frozenset([',', 'and'])
 >>> PRUNED_DATA = [d.strip(',') for d in DATA.split(' ') if d not in IGNORE_THESE]
 >>> print(PRUNED_DATA)
 ['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone', 'else']

I'm sure there is some complicated regular expression you could use instead, but this seems very straightforward to me and easy to maintain.
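
Note that the snippet above splits on every space, so "Someone else" and "Professional, Ph.D." do not stay together. As a rough sketch, assuming the only comma that must be kept is the one before a known suffix, plain string operations could still produce the grouping the question asks for. The name split_plain and the suffixes parameter are hypothetical:

```python
def split_plain(text, suffixes=("Ph.D.",)):
    """Split on ' and ' and ',' with str methods, re-attaching known suffixes."""
    parts = []
    for chunk in text.split(" and "):
        for piece in chunk.split(","):
            piece = piece.strip()
            if not piece:
                continue  # skip empty fragments such as those from ',,'
            if piece in suffixes and parts:
                # Glue the suffix back onto the preceding item.
                parts[-1] = parts[-1] + ", " + piece
            else:
                parts.append(piece)
    return parts

plain_result = split_plain('Joe, Dave, Professional, Ph.D. and Someone else')
print(plain_result)
```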

I hope you are not trying to parse natural language; for that I would use a dedicated library such as NLTK.

0
 import re
 DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'
 regx = re.compile(r'\s*(?:,|and)\s*')
 print(regx.split(DATA))

Result:

 ['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone else'] 

Where is the difficulty?

Note that with the non-capturing group (?:,|and) the delimiters do not appear in the result, whereas with the capturing group (,|and) the result would be

 ['Joe', ',', 'Dave', ',', 'Professional', ',', 'Ph.D.', 'and', 'Someone else'] 
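
The difference comes from how re.split treats capturing groups: the text matched by a capturing group is returned as part of the list. A small side-by-side sketch (the variable names are only illustrative):

```python
import re

DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'

# Non-capturing group: the delimiters are dropped from the result.
without_groups = re.split(r'\s*(?:,|and)\s*', DATA)

# Capturing group: the text matched by the group is kept between the pieces.
with_groups = re.split(r'\s*(,|and)\s*', DATA)

print(without_groups)
print(with_groups)
```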

Edit 1

Errr... the difficulty is that with

 DATA = 'Joe, Dave, Professional, Handicaped, Ph.D. and Someone else' 

the result is

 ['Joe', 'Dave', 'Professional', 'H', 'icaped', 'Ph.D.', 'Someone else'] 


Fixed:

 regx = re.compile(r'\s+and\s+|\s*,\s*')
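
To make the failure and the fix concrete, here is a short comparison on the "Handicaped" string. The third pattern, which uses \b word boundaries, is an alternative not mentioned in the answer and is shown only as an assumption about another way to avoid matching "and" inside a word:

```python
import re

DATA = 'Joe, Dave, Professional, Handicaped, Ph.D. and Someone else'

naive = re.compile(r'\s*(?:,|and)\s*')       # also matches "and" inside "Handicaped"
fixed = re.compile(r'\s+and\s+|\s*,\s*')     # requires whitespace around "and"
bounded = re.compile(r'\s*(?:,|\band\b)\s*') # alternative: "and" only as a whole word

naive_result = naive.split(DATA)
fixed_result = fixed.split(DATA)
bounded_result = bounded.split(DATA)

print(naive_result)
print(fixed_result)
print(bounded_result)
```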


Edit 2

errrr .. ah ... ah ...

Sorry, I had not noticed that Professional, Ph.D. must not be split. But what is the criterion for not splitting on the comma in this string?

I chose this criterion: do not split on "a comma followed by text that contains dots before the next comma".

Another problem is the mix of whitespace and the word "and".

There is also the problem of leading and trailing whitespace.

Finally, I managed to write a regex pattern that handles many more cases than the previous one, even if some of those cases are somewhat artificial (for example, a stray "and" at the end of the string, and why not at the beginning too? etc.):

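 import re
 regx = re.compile(r'\s*,(?!(?:[^,.]+?\.)+[^,.]*?,)(?:\sand[,\s]+|\s)*|\s*and[,\s]+|[.\s]*\Z|\A\s*')
 DATA = ' Joe ,and, Dave , Professional, Ph.D., Handicapped and handyman , and Someone else and . .'
 # The output shown below is from Python 2. Python 3.7+ also splits on empty
 # matches, so there the unfiltered list can contain an extra trailing ''.
 print(repr(DATA))
 print('')
 print(regx.split(DATA))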

Result:

 ' Joe ,and, Dave , Professional, Ph.D., Handicapped and handyman , and Someone else and . .'
 
 ['', 'Joe', '', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else', '', '']


Filtering out the empty strings with print([x for x in regx.split(DATA) if x]) gives:

 ['Joe', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else'] 
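
The pattern and the filtering step can be wrapped into one helper. This is only a sketch: the names PATTERN and split_clean are hypothetical, and the pattern itself is copied unchanged from the answer:

```python
import re

# Pattern copied from the answer; PATTERN and split_clean are illustrative names.
PATTERN = re.compile(
    r'\s*,(?!(?:[^,.]+?\.)+[^,.]*?,)(?:\sand[,\s]+|\s)*'  # comma, unless dotted text precedes the next comma
    r'|\s*and[,\s]+'                                       # "and" followed by commas or whitespace
    r'|[.\s]*\Z'                                           # trailing dots and whitespace
    r'|\A\s*'                                              # leading whitespace
)

def split_clean(text):
    """Split with PATTERN and drop the empty strings it leaves behind."""
    return [piece for piece in PATTERN.split(text) if piece]

messy = ' Joe ,and, Dave , Professional, Ph.D., Handicapped and handyman , and Someone else and . .'
cleaned = split_clean(messy)
original = split_clean('Joe, Dave, Professional, Ph.D. and Someone else')
print(cleaned)
print(original)
```

Because the lookahead only protects a comma when another comma follows the dotted text, the string from the question (where " and " follows Ph.D. instead of a comma) still comes out with 'Professional' and 'Ph.D.' as separate items.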


Compare this with the result of Qtax's regular expression (the first answer above) on the same string:

 [' Joe ', 'and', 'Dave ', 'Professional, Ph.D.', 'Handicapped', 'handyman ', 'and Someone else', '. .'] 
0

Source: https://habr.com/ru/post/893840/

