Find capitalized words not at the beginning of a sentence using a regular expression

Using Python and regex, I am trying to find words in a piece of text that start with a capital letter but are not at the beginning of a sentence.

The best approach I can come up with is to check that the word is not preceded by a full stop followed by a space. I am fairly sure I need a negative lookbehind. This is what I have so far; it runs, but always returns nothing:

(?<!\.\s)\b[A-Z][a-z]*\b

I think the problem might be the use of [A-Z][a-z]* between the word boundaries \b, but I'm really not sure.

Thanks for the help.

+4
3 answers

Your regex seems to work:

    In [6]: import re

    In [7]: re.findall(r'(?<!\.\s)\b[A-Z][a-z]*\b', 'lookbehind. This is what I have')
    Out[7]: ['I']

Make sure you use the raw string ( r'...' ) when specifying the regular expression.
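
To illustrate why the raw string matters, here is a minimal sketch comparing the same pattern written both ways. The variable names are only for illustration. In a plain string literal Python converts '\b' to a backspace character before the regex engine sees it, which is one plausible reason for a pattern like this to return nothing.

```python
import re

text = 'lookbehind. This is what I have'

# Raw string: the backslashes reach the regex engine untouched,
# so \b means "word boundary".
raw_pattern = r'(?<!\.\s)\b[A-Z][a-z]*\b'
raw_matches = re.findall(raw_pattern, text)

# Plain string: Python turns '\b' into a backspace character (\x08) first.
# '\\.' and '\\s' are escaped by hand here so that only '\b' differs.
plain_pattern = '(?<!\\.\\s)\b[A-Z][a-z]*\b'
plain_matches = re.findall(plain_pattern, text)

print(raw_matches)    # ['I']
print(plain_matches)  # []
```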

If you have certain inputs where the regex doesn't work, add them to your question.

+2

Although you specifically asked for a regex, it may be worth considering a list comprehension as well. They are sometimes a little more readable (although in this case probably at the cost of efficiency). Here is one way to do it:

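    import string

    S = ("T'was brillig, and the slithy Toves were gyring and gimbling in the "
         "Wabe. All mimsy were the Borogoves, and the Mome Raths outgrabe.")
    LS = S.split(' ')
    # Pair each word with the one before it; '.' stands in as the
    # predecessor of the first word. string.ascii_uppercase replaces
    # string.uppercase, which no longer exists in Python 3.
    words = [x for (pre, x) in zip(['.'] + LS, LS + [' '])
             if (x[0] in string.ascii_uppercase) and (pre[-1] != '.')]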
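
One thing to be aware of with this approach: splitting on spaces leaves punctuation attached to the words. The sketch below reruns the same idea and adds a clean-up step that strips punctuation from both ends of each hit. That step is my own assumption about what you might want, not part of the comprehension above, and the variable names are illustrative.

```python
import string

text = ("T'was brillig, and the slithy Toves were gyring and gimbling in the "
        "Wabe. All mimsy were the Borogoves, and the Mome Raths outgrabe.")
tokens = text.split(' ')

# Pair each token with its predecessor; '.' stands in before the first token.
# Splitting on spaces keeps punctuation attached ('Wabe.', 'Borogoves,').
attached = [word for prev, word in zip(['.'] + tokens, tokens)
            if word[0] in string.ascii_uppercase and prev[-1] != '.']

# Hypothetical clean-up step: strip punctuation from both ends of each hit.
cleaned = [word.strip(string.punctuation) for word in attached]

print(attached)  # ['Toves', 'Wabe.', 'Borogoves,', 'Mome', 'Raths']
print(cleaned)   # ['Toves', 'Wabe', 'Borogoves', 'Mome', 'Raths']
```
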
+1

Try looping over your input with:

    (?!^)\b([A-Z]\w+)

and capture the first group. As you can see, you can use a negative lookahead, because the position you want to match is anywhere but the beginning of the line. A negative lookbehind would have the same effect.
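
A minimal sketch of that loop, assuming re.finditer and a made-up sample sentence:

```python
import re

# (?!^) rejects a match at the very start of the string;
# group 1 captures a capital letter followed by one or more word characters.
pattern = re.compile(r'(?!^)\b([A-Z]\w+)')
text = 'This is what I have. Another Sentence about Python'

captured = [m.group(1) for m in pattern.finditer(text)]
print(captured)  # ['Another', 'Sentence', 'Python']
```

Note that only the start of the input is excluded here: 'Another' begins a sentence and is still returned, and the single-letter 'I' is skipped because \w+ requires at least one more character after the capital.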

0

Source: https://habr.com/ru/post/1389505/

