Python regex does not match . (dot) as a word character

I have a regex that matches all three-character words in a string:

\b[^\s]{3}\b 

When I use it with a string:

 And the tiger attacked you. 

this is the result:

 regex = re.compile(r"\b[^\s]{3}\b")
 regex.findall(string)
 [u'And', u'the', u'you']

As you can see, this matches "you" as a three-character word, but I want the expression to treat "you." (including the ".") as a four-character word, so that it is not matched.

I have the same problem with ",", ";", ":" etc.

I am new to regex, but I guess this happens because these characters are treated as word boundaries.

Is there any way to do this?

Thanks in advance,

EDIT

Thanks for the answers, @BrenBarn and @Kendall Frey. I managed to find the regex I was looking for:

 (?<!\w)[^\s]{3}(?=$|\s) 
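
A quick check of this pattern against the sample sentence, as a minimal sketch (the variable names are only illustrative). Since "you." is followed by neither whitespace nor the end of the string after three characters, it should no longer be reported:

```python
import re

text = "And the tiger attacked you."

# (?<!\w)  : not preceded by a word character
# [^\s]{3} : exactly three non-whitespace characters
# (?=$|\s) : followed by end of string or whitespace
pattern = re.compile(r"(?<!\w)[^\s]{3}(?=$|\s)")

matches = pattern.findall(text)
print(matches)  # ['And', 'the']
```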
3 answers

If you want the word to be preceded and followed by whitespace (rather than by a period, as in your case), use lookarounds:

 (?<=\s)\w{3}(?=\s) 

If you need it to match punctuation as part of words (like the "." in "you."), then \w will not be adequate and you can use \S (anything other than whitespace) instead:

 (?<=\s)\S{3}(?=\s) 
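
To compare the two variants, here is a small sketch on a made-up sentence (the sample text and variable names are assumptions for illustration only). Note that both patterns require whitespace on each side, so a word at the very start or end of the string is not matched:

```python
import re

# Hypothetical sample text, chosen to contain "on." and words at both edges
text = "The cat sat on. the mat now"

# \w{3}: three word characters, so "on." is rejected
word_chars = re.findall(r"(?<=\s)\w{3}(?=\s)", text)

# \S{3}: three non-whitespace characters, so "on." is accepted
non_space = re.findall(r"(?<=\s)\S{3}(?=\s)", text)

print(word_chars)  # ['cat', 'sat', 'the', 'mat']
print(non_space)   # ['cat', 'sat', 'on.', 'the', 'mat']
```

"The" and "now" are missing from both results because there is no whitespace before the first word or after the last one.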

As described in the documentation:

A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
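
A minimal sketch of what that definition means for the "you." case (the sample string and variable names are only illustrative): there is a boundary between "u" and ".", but none between "." and the following space, because neither of those is a word character.

```python
import re

text = "you. on"

# "u" is a word character and "." is not, so \b matches between them
boundary_after_u = re.search(r"you\b", text) is not None

# "." and " " are both non-word characters, so there is no \b between them
boundary_after_dot = re.search(r"you\.\b", text) is not None

print(boundary_after_u)    # True
print(boundary_after_dot)  # False
```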

So, if you want a period to be treated as a word character rather than as a word boundary, you cannot use \b to mark the boundary. You will have to use your own character class. For example, you can use a regular expression such as \s[^\s]{3}\s if you want to match 3 non-whitespace characters surrounded by whitespace. If you still want the boundary to be zero-width (that is, it constrains the match but is not included in it), you can use lookarounds, something like (?<=\s)[^\s]{3}(?=\s) .
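
The difference between the two forms can be sketched as follows (the sample sentence and variable names are assumptions for illustration). The consuming version includes the surrounding whitespace in each match, and because that whitespace is used up, it can skip a word that shares a space with the previous match:

```python
import re

# Hypothetical sample text with several adjacent three-character words
text = "I saw the big red fox today"

# Whitespace is part of the match and is consumed
consuming = re.findall(r"\s[^\s]{3}\s", text)

# Whitespace is only asserted (zero-width), not consumed
zero_width = re.findall(r"(?<=\s)[^\s]{3}(?=\s)", text)

print(consuming)   # [' saw ', ' big ', ' fox ']
print(zero_width)  # ['saw', 'the', 'big', 'red', 'fox']
```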


This would be my approach. It also matches words that appear immediately after punctuation.

    import re

    r = r'''
        \b                  # word boundary
        (                   # capturing parentheses
            [^\s]{3}        # anything but whitespace, 3 times
            \b              # word boundary
            (?=[^\.,;:]|$)  # do not allow . or , or ; or : after the word boundary, but allow end of string
        |                   # OR
            [^\s]{2}        # anything but whitespace, 2 times
            [\.,;:]         # a . or , or ; or :
        )
    '''

    s = 'And the tiger attacked you. on,bla tw; th: fo.tes'
    print(re.findall(r, s, re.X))

Output:

 ['And', 'the', 'on,', 'bla', 'tw;', 'th:', 'fo.', 'tes'] 

Source: https://habr.com/ru/post/1478854/

