Find words made up only of [a-zA-Z] in a sentence using regex

I am trying to get all the words in a sentence with a regular expression, but only the words made up entirely of [a-zA-Z]. So for "I am a boy" I want {"I", "am", "a", "boy"}, but for "I a1m a b*y" I want {"I", "a"}, because "a1m" and "b*y" contain characters other than [a-zA-Z].

So, to get the words, I'm trying to check

  • if the word is at the beginning of the line, I only check that there is a space after it
  • otherwise, I check that there is a space both before and after the word
  • if it is the last word, I check that there is a space before it.

So, I got something like this in Java:

    Pattern p = Pattern.compile("^[a-zA-Z]+ |^[a-zA-Z]+$| [a-zA-Z]+$| [a-zA-Z]+");
    Matcher m = p.matcher("i am good");
    while (m.find()) {
        System.out.println(m.group());
    }

However, I only get "i " and " good". When "i " is matched, the space after "i" is consumed as part of the match, so the rest of the string is "am good". Since "am" is not at the beginning of the line and no longer has a space before it, it is not returned.
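To make the consumed space visible, here is a minimal sketch that prints the start and end index of every match of the pattern above. The class name ConsumedSpaceDemo and the helper describeMatches are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumedSpaceDemo {
    // The pattern from the question, unchanged.
    static final Pattern QUESTION_PATTERN =
            Pattern.compile("^[a-zA-Z]+ |^[a-zA-Z]+$| [a-zA-Z]+$| [a-zA-Z]+");

    // Hypothetical helper: describes each match as "[start,end) 'text'"
    // so that spaces swallowed by a match show up in the output.
    static List<String> describeMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = QUESTION_PATTERN.matcher(input);
        while (m.find()) {
            out.add("[" + m.start() + "," + m.end() + ") '" + m.group() + "'");
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints:
        // [0,2) 'i '
        // [4,9) ' good'
        for (String line : describeMatches("i am good")) {
            System.out.println(line);
        }
    }
}
```

The first match ends at index 2, so the next search starts at "am", where neither the ^ alternatives nor the leading-space alternatives can match.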

Can you guys give feedback? Is there a way to just look at the next character and not return the space?

3 answers

Assuming your regex engine supports lookahead/lookbehind assertions, you could use something like the following:

 (^|(?<= ))[a-zA-Z]+($|(?= )) 

Here is a brief description of what each component does:

(^|(?<= )) : This says: "If a word begins here, we are interested." In particular,
^ : matches the beginning of the line, or
(?<= ) : matches any point preceded by a space, without actually consuming the space itself. This is called a positive lookbehind assertion.

[a-zA-Z]+ : This should be obvious; it matches any run of consecutive ASCII letters.

($|(?= )) : This says: "If the word ends here, we are done." In particular,
$ : matches the end of the line, or
(?= ) : matches any point followed by a space, without actually consuming the space itself. This is called a positive lookahead assertion.
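As a rough illustration of the three components working together, the sketch below runs the lookaround pattern over both sentences from the question. The class name LookaroundWords and the method findWords are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundWords {
    // Word start: beginning of input or a preceding space (not consumed).
    // Word end: end of input or a following space (not consumed).
    static final Pattern WORD = Pattern.compile("(^|(?<= ))[a-zA-Z]+($|(?= ))");

    // Hypothetical helper: collects every match of WORD in the sentence.
    static List<String> findWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("i am good"));   // prints [i, am, good]
        System.out.println(findWords("I a1m a b*y")); // prints [I, a]
    }
}
```

Because the spaces are only looked at and never consumed, each space can serve as the "after" of one word and the "before" of the next.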


Note that this particular regular expression does not count a word as a word if it is followed by punctuation. That may not be what you want, but you described checking for spaces, so that is what this regular expression does. If you want to support words followed by simple punctuation, you can change the last atom to

 ($|(?=[ .,!?])) 

which will match the word if it is followed by a space, period, comma, exclamation mark or question mark. You can make this more elaborate if you want.
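Here is a small sketch of the punctuation-tolerant variant, assuming the same leading atom as before and only the trailing atom changed. The class name PunctuationWords, the method findWords, and the sample sentences are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PunctuationWords {
    // Same start condition as before; the end condition now also accepts
    // a following period, comma, exclamation mark or question mark.
    static final Pattern WORD =
            Pattern.compile("(^|(?<= ))[a-zA-Z]+($|(?=[ .,!?]))");

    // Hypothetical helper: collects every match of WORD in the sentence.
    static List<String> findWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("I am good, really!")); // prints [I, am, good, really]
        System.out.println(findWords("I a1m a b*y"));        // prints [I, a]
    }
}
```

The punctuation is only looked at, so it never ends up in the returned words.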


Could you use a simpler pattern, for example \b[A-Za-z]+\b ? (The metacharacter \b matches a word boundary, that is, the position between a word character (e.g. a letter) and a non-word character (e.g. a space or punctuation mark).)

Code

    Pattern p = Pattern.compile("\\b[A-Za-z]+\\b");
    Matcher m = p.matcher("i am good");
    while (m.find()) {
        System.out.println(m.group());
    }

Produces {"i", "am", "good"}.

Edit: As pointed out in the comments, the expression

 (?<=^|\s)[A-Za-z]+(?=\W*(?:\s*$|\s)) 

may work better. For row I a1m ab*y boy am is!! or I a1m ab*y boy am is!! or matching produces "I", "a", "boy", "am", "is", "or".

If "is!!" should be ignored instead, you can use the expression (?<=^|\s)[A-Za-z]+(?=$|\s) . For the previous example, it does not return "is", but it returns the other words ("I", "a", "boy", "am", "or").
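To compare the three expressions from this answer side by side, here is a minimal sketch that runs each of them over the same test string. The class name PatternComparison, the constant names, and the helper findAll are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternComparison {
    // The three expressions from this answer, as Java string literals.
    static final String WORD_BOUNDARY = "\\b[A-Za-z]+\\b";
    static final String LENIENT = "(?<=^|\\s)[A-Za-z]+(?=\\W*(?:\\s*$|\\s))";
    static final String STRICT = "(?<=^|\\s)[A-Za-z]+(?=$|\\s)";

    // Hypothetical helper: returns every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "I a1m a b*y boy am is!! or";
        System.out.println(findAll(WORD_BOUNDARY, input)); // [I, a, b, y, boy, am, is, or]
        System.out.println(findAll(LENIENT, input));       // [I, a, boy, am, is, or]
        System.out.println(findAll(STRICT, input));        // [I, a, boy, am, or]
    }
}
```

Note that the plain \b version also returns "b" and "y" from "b*y", because "*" is a non-word character and therefore creates word boundaries on both sides.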


This is just an alternative in case you do not want to use something like Kevin Ballard's answer. You can split the string into tokens and then check each token to make sure that it contains only [a-zA-Z].

To break it into tokens, do something like this:

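    // requires: import java.util.StringTokenizer;
    String message = "The text of the message to be scanned.";
    StringTokenizer st = new StringTokenizer(message); // splits on whitespace by default
    int idx = 0; // number of tokens seen so far
    while (st.hasMoreTokens()) {
        checkWord(st.nextToken()); // checkWord is the function you still need to write
        idx++;
    }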

And then you have to write a function that checks whether a token consists only of [a-zA-Z]. Since the tokens contain no spaces to worry about, I think it will be much easier for you to deal with these tokens than with the full string.
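One possible way to fill in that function is sketched below. The answer does not say what checkWord should return, so the boolean return type here is an assumption, as are the class name TokenWords and the wrapper method lettersOnlyWords.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenWords {
    // Assumed signature: true when the whole token consists of ASCII letters.
    // String.matches requires the entire token to match the pattern.
    static boolean checkWord(String token) {
        return token.matches("[a-zA-Z]+");
    }

    // Hypothetical wrapper: tokenizes on whitespace and keeps letter-only tokens.
    static List<String> lettersOnlyWords(String message) {
        List<String> words = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(message);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (checkWord(token)) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(lettersOnlyWords("I a1m a b*y")); // prints [I, a]
    }
}
```

With this approach a token such as "scanned." is rejected because of the trailing period, which matches the spaces-only rule described in the question.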

Good luck.


Source: https://habr.com/ru/post/1391188/

