Java regex includes the newline in the match

I am trying to match a regular expression against textbook definitions that I get from a website. The text always has the word on its own line, followed by a newline and then the definition. For instance:

Zither
Definition: An instrument of music used in Austria and Germany It has from thirty to forty wires strung across a shallow sounding board which lies horizontally on a table before the performer who uses both hands in playing on it Not to be confounded with the old lute shaped cittern or cithern

In my attempts to get only the word (in this case "Zither"), I keep getting the newline character along with it.

I tried both ^(\w+)\s and ^(\S+)\s without much luck. I thought that maybe ^(\S+)$ would work, but that does not seem to match the word at all. I tested on Rubular (http://rubular.com/r/LPEHCnS0ri), which successfully matches all of my attempts the way I want, even though Java does not.

Here is my snippet:

    String str = ...; // Here the string is assigned a word and definition taken from the internet like given in the example above.
    Pattern rgx = Pattern.compile("^(\\S+)$");
    Matcher mtch = rgx.matcher(str);
    if (mtch.find()) {
        String result = mtch.group();
        terms.add(new SearchTerm(result, System.nanoTime()));
    }

I could easily work around this by trimming the resulting string, but that seems like it should be unnecessary when I am already using a regex.

All help is much appreciated. Thanks in advance!

+6
5 answers

Try using the Pattern.MULTILINE flag:

 Pattern rgx = Pattern.compile("^(\\S+)$", Pattern.MULTILINE); 

This makes ^ and $ match at the start and end of each line in your input; by default they match only at the beginning and end of the entire input string.
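
As a rough illustration of the difference, here is a minimal sketch that runs the same pattern with and without the flag. The class name, the firstMatch helper, and the shortened sample input are assumptions made for this example, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineDemo {
    // Hypothetical helper: returns capture group 1 of the first match, or null if nothing matches.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed sample input: the word, a line break, then the definition.
        String input = "Zither\nDefinition: An instrument of music";

        Pattern plain = Pattern.compile("^(\\S+)$");
        Pattern multiline = Pattern.compile("^(\\S+)$", Pattern.MULTILINE);

        System.out.println(firstMatch(plain, input));     // null
        System.out.println(firstMatch(multiline, input)); // Zither
    }
}
```

Without the flag, $ cannot match right after "Zither" because more text follows the line break, so find() fails. With the flag, $ matches just before the line break.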

Although it makes no difference for this pattern, note that Matcher.group() returns the entire match, while Matcher.group(int) returns the text matched by a specific capture group (...), selected by the number you pass. Your pattern defines one capture group, which holds the text you want. If you had included \s in your pattern, as you wrote you tried, then Matcher.group() would include that whitespace in its return value.
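
To make the group() versus group(1) distinction concrete, here is a small sketch using the ^(\w+)\s pattern from the question. The class name, the wholeAndCaptured helper, and the shortened input are hypothetical and only serve the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Hypothetical helper: returns { group(), group(1) } for the first match, or null if none.
    static String[] wholeAndCaptured(String input) {
        Matcher m = Pattern.compile("^(\\w+)\\s").matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        String[] r = wholeAndCaptured("Zither\nDefinition: An instrument of music");
        // group() is the whole match, so it ends with the line break consumed by \s.
        System.out.println("[" + r[0] + "]");
        // group(1) is only what the parentheses captured.
        System.out.println("[" + r[1] + "]"); // [Zither]
    }
}
```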

+8

With regular expressions, group 0 is always the complete match. In your case, you need group 1, not group 0.

So changing mtch.group() to mtch.group(1) should do the trick:

    String str = ...; // Here the string is assigned a word and definition taken from the internet like given in the example above.
    Pattern rgx = Pattern.compile("^(\\w+)\\s");
    Matcher mtch = rgx.matcher(str);
    if (mtch.find()) {
        String result = mtch.group(1); // group 1 holds only the captured word
        terms.add(new SearchTerm(result, System.nanoTime()));
    }

+2

Just replace:

 String result = mtch.group(); 

With:

 String result = mtch.group(1); 

This will limit your output to the contents of the capture group (e.g. (\\w+) ).

+1

Late answer, but if you are not using Pattern and Matcher, you can use this DOTALL alternative in your regex string:

 (?s)[Your Expression] 

Basically, (?s) tells the dot to match all characters, including line breaks.
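
As a sketch of how this could look with plain String methods, the example below uses String.matches and String.replaceFirst. The class name, the firstWord helper, and the three-line sample input are assumptions made for illustration.

```java
public class DotallDemo {
    // Hypothetical helper: keeps only the leading word and drops everything after it.
    static String firstWord(String input) {
        // (?s) lets ".*" run across line breaks, so the whole input is matched and replaced by group 1.
        return input.replaceFirst("(?s)^(\\w+)\\s.*$", "$1");
    }

    public static void main(String[] args) {
        // Assumed sample input with a definition that spans two lines.
        String input = "Zither\nDefinition: An instrument of music\nused in Austria and Germany";

        System.out.println(input.matches("Zither.*"));     // false
        System.out.println(input.matches("(?s)Zither.*")); // true
        System.out.println(firstWord(input));              // Zither
    }
}
```

Note that (?s) only changes what the dot matches. It does not change where ^ and $ match; the inline equivalent of Pattern.MULTILINE is (?m).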

Details: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html

+1

Try the following:

    import java.util.regex.Pattern;

    public class Main {
        /* The regex pattern: ^(\w+)\r?\n(.*)$ */
        private static final Pattern REGEX_PATTERN = Pattern.compile("^(\\w+)\\r?\\n(.*)$");

        public static void main(String[] args) {
            String input = "Zither\nDefinition: An instrument of music";

            System.out.println(REGEX_PATTERN.matcher(input).matches()); // prints "true"
            System.out.println(REGEX_PATTERN.matcher(input).replaceFirst("$1 = $2")); // prints "Zither = Definition: An instrument of music"
            System.out.println(REGEX_PATTERN.matcher(input).replaceFirst("$1")); // prints "Zither"
        }
    }

0

Source: https://habr.com/ru/post/951841/

