Pattern/Matcher vs. String.split(): which should I use?

First time post.

First, I know how to use both Pattern/Matcher and String.split(). My question is which one is best for my example, and why? Suggestions for better alternatives are also welcome.

Task: I need to extract an unknown NOUN between two known regular expressions in an unknown string.

My solution: find where the noun begins and ends (from regexp 1 and 2), then use substring to extract the noun.

    String line = "unknownXoooXNOUNXccccccXunknown";
    int goal = 12;
    String regexp1 = "Xo+X";
    String regexp2 = "Xc+X";

  • I need to find the index position AFTER the first regular expression.
  • I need to find the index position BEFORE the second regular expression.

A) I can use Pattern and Matcher

    Pattern p = Pattern.compile(regexp1);
    Matcher m = p.matcher(line);
    if (m.find()) {
        int afterRegex1 = m.end();
    } else {
        throw new IllegalArgumentException(); // TODO Exception Management
    }

B) I can use String Split

    String[] split = line.split(regexp1, 2);
    if (split.length != 2) {
        throw new UnsupportedOperationException(); // TODO Exception Management
    }
    int afterRegex1 = line.indexOf(split[1]);

Which approach should I use, and why? I do not know which is more efficient in time and memory. Both are about equally readable to me.

+6
4 answers

I would do it like this:

    String line = "unknownXoooXNOUNXccccccXunknown";
    String regex = "Xo+X(.*?)Xc+X";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(line);
    if (m.find()) {
        String noun = m.group(1);
    }

The (.*?) makes the inner match for NOUN reluctant. This protects us from the case where the closing pattern appears again in the unknown part of the string.
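To illustrate the difference, here is a small sketch that runs the reluctant and the greedy form of the pattern against a made-up input in which the closing delimiter occurs twice. The class name, the helper method and the input string are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Returns the text captured by group 1, or null when the regex does not match.
    static String extract(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: "XccX" and "XcccX" both match the closing pattern Xc+X.
        String line = "unknownXoooXNOUNXccXtailXcccXunknown";

        // Reluctant: stops at the first closing delimiter.
        System.out.println(extract("Xo+X(.*?)Xc+X", line)); // NOUN

        // Greedy: runs on to the last closing delimiter.
        System.out.println(extract("Xo+X(.*)Xc+X", line));  // NOUNXccXtail
    }
}
```

With the greedy form, everything up to the last closing delimiter ends up in the group, which is rarely what you want here.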

EDIT

This works because (.*?) defines a capture group. Only one such group is defined in the pattern, so it gets index 1 (the argument in m.group(1)). Groups are indexed from left to right, starting at 1. If the pattern were defined as follows:

 String regex = "(Xo+X)(.*?)(Xc+X)"; 

Then there would be three capture groups, such that

    m.group(1); // yields "XoooX"
    m.group(2); // yields "NOUN"
    m.group(3); // yields "XccccccX"

There is also a group 0, which matches the whole pattern; m.group(0) is equivalent to this:

 m.group(); // yields "XoooXNOUNXccccccX" 

For more information on what you can do with Matcher, including ways to get the start and end positions of your pattern in the source string, see the Matcher Javadoc.
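As a rough sketch of the position lookups mentioned above: Matcher.start(int) and Matcher.end(int) report where a given group begins and ends in the input. The class and method names below are made up for illustration; the pattern and input are the ones from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupPositions {
    // Returns {start, end} of capture group 1, or null when there is no match.
    // end is exclusive, so it can be passed straight to String.substring.
    static int[] nounBounds(String line) {
        Matcher m = Pattern.compile("Xo+X(.*?)Xc+X").matcher(line);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(1), m.end(1) };
    }

    public static void main(String[] args) {
        String line = "unknownXoooXNOUNXccccccXunknown";
        int[] bounds = nounBounds(line);
        // prints: from 12 to 16 is NOUN
        System.out.println("from " + bounds[0] + " to " + bounds[1]
                + " is " + line.substring(bounds[0], bounds[1]));
    }
}
```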

+5

You should use String.split() for readability unless you are in a tight loop.

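Per the split() Javadoc, str.split(regex, n) yields the same result as Pattern.compile(regex).split(str, n), so the pattern is recompiled on every call. If you are in a tight loop, you can optimize that away by compiling the Pattern once and reusing it.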
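A minimal sketch of that optimization, assuming the opening delimiter from the question: the Pattern is compiled once and reused for every line. The class and method names are hypothetical. Note that this sketch computes the index from the string lengths rather than with indexOf(split[1]) as in the question, because the remainder text may also occur earlier in the line.

```java
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Compiled once, reused for every call instead of recompiling per split().
    private static final Pattern OPENING = Pattern.compile("Xo+X");

    // Returns the index just after the first match of the opening delimiter.
    static int indexAfterOpening(String line) {
        String[] parts = OPENING.split(line, 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("opening delimiter not found");
        }
        // parts[1] is the suffix of line after the delimiter,
        // so its start index follows from the two lengths.
        return line.length() - parts[1].length();
    }

    public static void main(String[] args) {
        String[] lines = {
            "unknownXoooXNOUNXccccccXunknown",
            "XoXNOUNXcX"
        };
        for (String line : lines) {
            System.out.println(indexAfterOpening(line)); // 12, then 3
        }
    }
}
```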

+3

It looks like you want a single occurrence. For that, simply use:

 input.replaceAll(".*Xo+X(.*)Xc+X.*", "$1") 

For efficiency, compile the Pattern once and use pattern.matcher(input).replaceAll(...).

If your input contains line breaks, use Pattern.DOTALL or the (?s) modifier.
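Put together, the two suggestions might look like the sketch below. The class name, method name and the multi-line input are assumptions for illustration; the regex is the one from this answer.

```java
import java.util.regex.Pattern;

class ReplaceAllDemo {
    // Compiled once; DOTALL lets '.' match line terminators as well.
    private static final Pattern WHOLE =
            Pattern.compile(".*Xo+X(.*)Xc+X.*", Pattern.DOTALL);

    // Replaces the whole input with the text captured between the delimiters.
    static String extract(String input) {
        return WHOLE.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Hypothetical multi-line input.
        String input = "first line\nXoooXNOUNXccccccX\nlast line";
        System.out.println(extract(input)); // NOUN
    }
}
```

One thing to keep in mind with this approach: when the pattern does not match at all, replaceAll returns the input unchanged, so a missing noun is not signalled by an exception or a null.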


If you want to use split, consider using Guava's Splitter. It behaves more sensibly and also accepts a Pattern, which is good for speed.

+2

If you really need the positions, you can do it like this:

    String line = "unknownXoooXNOUNXccccccXunknown";
    String regexp1 = "Xo+X";
    String regexp2 = "Xc+X";
    Matcher m = Pattern.compile(regexp1).matcher(line);
    if (m.find()) {
        int start = m.end();
        // usePattern() keeps the matcher's position, so the next find()
        // continues searching after the first match.
        if (m.usePattern(Pattern.compile(regexp2)).find()) {
            final int end = m.start();
            System.out.println("from " + start + " to " + end + " is " + line.substring(start, end));
        }
    }

But if you just need the word between them, I recommend the approach Ian MacLaird has shown.

0

Source: https://habr.com/ru/post/956106/

