Why is most string manipulation in Java based on regexp?

Question

Why is most string manipulation in Java based on regexp?

There are many methods in Java that are all related to string manipulation. The simplest example is the String.split ("something") method.

Now the actual definition of many of these methods is that they all take a regular expression as their input parameter (s). What then does all the very powerful building blocks do.

Now there are two effects that you will see in many of these methods:

They recompile the expression every time the method is called. Thus, they affect performance.
I found that in most situations of "real life" these methods are called "fixed" texts. The most common use of the split method is even worse: it is usually called with a single char (usually a ', a'; 'or' & ') to split.

Thus, it’s not only that the default methods are powerful, they also seem suppressed for what they are actually used for. Inside, we developed the "fastSplit" method, which is split into fixed lines. I wrote a test at home to find out how much faster I could do it if it were known as a single char. Both are significantly faster than the "standard" separation method.

So I was wondering: why was the Java API chosen as it is now? What was a good reason for this, instead of having something like split (char) and split (String) and splitRegex (String) ??

Update: I hit a few calls to find out how long different line split methods will have.

Overview: it makes a big difference!

I did 10,000,000 iterations for each test case, always using input

"aap,noot,mies,wim,zus,jet,teun"

and always use ',' or "," as the argument to split.

This is what I got on my Linux system (this is the Atom D510 block, so it's a bit slow):

 fastSplit STRING Test 1 : 11405 milliseconds: Split in several pieces Test 2 : 3018 milliseconds: Split in 2 pieces Test 3 : 4396 milliseconds: Split in 3 pieces homegrown fast splitter based on char Test 4 : 9076 milliseconds: Split in several pieces Test 5 : 2024 milliseconds: Split in 2 pieces Test 6 : 2924 milliseconds: Split in 3 pieces homegrown splitter based on char that always splits in 2 pieces Test 7 : 1230 milliseconds: Split in 2 pieces String.split(regex) Test 8 : 32913 milliseconds: Split in several pieces Test 9 : 30072 milliseconds: Split in 2 pieces Test 10 : 31278 milliseconds: Split in 3 pieces String.split(regex) using precompiled Pattern Test 11 : 26138 milliseconds: Split in several pieces Test 12 : 23612 milliseconds: Split in 2 pieces Test 13 : 24654 milliseconds: Split in 3 pieces StringTokenizer Test 14 : 27616 milliseconds: Split in several pieces Test 15 : 28121 milliseconds: Split in 2 pieces Test 16 : 27739 milliseconds: Split in 3 pieces

As you can see, this is of great importance if you have many "fixed char" partitions.

To give you guys insight; I am currently in the Apache log files and the Hadoop arena with data from a large website. So this material really matters to me :)

Something I didn’t take into account here is the garbage collector. As far as I can tell, compiling a regular expression in Pattern / Matcher / .. will highlight a lot of objects that need to be collected for some time. Therefore, it is possible that in the end the differences between these versions are even greater .... or less.

My findings so far:

Only optimize this if you have many lines to split.
If you use regex methods, always recompile if you reuse the same template.
Forget (deprecated) StringTokenizer
If you want to split into one char, use a special method, especially if you only need to split it into a certain number of parts (for example, ... 2).

PS I give you all my native schism using char methods to play (under the license that everything on this site falls under :)). I have never tested them. Enjoy.

 private static String[] stringSplitChar(final String input, final char separator) { int pieces = 0; // First we count how many pieces we will need to store ( = separators + 1 ) int position = 0; do { pieces++; position = input.indexOf(separator, position + 1); } while (position != -1); // Then we allocate memory final String[] result = new String[pieces]; // And start cutting and copying the pieces. int previousposition = 0; int currentposition = input.indexOf(separator); int piece = 0; final int lastpiece = pieces - 1; while (piece < lastpiece) { result[piece++] = input.substring(previousposition, currentposition); previousposition = currentposition + 1; currentposition = input.indexOf(separator, previousposition); } result[piece] = input.substring(previousposition); return result; } private static String[] stringSplitChar(final String input, final char separator, final int maxpieces) { if (maxpieces <= 0) { return stringSplitChar(input, separator); } int pieces = maxpieces; // Then we allocate memory final String[] result = new String[pieces]; // And start cutting and copying the pieces. int previousposition = 0; int currentposition = input.indexOf(separator); int piece = 0; final int lastpiece = pieces - 1; while (currentposition != -1 && piece < lastpiece) { result[piece++] = input.substring(previousposition, currentposition); previousposition = currentposition + 1; currentposition = input.indexOf(separator, previousposition); } result[piece] = input.substring(previousposition); // All remaining array elements are uninitialized and assumed to be null return result; } private static String[] stringChop(final String input, final char separator) { String[] result; // Find the separator. final int separatorIndex = input.indexOf(separator); if (separatorIndex == -1) { result = new String[1]; result[0] = input; } else { result = new String[2]; result[0] = input.substring(0, separatorIndex); result[1] = input.substring(separatorIndex + 1); } return result; }

+42

java string regex language-design

Niels Basjes Jul 29 '10 at 12:55

source share

9 answers

I would not say that most string manipulations are based on regular expressions in Java. In fact, it is only about split and replaceAll / replaceFirst . But I agree, this is a big mistake.

Besides the ugliness, when a low-level language function (string) becomes dependent on a higher-level function (regular expression), it is also an unpleasant trap for new users who, of course, can assume that the method with the signature String.replaceAll(String, String) is a string replacement function. Code written in accordance with this assumption will look as if it works until special regular expression characters appear, and at this point you have confusing, difficult to debug (and maybe even important for security) mistakes.

It's funny that a language that can be so pedantically strict with respect to input made a messy mistake by treating a string and a regular expression as the same thing. It is less fun that there is still no built-in method for simply replacing or splitting a string. You should use regular expression replacement with the string Pattern.quote d. And you even get it only with Java 5. Hopeless.

@Tim Pietzcker:

Are there other languages that do the same?

JavaScript strings are partially modeled in Java and are also messy in the case of replace() . Turning into a string, you get a replacement for a simple string, but it replaces only the first match, which is rarely necessary. To get a replacement, you must pass a RegExp object with the /g flag, which again has problems if you want to dynamically create it from a string (there is no RegExp.quote built-in method in JS). Fortunately, split() is purely string, so you can use the idiom:

 s.split(findstr).join(replacestr)

Plus, of course, Perl does absolutely everything with regexen because it is just perverted in this way.

(This comment is more than an answer, but too big for one. Why did Java do this? I don’t know, they had a lot of mistakes in the early days. Some of them were fixed. I suspect they thought about adding regular expression functionality in the box labeled Pattern in 1.0, the String design will be cleaner to fit.)

+12

bobince Jul 29 '10 at 13:26

source share

I guess the good reason is that they can simply pass buck to the regex method, which does all the real hard lifting for all string methods. I assume that they thought that if they already had a working solution, it was less efficient, from the point of view of development and maintenance, to invent a wheel for each string manipulation method.

+2

Scott M. Jul 29 '10 at 13:11

source share

Interesting discussion!

Java was not originally intended as a batch programming language. Thus, the API out of the box is more configured to perform one “replacement”, one “parsing”, etc., with the exception of initializing the application, when the application may have parsing heaps of configuration files.

Consequently, the optimization of these APIs was sacrificed at the altar of simplicity of IMO. But the question raises an important point. Python’s desire to keep the regex different from non-regex in its API is because Python can be used as a great scripting language. On UNIX, source versions of fgrep did not support regex.

I was involved in a project where we needed to do some ETL work in java. At that time, I remember that I came up with what you mentioned in my question.

+2

raja kolluru Aug 03 '10 at 3:00

source share

When looking at the Java String class, using a regular expression seems reasonable, and there are alternatives if the regular expression is undesirable:

http://java.sun.com/javase/6/docs/api/java/lang/String.html

boolean matches(String regex) - A regular expression seems appropriate, otherwise you could just use equals

String replaceAll/replaceFirst(String regex, String replacement) - There are equivalents that accept CharSequence instead, preventing regex.

String[] split(String regex, int limit) - A powerful, but expensive split, you can use StringTokenizer to separate with tokens.

These are the only functions I have seen that have taken a regular expression.

Edit: Seeing that StringTokenizer is deprecated, I postponed Péter Török's answer to precompile the regular expression for split instead of using a tokenizer.

+1

Brandon Horsley Jul 29 '10 at 13:16

source share

I suspect that the reason that things like String # split (String) uses regexp under the hood is because it includes less extraneous code in the Java class library. The state machine that results from dividing into something like , or space, is so simple that it is unlikely to be much slower to execute than the statically implemented equivalent using StringCharacterIterator.

In addition, a statically implemented solution will complicate the optimization of runtime using JIT, because it will be another block of code that also requires analysis of the hot code. Using existing Pattern algorithms regularly in the library means that they are more likely candidates for JIT compilation.

+1

Alain O'Dea Jul 29 '10 at 13:18

source share

Very good question.

I suppose when the designers sat down to look at this (and not very long, it seems), they came to him in terms of the fact that it should be designed so that it can use as many different possibilities as possible. Regular expressions offer flexibility.

They did not think in terms of efficiency. There is a Java Community Process to enhance this.

You have considered using the java.util.regex.Pattern class, in which you compile the expression once, and then use different lines.

 Pattern exp = Pattern.compile(":"); String[] array = exp.split(sourceString1); String[] array2 = exp.split(sourceString2);

+1

Adrian Regan Jul 29 '10 at 13:20

source share

The answer to your question is that the Java kernel API did it wrong. For everyday work, you might consider using the CharMatcher library of Guava, which fills the gap perfectly.

0

user18943 Jul 29 '10 at 13:19

source share

... why was the Java API chosen as it is now?

Short answer: this is not so. No one ever dared to use regular expression methods for methods other than regex in the String API, it just worked that way.

I always realized that Java developers deliberately kept string manipulation methods to a minimum to avoid bloating the API. But when regular expression support appeared in JDK 1.4, they, of course, had to add some convenient methods to the String API.

So now, users are faced with the choice between the extremely powerful and flexible regular methods and the basic methods that Java has always offered.

0

Alan Moore Jul 31 '10 at 9:14

source share

Péter Török · Accepted Answer · 2010-07-29 13:12

Note that the regular expression does not need to be recompiled every time. From Javadoc :

Calling this method of the form str.split(regex, n) gives the same result as the expression

 Pattern.compile(regex).split(str, n)

That is, if you are concerned about performance, you can pre-build the template and then reuse it:

 Pattern p = Pattern.compile(regex); ... String[] tokens1 = p.split(str1); String[] tokens2 = p.split(str2); ...

instead

 String[] tokens1 = str1.split(regex); String[] tokens2 = str2.split(regex); ...

I believe that the main reason for this API design is usability. Since regular expressions include all "fixed" strings / characters, this makes it easy for the API to use one method instead of several. And if someone is worried about performance, the regex can still be precompiled, as shown above.

My feeling (which I cannot confirm with any statistics) is that most cases of String.split() are used in a context where performance is not an issue. For example. this is a one-time action, or the difference in performance is negligible compared to other factors. IMOs are rare cases where you break lines using the same regular expression thousands of times in a tight loop where performance optimization really makes sense.

It would be interesting to see a comparison of the performance of the implementation of regular expression matching with fixed strings / characters compared with the comparison of a specialist specialized with them. The difference may not be large enough to warrant a single implementation.

Why is most string manipulation in Java based on regexp?

More articles: