There are quite a few methods in Java that all have to do with manipulating strings. The simplest example is the String.split("something") method.
Now the actual definition of many of these methods is that they all take a regular expression as their input parameter(s). That makes them all very powerful building blocks.
Now there are two effects that you will see in many of these methods:
- They recompile the expression every time the method is called. Thus, they affect performance.
- I have found that in most "real life" situations these methods are called with "fixed" texts. The most common use of the split method is even worse: it is usually called with a single char (usually a ' ', a ';' or a '&') to split on.
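To make the "everything is a regex" point concrete, here is a minimal sketch of what happens when a "fixed" separator happens to be a regex metacharacter. The class and method names (SplitRegexDemo, splitAsRegex, splitAsLiteral) are made up for illustration; only String.split and Pattern.quote are real API.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitRegexDemo {
    // The separator is interpreted as a regular expression, which is what String.split does.
    static String[] splitAsRegex(String input, String separator) {
        return input.split(separator);
    }

    // The separator is quoted first, so it is matched literally.
    static String[] splitAsLiteral(String input, String separator) {
        return input.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        // "." matches every character, so only empty pieces remain and all are dropped.
        System.out.println(Arrays.toString(splitAsRegex("a.b.c", ".")));   // prints []
        System.out.println(Arrays.toString(splitAsLiteral("a.b.c", "."))); // prints [a, b, c]
    }
}
```

So even a caller who only ever wants a fixed separator has to think about regex syntax.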
So it's not only that the default methods are too powerful, they also seem overkill for what they are actually used for. In-house we developed a "fastSplit" method that splits on fixed strings. I wrote a test at home to see how much faster I could do it if the separator was known to be a single char. Both are significantly faster than the "standard" split method.
So I was wondering: why was the Java API chosen the way it is now? What was the good reason to go for this instead of having something like split(char) and split(String) and splitRegex(String)?
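For illustration only, a fixed-string variant along the lines of the split(String) wished for above could look roughly like this. This is a sketch, not the in-house "fastSplit" mentioned earlier; the class name LiteralSplit and the method splitLiteral are hypothetical, and it assumes trailing empty pieces should be kept (String.split drops them by default).

```java
import java.util.ArrayList;
import java.util.List;

class LiteralSplit {
    // Hypothetical split(String) variant: the separator is plain text, never a regex.
    // Assumption: trailing empty pieces are kept, unlike String.split.
    static String[] splitLiteral(String input, String separator) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        int hit = input.indexOf(separator, start);
        while (hit != -1) {
            // Copy the text between the previous separator and this one.
            pieces.add(input.substring(start, hit));
            start = hit + separator.length();
            hit = input.indexOf(separator, start);
        }
        // Whatever follows the last separator is the final piece.
        pieces.add(input.substring(start));
        return pieces.toArray(new String[0]);
    }
}
```

Because it relies only on indexOf and substring, no Pattern or Matcher objects are created per call.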
Update: I put together a few tests to see how much time the different ways of splitting a string take.
Short summary: it makes a big difference!
I did 10,000,000 iterations for each test case, always using input
"aap,noot,mies,wim,zus,jet,teun"
and always using ',' or "," as the split argument.
This is what I got on my Linux system (it's an Atom D510 box, so it's a bit slow):
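A timing loop of this kind can be sketched as follows. This is not the harness that produced the numbers in this post, just a minimal assumption of what such a test might look like; SplitTiming, timeMillis and the sink field are invented names. A naive loop like this is sensitive to JIT warm-up and dead-code elimination, so treat its output as a rough indication only.

```java
import java.util.function.IntSupplier;

class SplitTiming {
    static final String INPUT = "aap,noot,mies,wim,zus,jet,teun";

    // Written after each run so the JIT cannot discard the work as unused.
    static volatile int sink;

    // Runs the task the given number of times and returns the elapsed wall-clock milliseconds.
    static long timeMillis(IntSupplier task, int iterations) {
        int total = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            total += task.getAsInt();
        }
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        sink = total;
        return elapsed;
    }

    public static void main(String[] args) {
        int iterations = 10_000_000;
        long several = timeMillis(() -> INPUT.split(",").length, iterations);
        long two = timeMillis(() -> INPUT.split(",", 2).length, iterations);
        System.out.println("String.split, several pieces: " + several + " milliseconds");
        System.out.println("String.split, 2 pieces: " + two + " milliseconds");
    }
}
```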
    fastSplit STRING
    Test  1 : 11405 milliseconds: Split in several pieces
    Test  2 :  3018 milliseconds: Split in 2 pieces
    Test  3 :  4396 milliseconds: Split in 3 pieces

    homegrown fast splitter based on char
    Test  4 :  9076 milliseconds: Split in several pieces
    Test  5 :  2024 milliseconds: Split in 2 pieces
    Test  6 :  2924 milliseconds: Split in 3 pieces

    homegrown splitter based on char that always splits in 2 pieces
    Test  7 :  1230 milliseconds: Split in 2 pieces

    String.split(regex)
    Test  8 : 32913 milliseconds: Split in several pieces
    Test  9 : 30072 milliseconds: Split in 2 pieces
    Test 10 : 31278 milliseconds: Split in 3 pieces

    String.split(regex) using precompiled Pattern
    Test 11 : 26138 milliseconds: Split in several pieces
    Test 12 : 23612 milliseconds: Split in 2 pieces
    Test 13 : 24654 milliseconds: Split in 3 pieces

    StringTokenizer
    Test 14 : 27616 milliseconds: Split in several pieces
    Test 15 : 28121 milliseconds: Split in 2 pieces
    Test 16 : 27739 milliseconds: Split in 3 pieces
As you can see, it makes a big difference if you have a lot of "fixed char" splits to do.
To give you some insight: I am currently working with Apache log files in the Hadoop arena, with the data of a large website. So this stuff really matters to me :)
Something I haven't factored in here is the garbage collector. As far as I can tell, compiling a regular expression into a Pattern / Matcher / .. allocates a lot of objects that need to be collected at some point. So perhaps in the long run the differences between these versions are even bigger .... or smaller.
My findings so far:
- Only optimize this if you have many lines to split.
- If you use the regex methods, always precompile the pattern if you use the same one repeatedly.
- Forget the (legacy) StringTokenizer.
- If you want to split on a single char, use a custom method, especially if you only need to split into a specific number of pieces (for example, 2).
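The precompiling advice from the list above can be sketched like this, using the standard java.util.regex.Pattern API. The class name PrecompiledSplit and its method names are only illustrative.

```java
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Compiled once when the class is loaded and reused for every call.
    private static final Pattern COMMA = Pattern.compile(",");

    // Splits into as many pieces as there are separators + 1 (trailing empty pieces are dropped).
    static String[] splitAll(String input) {
        return COMMA.split(input);
    }

    // Splits into at most 'limit' pieces; the last piece holds the unsplit remainder.
    static String[] splitLimited(String input, int limit) {
        return COMMA.split(input, limit);
    }
}
```

For example, splitLimited("aap,noot,mies", 2) gives the two pieces "aap" and "noot,mies".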
P.S. I'm giving you all my homegrown split-by-char methods to play with (under the license that everything on this site falls under :)). I have never tested them. Have fun.
    // Splits the input on every occurrence of the separator char.
    private static String[] stringSplitChar(final String input, final char separator) {
        int pieces = 0;

        // First we count how many pieces we will need to store ( = separators + 1 ).
        // Start at -1 so that a separator at index 0 is counted as well.
        int position = -1;
        do {
            pieces++;
            position = input.indexOf(separator, position + 1);
        } while (position != -1);

        // Then we allocate memory
        final String[] result = new String[pieces];

        // And start cutting and copying the pieces.
        int previousposition = 0;
        int currentposition = input.indexOf(separator);
        int piece = 0;
        final int lastpiece = pieces - 1;
        while (piece < lastpiece) {
            result[piece++] = input.substring(previousposition, currentposition);
            previousposition = currentposition + 1;
            currentposition = input.indexOf(separator, previousposition);
        }
        result[piece] = input.substring(previousposition);

        return result;
    }

    // Splits the input into at most maxpieces pieces; the last piece holds the remainder.
    private static String[] stringSplitChar(final String input, final char separator, final int maxpieces) {
        if (maxpieces <= 0) {
            return stringSplitChar(input, separator);
        }
        int pieces = maxpieces;

        // Then we allocate memory
        final String[] result = new String[pieces];

        // And start cutting and copying the pieces.
        int previousposition = 0;
        int currentposition = input.indexOf(separator);
        int piece = 0;
        final int lastpiece = pieces - 1;
        while (currentposition != -1 && piece < lastpiece) {
            result[piece++] = input.substring(previousposition, currentposition);
            previousposition = currentposition + 1;
            currentposition = input.indexOf(separator, previousposition);
        }
        result[piece] = input.substring(previousposition);

        // All remaining array elements are uninitialized and assumed to be null
        return result;
    }

    // Cuts the input in two at the first separator; returns a single piece if there is none.
    private static String[] stringChop(final String input, final char separator) {
        String[] result;
        // Find the separator.
        final int separatorIndex = input.indexOf(separator);
        if (separatorIndex == -1) {
            result = new String[1];
            result[0] = input;
        } else {
            result = new String[2];
            result[0] = input.substring(0, separatorIndex);
            result[1] = input.substring(separatorIndex + 1);
        }
        return result;
    }