Replacing multiple substrings in large strings

Question

Replacing multiple substrings in large strings

One of our module capabilities is highly dependent on how we replace substrings in a string.

We form a “replacement card” that can contain more than 3500 pairs of strings, and then apply it with StringUtils.replaceEach(text, searchList, replacementList) to large strings (several MB).

Keys and values are unique and in most cases have the same character length (but this is not something we can rely on).

Is there a more complicated approach to my task than StringUtils.replaceEach() ? Something that might be redundant for simple replacements solved by replaceEach() , but which is much faster in my "difficult" case.

+6

java performance string replace

okutane Nov 02 '16 at 18:03

source share

6 answers

You can use the regexp mechanism to efficiently match your keys to the input line and replace them.

First connect all of your keys using the interleave operator, for example:

 var keys = "keyA|keyB|keyC";

Then compile the template:

 Pattern pattern = Pattern.compile("(" + keys + ")")

Create a match for the input text:

 Matcher matcher= pattern.matcher(text);

Now apply your regular expression in a loop to find all the keys in the text and use appendReplacement (which is the "built-in" method of replacing strings) to replace all with their corresponding value:

 StringBuffer sb = new StringBuffer(); while (matcher.find()) { matcher.appendReplacement(sb,dictionary.get(matcher.group(0))); } matcher.appendTail(sb);

And here you go.

Please note that at first this may seem a bit naive, but the regexp mechanism is highly optimized for the task at hand, and since the Java regexp implementation also allows for “built-in” replacements, everything works very well.

I did a little test by applying a list of color names (~ 200 different color names), as defined in / usr / share / X 11 / rgb.txt against Crime and Punishment by Fedor Dostoevsky, I downloaded from Project Gutenberg (~ 1 MB in size ) using the technique described and it worked around

x12 times faster than StringUtils.replaceEach - 900ms vs 10700 ms

for the latter (not counting the compilation time of the template).

PS if your keys may contain characters unsafe for regexp, for example. ^ $ (), you must use Pattern.quote () before adding them to your template.

Sidenote:

This method will replace the keys in the order they appear in the list of templates, for example. "a => 1 | b => 2 | aa => 3" when applied to "welcome to the bazaar" will result in "welcome to b1z11r" rather than "welcome to b1z3r" if you want to get the most long match, you must sort your keys lexicographically before adding them to the template (ie "b | aa | a"). This also applies to your original StringUtils.replaceEach () method.

Update:

The method above should work well for the problem as stated in the original question, i.e. when the size of the card replacement is (relatively) small compared to the size of the input data.

If instead you have a very long dictionary applied to short text, the linear search performed by StringUtils.replaceEach () may be faster than it.

I made an additional landmark illustrating this, using a dictionary of 10,000 randomly selected words (+4 characters):

 cat /usr/share/dict/words | grep -E "^.{4,}$" | shuf | head -10000

against: 1024,2048,4096,8192,16384,32768,65536,131072,262144 and 524288 characters in long extracts from the same text “Crime and Punishment”.

The results are shown below:

 text Ta(ms) Tb(ms) Ta/Tb(speed up) --------------------------------------- 1024 99 240 0.4125 2048 43 294 0.1462585 4096 113 721 0.1567267 8192 128 1329 0.0963130 16384 320 2230 0.1434977 32768 2052 3708 0.5533981 65536 6811 6650 1.0242106 131072 32422 12663 2.5603728 262144 150655 23011 6.5470862 524288 614634 29874 20.574211

Ta - StringUtils.replaceEach () time
Tb - matcher.appendReplacement () time

Please note that the string length of the template is 135537 bytes (all 10000 keys are combined)

+5

zeppelin Nov 02 '16 at 20:01

source share

First of all - if you are talking about optimization, publish profiling results. This is the only reliable source of information about what should be optimized (see the Third optimization rule ).

If you determine that string operations take the most time, then there are two things to keep in mind.

First of all, Java strings are immutable. Each time you call replace, you create a new line, which, in your case, most likely means a lot of memory allocation. Since then, Java has gotten better with this, but if you can skip it, do it. I checked StringUtils.replaceEach uses a buffer and should be relatively memory efficient. In addition, especially with the custom search algorithm from the second note, you can implement your own replacement solution. A custom solution might be to create your own char buffer for efficient replacement using StringBuilder / StringBuffer for replacement (you will need to keep track of substitution lengths, since calling .toString() before each search in StringBuffer will be just as inefficient as replacing strings manually).

Secondly, there is a search algorithm . I do not know that Apache StringUtils , but the default implementation of Java is not optimal. You can use a separate library for searching .

0

Dariusz Nov 02 '16 at 18:25

source share

StringUtils uses the O(n * m) algorithm (for each word that needs to be replaced, make a replacement in the input). When m (the number of words to be replaced) is small, it is effectively O(n) (input size).

However, with a “large” number of substitutions to check, you will probably be better off processing each input word that will be completed in O(n) time.

 Map<String, String> subs = new HashMap<>(); // populated String replaced = Arrays.stream(input.split("\\b")) // O(n) .map(w -> subs.getOrDefault(w, w)) // O(1) .collect(Collectors.joining("")); // O(n)

Separation at word boundaries not only preserves spaces (without consuming input), but also makes the code pretty simple.

0

Bohemian Nov 02 '16 at 18:42

source share

The best method to solve this situation: Precompilation of source lines into code. Scanning each of the source lines for replacement keys; split a string into a series of code fragments with the function of inserting the key result into a stream For example: The following source line:

 The quick $brown $fox jumped over the $lazy dog.

becomes

 public StringBuilder quickBrown(Map<String, String> dict) { StringBuilder sb = new StringBuilder(); sb.append("The quick "); sb.append(dict.getOrElse("$brown", "brown")); sb.append(" "); sb.append(dict.getOrElse("$fox", "fox")); sb.append(" jumped over the "); sb.append(dict.getOrElse("$lazy", "lazy"); sb.append(" dog."); return sb; }

Then you call the method corresponding to the specific line, with the dictionary of mappings that you want to replace.

Note that with "scan" and "translate" I mean using the program to generate Java code, and then dynamically load the compiled class files as needed.

0

Bob dalgleish Nov 02 '16 at 18:47

source share

The slowest part of this algorithm finds all matches. Replacing is simple if done in a reasonable way (that is, in a temporary char buffer, only shifting each character at most once).

So, your question simplifies the "multi-line search", which is already a well-studied problem. You may find a good summary of the approaches on this subject - but a one-line summary is "grep does a good job."

Zeppelin has already shown a reasonable loop for this - the behavior of appendReplacement ensures that you will not move things unnecessarily (which will degrade this to O(n) ).

0

BeeOnRope Nov 02 '16 at 20:19

source share

okutane · Accepted Answer · 2016-11-13T16:56:21+0000

Although the appendReplacement solution proposed by @zeppelin was unexpectedly fast on the “heaviest piece of data”, it turned out to be a nightmare with a larger map.

The best solution so far has turned out to be an integral part of what we had (StringUtils.replaceEach) and what was proposed:

 protected BackReplacer createBackReplacer(Map<ReplacementKey, String> replacementMap) { if (replacementMap.isEmpty()) { return new BackReplacer() { @Override public String backReplace(String str) { return str; } }; } if (replacementMap.size() > MAX_SIZE_FOR_REGEX) { final String[] searchStrings = new String[replacementMap.size()]; final String[] replacementStrings = new String[replacementMap.size()]; int counter = 0; for (Map.Entry<ReplacementKey, String> replacementEntry : replacementMap.entrySet()) { searchStrings[counter] = replacementEntry.getValue(); replacementStrings[counter] = replacementEntry.getKey().getValue(); counter++; } return new BackReplacer() { @Override public String backReplace(String str) { return StringUtils.replaceEach(str, searchStrings, replacementStrings); } }; } final Map<String, String> replacements = new HashMap<>(); StringBuilder patternBuilder = new StringBuilder(); patternBuilder.append('('); for (Map.Entry<ReplacementKey, String> entry : replacementMap.entrySet()) { replacements.put(entry.getValue(), entry.getKey().getValue()); patternBuilder.append(entry.getValue()).append('|'); } patternBuilder.setLength(patternBuilder.length() - 1); patternBuilder.append(')'); final Pattern pattern = Pattern.compile(patternBuilder.toString()); return new BackReplacer() { @Override public String backReplace(String str) { if (str.isEmpty()) { return str; } StringBuffer sb = new StringBuffer(str.length()); Matcher matcher = pattern.matcher(str); while (matcher.find()) { matcher.appendReplacement(sb, replacements.get(matcher.group(0))); } matcher.appendTail(sb); return sb.toString(); } }; }

StringUtils Algorithm (MAX_SIZE_FOR_REGEX = 0):

type = TIMER, name = *. run, count = 8127, min = 4.239809, max = 4235197.925261, mean = 645.736554, stddev = 47197.97968925558, duration_unit = milliseconds

appendReplace algorithm (MAX_SIZE_FOR_REGEX = 1000000):

type = TIMER, name = *. run, count = 8155, min = 4.374516, max = 7806145.439165999, mean = 1145.757953, stddev = 86668.38562815856, duration_unit = milliseconds

Mixed solution (MAX_SIZE_FOR_REGEX = 5000):

type = TIMER, name = *. run, count = 8155, min = 3.5862789999999998, max = 376242.25076799997, mean = 389.68986564688714, stddev = 11733.9997814448, duration_unit = milliseconds

Our data:

 type=HISTOGRAM, name=initialValueLength, count=569549, min=0, max=6352327, mean=6268.940661478599, stddev=198123.040651236, median=12.0, p75=16.0, p95=32.0, p98=854.0, p99=1014.5600000000013, p999=6168541.008000023 type=HISTOGRAM, name=replacementMap.size, count=8155, min=0, max=65008, mean=73.46108949416342, stddev=2027.471388983965, median=4.0, p75=7.0, p95=27.549999999999955, p98=55.41999999999996, p99=210.10000000000036, p999=63138.68900000023

This change reduced the time in StringUtils.replaceEach in the previous solution and gave us a 25% performance improvement in our module, which was mainly related to IO.

Replacing multiple substrings in large strings

More articles: