Clarification needed: comparing the performance of several regular expressions

Question


Consider three regular expressions designed to remove non-Latin characters from a string.

    String x = "some†¥¥¶¶ˆ˚˚word";

    long now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]", ""));  // 5ms
    System.out.println(System.nanoTime() - now);

    now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]+", "")); // 2ms
    System.out.println(System.nanoTime() - now);

    now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]*", "")); // <1ms
    System.out.println(System.nanoTime() - now);

All three produce the same result, but with significantly different timings.

Why is this?

+4

java regex

Jam Feb 06 '12 at 3:32


2 answers

The last one will also replace empty strings with empty strings (unless that is optimized away; I don't know the regex compiler), which seems a bit unnecessary ;-)

The first will perform many more matches than the second if the non-Latin characters are adjacent; otherwise it will not. So I would guess that the times for 1 and 2 are about the same for some texts, and longer for 1 for others.
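One way to check the reasoning above is to count how many separate matches each pattern produces, since replaceAll performs one replacement per successful find(). The following is only a minimal sketch: the class name MatchCount and the helper countMatches are made up for illustration, and the question's string is written with Unicode escapes to avoid source-encoding problems.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCount {
    // Counts how many times find() succeeds, i.e. how many separate
    // replacements replaceAll would perform for this pattern and input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The question's string, with the symbols written as Unicode escapes.
        String x = "some\u2020\u00A5\u00A5\u00B6\u00B6\u02C6\u02DA\u02DAword";
        String[] regexes = {"[^a-zA-Z]", "[^a-zA-Z]+", "[^a-zA-Z]*"};
        for (String regex : regexes) {
            System.out.println(regex + " -> " + countMatches(regex, x) + " matches");
        }
    }
}
```

For the question's string this should report 8, 1 and 10 matches: the * pattern matches the run of symbols once, plus an empty string at every other position. For an input such as "a1b2c3", where the non-letters are not adjacent, the first two patterns both find 3 matches, which is the case where their timings would be expected to be similar.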

+1

Leo Feb 06 '12 at 3:42


Andrew Cooper · Accepted Answer · 2012-02-06T03:39:29+0000

The first is slower because the regular expression matches each non-Latin character individually, so replaceAll performs a separate replacement for every character.

The other patterns match the entire run of non-Latin characters, so replaceAll can replace the whole run in one go. I cannot explain the difference in performance between those two, though. It may have something to do with how the regular expression engine handles * and + differently.
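It is also worth noting that a single timed call, as in the question, includes one-time costs such as class loading and JIT compilation, so whichever pattern runs first tends to look slowest. A rough sketch of a fairer comparison follows; the class name RegexTiming, the helper averageNanos and the iteration count are illustrative assumptions. Unlike String.replaceAll, it compiles each pattern once outside the timed loop.

```java
import java.util.regex.Pattern;

class RegexTiming {
    // Written to on every iteration so the JIT cannot discard the work.
    static int sink;

    // Average nanoseconds per replaceAll call over the given number of iterations.
    static long averageNanos(String regex, String input, int iterations) {
        Pattern pattern = Pattern.compile(regex); // compiled once, not timed
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += pattern.matcher(input).replaceAll("").length();
        }
        return (System.nanoTime() - start) / iterations;
    }

    public static void main(String[] args) {
        // The question's string, with the symbols written as Unicode escapes.
        String x = "some\u2020\u00A5\u00A5\u00B6\u00B6\u02C6\u02DA\u02DAword";
        String[] regexes = {"[^a-zA-Z]", "[^a-zA-Z]+", "[^a-zA-Z]*"};
        for (String regex : regexes) {
            averageNanos(regex, x, 100_000); // warm-up run, result discarded
            long nanos = averageNanos(regex, x, 100_000);
            System.out.println(regex + ": about " + nanos + " ns per call");
        }
    }
}
```

The absolute numbers will vary by machine and JVM, so this is only a way to compare the three patterns under the same conditions; a dedicated benchmarking harness would give more trustworthy figures.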

