Performance comparison of different regular expressions: clarification needed

Consider three regular expressions designed to remove non-Latin characters from a string.

    String x = "some†¥¥¶¶ˆ˚˚word";
    long now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]", "")); // 5ms
    System.out.println(System.nanoTime() - now);
    now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]+", "")); // 2ms
    System.out.println(System.nanoTime() - now);
    now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]*", "")); // <1ms
    System.out.println(System.nanoTime() - now);

All three produce the same result, but with significantly different timings.

Why is this?

+4
2 answers

The first is slower because the regular expression matches each non-Latin character individually, so replaceAll performs a separate replacement for every character.

The other two patterns match an entire run of non-Latin characters, so replaceAll can replace the whole run in one go. However, I cannot explain the performance difference between those two. Perhaps it is related to how the regular expression engine handles * versus +.
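One way to check this explanation is to count the matches each pattern produces, since replaceAll performs one replacement per match. The following is only a minimal sketch: the class name MatchCount, the helper countMatches, and the sample string are illustrative assumptions, not part of the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class MatchCount {

    // Counts how many times the regex matches in the input (empty matches included).
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed sample: "some", five dagger characters (U+2020), "word".
        String x = "some\u2020\u2020\u2020\u2020\u2020word";
        System.out.println(countMatches("[^a-zA-Z]", x));  // one match per character
        System.out.println(countMatches("[^a-zA-Z]+", x)); // one match per run
    }
}
```

On this sample the single-character pattern matches five times and the + pattern matches once, which is consistent with the per-character versus per-run reasoning above. It says nothing by itself about the * versus + timing difference.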

+1

The last one will also replace empty strings with empty strings (unless that is optimized away; I don't know the compiler), which seems a bit unnecessary... ;-)
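The empty matches can be made visible by counting them directly. This is a sketch under assumptions: the class EmptyMatches, the helper countTotalAndEmpty, and the short sample string are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration only.
class EmptyMatches {

    // Returns {total matches, empty matches} for the regex on the input.
    static int[] countTotalAndEmpty(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int total = 0;
        int empty = 0;
        while (m.find()) {
            total++;
            if (m.start() == m.end()) { // zero-length match
                empty++;
            }
        }
        return new int[] {total, empty};
    }

    public static void main(String[] args) {
        // Assumed sample: 'a', two dagger characters (U+2020), 'b'.
        String x = "a\u2020\u2020b";
        int[] counts = countTotalAndEmpty("[^a-zA-Z]*", x);
        System.out.println("total=" + counts[0] + ", empty=" + counts[1]);
        // A visible replacement marker shows where the empty matches occur.
        System.out.println(x.replaceAll("[^a-zA-Z]*", "-"));
    }
}
```

For this sample the * pattern reports four matches, three of them empty, and the marker version prints -a--b-. With an empty replacement string those extra matches do not change the output, which is why all three patterns still give the same result.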

The first will perform many more matches than the second if the non-Latin characters are adjacent; otherwise it will not. So I suspect the times for 1 and 2 are roughly the same for some texts, and longer for 1 on other texts.
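To try this suggestion on different texts, it may help to average over many runs after a warm-up phase, because a single System.nanoTime() measurement around the first call also includes one-time costs such as class loading and pattern compilation. The harness below is a rough sketch, not a rigorous benchmark: the class RegexTiming, the method averageNanos, the iteration count, and both sample strings are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical timing harness for illustration; not a rigorous benchmark.
class RegexTiming {

    // Returns the average nanoseconds per replaceAll call, measured after a warm-up phase.
    static long averageNanos(Pattern pattern, String input, int iterations) {
        int sink = 0; // accumulates result lengths so the calls are not dead code
        for (int i = 0; i < iterations; i++) { // warm-up, not timed
            sink += pattern.matcher(input).replaceAll("").length();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) { // timed runs
            sink += pattern.matcher(input).replaceAll("").length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) {
            System.out.println(sink); // not expected to happen; keeps 'sink' observable
        }
        return elapsed / iterations;
    }

    public static void main(String[] args) {
        // Assumed samples: adjacent vs. scattered dagger characters (U+2020).
        String adjacent = "some\u2020\u2020\u2020\u2020word";
        String scattered = "s\u2020o\u2020m\u2020e\u2020word";
        Pattern single = Pattern.compile("[^a-zA-Z]");
        Pattern plus = Pattern.compile("[^a-zA-Z]+");
        for (String text : new String[] {adjacent, scattered}) {
            System.out.println(text
                    + ": single=" + averageNanos(single, text, 100_000) + " ns"
                    + ", plus=" + averageNanos(plus, text, 100_000) + " ns");
        }
    }
}
```

The numbers will vary by machine and JVM, so no expected output is given. For measurements that need to be trusted, a dedicated benchmarking tool would be a better choice than a hand-written loop.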

+1

Source: https://habr.com/ru/post/1394944/

