Efficiently removing specific characters (some punctuation) from strings in Java?

In Java, what is the most efficient way to remove given characters from a String? I currently have this code:

    private static String processWord(String x) {
        String tmp;
        tmp = x.toLowerCase();
        tmp = tmp.replace(",", "");
        tmp = tmp.replace(".", "");
        tmp = tmp.replace(";", "");
        tmp = tmp.replace("!", "");
        tmp = tmp.replace("?", "");
        tmp = tmp.replace("(", "");
        tmp = tmp.replace(")", "");
        tmp = tmp.replace("{", "");
        tmp = tmp.replace("}", "");
        tmp = tmp.replace("[", "");
        tmp = tmp.replace("]", "");
        tmp = tmp.replace("<", "");
        tmp = tmp.replace(">", "");
        tmp = tmp.replace("%", "");
        return tmp;
    }

Would it be faster if I used some sort of StringBuilder, or a regular expression, or maybe something else? Yes, I know: profile it and see. But I hope someone can give an answer off the top of their head, as this is a common task.

7 answers

Here is a late answer, just for fun.

In cases like this, I would suggest aiming for readability over speed. Of course, you can be super readable but too slow, as in this super-concise version:

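    private static String processWord(String x) {
        // Both brackets are escaped: an unescaped [ inside a Java character
        // class opens a nested class and the pattern fails to compile.
        return x.replaceAll("[\\]\\[(){},.;!?<>%]", "");
    }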

This is slow because the regex is compiled every time you call this method. So you can precompile the regular expression instead:

    // requires: import java.util.regex.Pattern;
    private static final Pattern UNDESIRABLES =
        Pattern.compile("[\\]\\[(){},.;!?<>%]");

    private static String processWord(String x) {
        return UNDESIRABLES.matcher(x).replaceAll("");
    }

This should be fast enough for most purposes, assuming the JVM regex engine optimizes character class searches. This is a solution that I would use personally.

Now, without profiling, I don't know whether you could do better by building your own character (actually code point) lookup table:

    private static final boolean[] CHARS_TO_KEEP = new boolean[...]; // table size left open here

Fill it once, and then iterate over the input, building your result string. I'll leave the code to you. :)
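For anyone who wants to try it, here is a minimal sketch of what such a lookup table could look like. It is not the answerer's code: the class name LookupTableFilter and the table size of 128 (plain ASCII only) are my assumptions, and anything outside the table is simply kept.

```java
import java.util.Arrays;

// Hypothetical sketch of the lookup-table idea; names and table size are assumptions.
final class LookupTableFilter {
    // Assumption: all characters to remove are ASCII, so 128 entries are enough.
    private static final boolean[] CHARS_TO_KEEP = new boolean[128];

    static {
        // Keep everything by default, then mark the undesirable characters.
        Arrays.fill(CHARS_TO_KEEP, true);
        for (char c : ",.;!?(){}[]<>%".toCharArray()) {
            CHARS_TO_KEEP[c] = false;
        }
    }

    static String processWord(String x) {
        StringBuilder sb = new StringBuilder(x.length());
        for (int i = 0; i < x.length(); i++) {
            char c = x.charAt(i);
            // Characters beyond the table (non-ASCII) are always kept.
            if (c >= CHARS_TO_KEEP.length || CHARS_TO_KEEP[c]) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "filtered string"
        System.out.println(processWord("f,il;t!e?r(e)d {s}t[r]i<n>g%"));
    }
}
```

Whether this actually beats the precompiled pattern is something only a measurement on your own data can tell.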

Again, I would not dive into this kind of optimization. The code becomes too hard to read. Is performance really that much of a concern? Also remember that modern languages are JIT-compiled, and after warming up they will perform better, so use a good profiler.

One thing that should be mentioned is that the example in the original question is very inefficient, because it creates a whole bunch of temporary strings! Unless the compiler optimizes all of that away, that particular solution will perform the worst.
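If you want a quick feel for the difference before reaching for a real profiler, a rough timing loop along these lines can help. Treat it as a sketch only: the class RoughTiming, the iteration counts, and the test input are assumptions of mine, and hand-rolled timings like this are easily skewed by the JIT, so a dedicated harness such as JMH would be more trustworthy.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical, very rough comparison of chained replace() vs. a precompiled pattern.
final class RoughTiming {
    private static final Pattern UNDESIRABLES =
        Pattern.compile("[\\]\\[(){},.;!?<>%]");

    private static final String[] TO_REMOVE =
        {",", ".", ";", "!", "?", "(", ")", "{", "}", "[", "]", "<", ">", "%"};

    // Same idea as the question's code (lower-casing omitted so that
    // both variants do the same work).
    static String chained(String x) {
        String tmp = x;
        for (String s : TO_REMOVE) {
            tmp = tmp.replace(s, "");
        }
        return tmp;
    }

    static String precompiled(String x) {
        return UNDESIRABLES.matcher(x).replaceAll("");
    }

    // Returns elapsed nanoseconds for the given number of calls.
    static long time(UnaryOperator<String> f, String input, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += f.apply(input).length(); // use the result so it is not optimized away
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) throw new IllegalStateException("unreachable");
        return elapsed;
    }

    public static void main(String[] args) {
        String input = "f,il;t!e?r(e)d {s}t[r]i<n>g%";
        // Warm-up runs so the JIT has a chance to compile both methods.
        time(RoughTiming::chained, input, 100_000);
        time(RoughTiming::precompiled, input, 100_000);
        System.out.println("chained:     " + time(RoughTiming::chained, input, 1_000_000) / 1_000_000 + " ms");
        System.out.println("precompiled: " + time(RoughTiming::precompiled, input, 1_000_000) / 1_000_000 + " ms");
    }
}
```

The numbers will vary from machine to machine, so no expected output is given here.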


Although \\p{Punct} matches a wider range of characters than those in the question, it does allow for a shorter replacement expression:

 tmp = tmp.replaceAll("\\p{Punct}+", ""); 
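To make "a wider range" concrete, here is a small sketch comparing the two character sets. The class PunctDemo and the sample sentence are made up for illustration. In Java, \p{Punct} covers the ASCII punctuation characters, so it also strips things like apostrophes, hyphens and #, which the question's list leaves alone.

```java
// Hypothetical comparison of \p{Punct} with the character list from the question.
final class PunctDemo {
    // Removes every ASCII punctuation character.
    static String stripPunct(String s) {
        return s.replaceAll("\\p{Punct}+", "");
    }

    // Removes only the characters listed in the question.
    static String stripListed(String s) {
        return s.replaceAll("[\\]\\[(){},.;!?<>%]", "");
    }

    public static void main(String[] args) {
        String input = "it's a well-known #tag (50% off)!";
        System.out.println(stripPunct(input));  // prints "its a wellknown tag 50 off"
        System.out.println(stripListed(input)); // prints "it's a well-known #tag 50 off"
    }
}
```

If apostrophes or hyphens matter for your words, the explicit list is probably the safer choice.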

You can do something like this:

    static String RemovePunct(String input) {
        char[] output = new char[input.length()];
        int i = 0;
        for (char ch : input.toCharArray()) {
            // Keep only letters, digits and whitespace.
            if (Character.isLetterOrDigit(ch) || Character.isWhitespace(ch)) {
                output[i++] = ch;
            }
        }
        return new String(output, 0, i);
    }

    // ...
    String s = RemovePunct("This is (a) test string.");

This will most likely perform better than regular expressions, if you find that they are too slow for your needs.

However, it can get messy fast if you have a long, distinct list of special characters that you want to remove. In that case, regular expressions are easier to handle.

http://ideone.com/mS8Irl


Strings are immutable, so it's not a good idea to modify them over and over. Try using a StringBuilder instead of a String and use all of its wonderful methods! It will let you do anything you want. Plus, yes, if there is a specific thing you're trying to do, figure out the regular expression for it and it will work much better for you.
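As a rough illustration of the StringBuilder suggestion, the question's method could be written as a single pass like the sketch below. The class name BuilderFilter is hypothetical. One assumption to be aware of: Character.toLowerCase works per character, so it may differ from String.toLowerCase() for a few locale-specific or special-case mappings.

```java
// Hypothetical single-pass version of processWord using a StringBuilder.
final class BuilderFilter {
    private static final String UNDESIRABLES = ",.;!?(){}[]<>%";

    static String processWord(String x) {
        StringBuilder sb = new StringBuilder(x.length());
        for (int i = 0; i < x.length(); i++) {
            char c = x.charAt(i);
            // Skip unwanted characters; lower-case everything else as it is appended.
            if (UNDESIRABLES.indexOf(c) < 0) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(processWord("Hello, (World)!")); // prints "hello world"
    }
}
```

This creates one builder and one result String instead of a temporary String per replace call.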


Use String#replaceAll(String regex, String replacement), as follows:

    tmp = tmp.replaceAll("[,.;!?(){}\\[\\]<>%]", "");

    System.out.println(
        "f,il;t!e?r(e)d {s}t[r]i<n>g%".replaceAll("[,.;!?(){}\\[\\]<>%]", ""));
    // prints "filtered string"

Right now your code scans every character of tmp once for each character you want to remove, so the total work is roughly (number of characters in tmp) x (number of characters you want to remove).

To optimize this, you can use the short-circuit OR operator (||) and do something like:

    StringBuilder sb = new StringBuilder();
    for (char c : tmp.toCharArray()) {
        if (!(c == ',' || c == '.' || c == ';' || c == '!' || c == '?'
                || c == '(' || c == ')' || c == '{' || c == '}'
                || c == '[' || c == ']' || c == '<' || c == '>' || c == '%'))
            sb.append(c);
    }
    tmp = sb.toString();

or like this:

    StringBuilder sb = new StringBuilder();
    char[] badChars = ",.;!?(){}[]<>%".toCharArray();

    outer:
    for (char strChar : tmp.toCharArray()) {
        for (char badChar : badChars) {
            if (badChar == strChar)
                continue outer; // skip strChar since it is a bad character
        }
        sb.append(strChar);
    }
    tmp = sb.toString();

This way you iterate over the characters of tmp only once, and the number of comparisons per character can be smaller than the full list: % needs all of them because it is compared last, but if the character is , the program gets its result after the very first comparison.
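If the number of comparisons per character is what worries you, one further variation is to keep the bad characters sorted and binary-search them, which caps the work at about four comparisons for a list of 14. This is only a sketch of that idea, not something proposed above. The class BinarySearchFilter is a made-up name, and for such a short list the difference may well be too small to measure.

```java
import java.util.Arrays;

// Hypothetical variant: binary search over a sorted array of bad characters.
final class BinarySearchFilter {
    private static final char[] BAD_CHARS = sortedChars(",.;!?(){}[]<>%");

    private static char[] sortedChars(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars); // binarySearch requires a sorted array
        return chars;
    }

    static String strip(String tmp) {
        StringBuilder sb = new StringBuilder(tmp.length());
        for (char c : tmp.toCharArray()) {
            // A negative result means c is not one of the bad characters.
            if (Arrays.binarySearch(BAD_CHARS, c) < 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("f,il;t!e?r(e)d {s}t[r]i<n>g%")); // prints "filtered string"
    }
}
```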


If I'm not mistaken, the regex engine takes a similar approach for a character class ([...]), so you could also try it this way:

    // Store the compiled pattern somewhere so you won't need to compile it again.
    Pattern p = Pattern.compile("[,.;!?(){}\\[\\]<>%]");
    tmp = p.matcher(tmp).replaceAll("");

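You can do this:

    tmp = tmp.replaceAll("\\W", ""); // assign the result: Strings are immutable

This removes punctuation, but note that \\W matches every non-word character, so whitespace is removed as well.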
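A short sketch of what \W does in practice, next to a variant that keeps whitespace. The class NonWordDemo and the pattern [^\w\s] are my additions, not part of the answer. By default \w means letters, digits and underscore in ASCII only, so accented letters would be removed too.

```java
// Hypothetical demo of \W versus a whitespace-preserving alternative.
final class NonWordDemo {
    // Removes everything except [a-zA-Z0-9_], including spaces.
    static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    // Assumption: whitespace should survive, so only non-word, non-space characters go.
    static String stripNonWordKeepSpaces(String s) {
        return s.replaceAll("[^\\w\\s]", "");
    }

    public static void main(String[] args) {
        String input = "This is (a) test string.";
        System.out.println(stripNonWord(input));           // prints "Thisisateststring"
        System.out.println(stripNonWordKeepSpaces(input)); // prints "This is a test string"
    }
}
```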


Source: https://habr.com/ru/post/948962/

