Java: remove punctuation from a String (including "" and the like), preserving accented characters

I need to remove the punctuation from a file while preserving accented characters. I tried the code below, but it does not work the way I want.

Expected: input => ’'qwe..,rty ‘èeéò’" "o" "à     output => qwertyèeéòoà

Actual result: input => ’'qwe..,rty ‘èeéò’" "o" "à   output => ’qwerty ‘èeéò’" "o" "à

I cannot remove characters such as ’ and "", among others.

Note: Eclipse and filetext.txt are both set to UTF-8.

Thanks.

import java.io.*;
import java.util.Scanner;

public class DataCounterMain {
    public static void main(String[] args) {
        File file = new File("filetext.txt");

        // try-with-resources closes the Scanner once reading is done
        try (Scanner filescanner = new Scanner(file)) {
            while (filescanner.hasNextLine()) {
                String line = filescanner.nextLine();
                line = line.replaceAll("\\p{Punct}", "");

                System.out.println(line);
            }
        } catch (FileNotFoundException e) {
            System.err.println(file + " FileNotFound");
        }
    }
}
1 answer

By default, the \p{Punct} character class matches only US-ASCII punctuation unless you enable Unicode character classes. This means that your code, as written, will only remove these characters:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
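To make the ASCII-only behaviour concrete, here is a minimal sketch that runs the same \p{Punct} class with and without Pattern.UNICODE_CHARACTER_CLASS. The class name PunctClassDemo, the helper names and the sample string are made up for illustration and are not part of the original code; non-ASCII characters are written as \u escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not taken from the question.
class PunctClassDemo {
    // Default mode: \p{Punct} covers only the ASCII punctuation characters.
    static final Pattern ASCII_PUNCT = Pattern.compile("\\p{Punct}");

    // Same class with Unicode character classes enabled: \p{Punct} now
    // follows the Unicode punctuation property.
    static final Pattern UNICODE_PUNCT =
            Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);

    static String stripAscii(String s) {
        return ASCII_PUNCT.matcher(s).replaceAll("");
    }

    static String stripUnicode(String s) {
        return UNICODE_PUNCT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Left and right curly single quotes around abc, then ASCII comma and apostrophes.
        String sample = "\u2018abc\u2019, 'def'";
        System.out.println(stripAscii(sample));
        System.out.println(stripUnicode(sample));
    }
}
```

With this sample, the first call should leave the curly quotes in place (‘abc’ def), while the second removes them as well (abc def). The same flag can also be switched on inside the regex itself with the embedded form (?U).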

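To match everything that Unicode classifies as punctuation, use \p{IsPunctuation} instead, which always checks the Unicode property and covers the punctuation in your example (and more!).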

To remove whitespace as well as punctuation, as in your example, use:

line = line.replaceAll("\\p{IsPunctuation}|\\p{IsWhite_Space}", "");
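As a quick check of the one-liner above, the following sketch applies the same expression to the sample input from the question. The class name PunctuationStripper and the method strip are hypothetical names chosen for this example, and the input is written with \u escapes (and straight double quotes, as they appear in the question text) so that it compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, shown only to exercise the answer's regex.
class PunctuationStripper {
    // Unicode punctuation or Unicode whitespace, exactly as in the one-liner.
    private static final Pattern PUNCT_OR_SPACE =
            Pattern.compile("\\p{IsPunctuation}|\\p{IsWhite_Space}");

    static String strip(String line) {
        return PUNCT_OR_SPACE.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        // The question's sample input: curly quotes are \u2018 and \u2019,
        // the accented letters are \u00e8, \u00e9, \u00f2 and \u00e0.
        String input = "\u2019'qwe..,rty \u2018\u00e8e\u00e9\u00f2\u2019\" \"o\" \"\u00e0";
        System.out.println(strip(input));
    }
}
```

On a UTF-8 console this should print qwertyèeéòoà, the output the question asks for. One difference from \p{Punct} worth keeping in mind: Unicode classifies characters such as $, +, <, =, >, ^, `, | and ~ as symbols rather than punctuation, so \p{IsPunctuation} leaves them alone. If those should go too, adding the symbol category \p{S} to the alternation is one option.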

Source: https://habr.com/ru/post/1689500/
