How to remove SUB control character (HEX: 1A) in Java using regular expressions?

I have a file with bad data: a few stray SUB control characters that stand on their own (they are not part of any grapheme). I tried to delete them using this regular expression search pattern:

Text to Find: \x1a Replace with: 

This removes my SUB characters, but it also messes up my other accented characters (é and í).

Is there a regular expression that will remove the SUB control character (code point) only when it stands on its own (i.e. when it is not part of a grapheme)?

SAMPLE DATA (replace each “␚” with the SUB control character):

 A,André,Fernandez
 A,Daniel,O␚Shea
 A,Ibhlín,Flanders
 A,Donny,O␚'Donnell
 A,Spencer,O'Maley

SAMPLE DATA OUTPUT with the current regular expression:

 A,Andr ,Fernandez
 A,Daniel,OShea
 A,Ibhl n,Flanders
 A,Donny,O'Donnell
 A,Spencer,O'Maley

DESIRED DATA OUTPUT

 A,André,Fernandez
 A,Daniel,OShea
 A,Ibhlín,Flanders
 A,Donny,O'Donnell
 A,Spencer,O'Maley
2 answers

Jim Harrison's comment is the answer: the regular expression correctly removes the SUB character, but the encoding is changed in the process.
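To illustrate that point, here is a minimal Java sketch. It assumes the data is UTF-8; the class name `SubRemovalDemo` and the helper `stripSub` are made up for this example and have nothing to do with Boomi's internals. The regular expression itself only touches U+001A. The accented characters are damaged only when the bytes are decoded with the wrong charset, and the exact form of the damage depends on which charsets are involved.

```java
import java.nio.charset.StandardCharsets;

class SubRemovalDemo {
    // The SUB control character, code point U+001A.
    static final char SUB = (char) 0x1A;

    // Removes every SUB character; no other character is matched by the pattern.
    static String stripSub(String text) {
        return text.replaceAll("\\x1A", "");
    }

    public static void main(String[] args) {
        String line = "A,Andr\u00E9,Fernandez A,Daniel,O" + SUB + "Shea";
        byte[] utf8 = line.getBytes(StandardCharsets.UTF_8);

        // Decoded with the charset the bytes were written in: é survives.
        String good = stripSub(new String(utf8, StandardCharsets.UTF_8));

        // Decoded with a mismatched charset (US-ASCII here, chosen only for
        // illustration): the bytes of é are no longer read as é, even though
        // the regular expression is exactly the same.
        String bad = stripSub(new String(utf8, StandardCharsets.US_ASCII));

        System.out.println(good);
        System.out.println(bad);
    }
}
```

In both cases the SUB character is gone; only the second result has lost the é, which points at the decoding step rather than the pattern.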

In addition, I am using a product called Boomi, and I used Boomi's built-in "Search / Replace" function. It runs Java under the hood, so I did not answer ajb's question about the Java code, since I do not know which code is being executed.

To work around this problem, we will consider writing some custom Java code to replace the character instead of using Boomi's built-in search / replace function.
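Such custom code could look roughly like the sketch below. It is only an illustration under assumptions: the class `SubFileCleaner`, the method `clean`, and the command-line arguments are hypothetical, and the caller must supply the charset the file was actually written in. The key point is that the same explicit charset is used for both reading and writing, so nothing is re-encoded along the way.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class SubFileCleaner {
    // The SUB control character (U+001A) as a one-character string.
    static final String SUB = String.valueOf((char) 0x1A);

    // Reads 'in' using the given charset, drops every SUB character,
    // and writes the result to 'out' using the same charset.
    static void clean(Path in, Path out, Charset charset) throws IOException {
        String text = new String(Files.readAllBytes(in), charset);
        String cleaned = text.replace(SUB, "");
        Files.write(out, cleaned.getBytes(charset));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java SubFileCleaner input.csv output.csv
        // UTF-8 is an assumption; pass the file's real encoding instead.
        clean(Paths.get(args[0]), Paths.get(args[1]), StandardCharsets.UTF_8);
    }
}
```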

THANKS for your help and pointing me in the right direction!

UPDATE: I just found a built-in function in Boomi: Character Decoding. Turns out I can control the encoding without writing any special Java code.

  Position  Decimal  Name                   Appearance
  0x241A    9242     SYMBOL FOR SUBSTITUTE  ␚

unicode chart

Perhaps this may help you.

See also: Unicode regular expressions


Source: https://habr.com/ru/post/1499850/

