How to remove SUB control character (HEX: 1A) in Java using regular expressions?

Question

I have a file with bad data (a few random SUB control characters that stand on their own; they are not part of a grapheme), and I tried to delete them using a regular expression search and replace:

Text to Find: \x1a Replace with:

This removes my SUB characters, but it also messes up my other accented characters (é and í).

Is there a regular expression that will remove the SUB control character (code point) only when it is on its own (i.e. not part of a grapheme)?

SAMPLE DATA (replace each “␚” with the SUB control character):

 A,André,Fernandez
 A,Daniel,O␚Shea
 A,Ibhlín,Flanders
 A,Donny,O␚'Donnell
 A,Spencer,O'Maley

SAMPLE DATA OUTPUT if I use the current regular expression:

 A,Andr ,Fernandez
 A,Daniel,OShea
 A,Ibhl n,Flanders
 A,Donny,O'Donnell
 A,Spencer,O'Maley

DESIRED DATA OUTPUT:

 A,André,Fernandez
 A,Daniel,OShea
 A,Ibhlín,Flanders
 A,Donny,O'Donnell
 A,Spencer,O'Maley

+4

java regex

Colorado techie Aug 30 '13 at 18:28

2 answers

  Position  Decimal  Name                   Appearance
  0x241A    9242     SYMBOL FOR SUBSTITUTE  ␚

unicode chart

Perhaps this may help you, along with Unicode regular expressions.

+1

progrenhard Aug 30 '13 at 19:24

Colorado techie · Accepted Answer · 2013-09-05T16:53:35+0000

Jim Harrison's comment is the answer: the regular expression correctly removes the SUB character, but the encoding changes in the process, and that is what damages the accented characters.
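For anyone who wants to check this in plain Java, here is a minimal sketch (the class and method names are made up for illustration, and this is not what Boomi runs internally). Once the text is a correctly decoded Java String, the same `\x1A` pattern removes only the SUB character and leaves é and í alone:

```java
public class RemoveSub {
    // The SUB control character, U+001A (decimal 26).
    static final String SUB = String.valueOf((char) 0x1A);

    // Removes every SUB character; all other characters are left untouched.
    static String removeSub(String input) {
        // In a Java string literal the backslash must be doubled,
        // so the regex engine sees the pattern \x1A.
        return input.replaceAll("\\x1A", "");
    }

    public static void main(String[] args) {
        // Rows modeled on the question's sample data. The accented letter
        // is written as a Unicode escape so the source file stays ASCII.
        String[] rows = {
            "A,Andr\u00E9,Fernandez",
            "A,Daniel,O" + SUB + "Shea",
            "A,Donny,O" + SUB + "'Donnell"
        };
        for (String row : rows) {
            System.out.println(removeSub(row));
        }
    }
}
```

If the accents still break with code like this, the damage is happening when the file is read or written, not in the replacement itself.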

In addition, I use a product called Boomi, and I used its built-in "Search / Replace" function. It runs on Java under the hood, so I did not answer ajb's question about the Java code, since I did not know which code was executing.

Since we are facing this problem, we will consider writing some custom Java code to replace the character instead of using Boomi's built-in search / replace function.

THANKS for your help and pointing me in the right direction!

UPDATE: I just found a built-in function in Boomi: Character Decoding. Turns out I can control the encoding without writing any special Java code.
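For reference, the equivalent in custom Java would be to name the charset explicitly on both the decode and the encode step. This is only a sketch under assumptions: I am guessing the file is ISO-8859-1, `SubCleaner` and `clean` are hypothetical names, and I do not know how Boomi's Character Decoding step is implemented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SubCleaner {
    // Decodes the bytes with the given charset, drops every SUB (U+001A),
    // and encodes the result with the same charset.
    static byte[] clean(byte[] raw, Charset charset) {
        String text = new String(raw, charset);
        String cleaned = text.replace(String.valueOf((char) 0x1A), "");
        return cleaned.getBytes(charset);
    }

    public static void main(String[] args) {
        // Assumption: the source file is ISO-8859-1 encoded.
        Charset latin1 = StandardCharsets.ISO_8859_1;
        byte[] raw = ("A,Andr\u00E9,O" + (char) 0x1A + "Shea").getBytes(latin1);
        byte[] expected = "A,Andr\u00E9,OShea".getBytes(latin1);

        // Same charset in and out: only the SUB byte disappears.
        System.out.println(Arrays.equals(clean(raw, latin1), expected));

        // Wrong charset: the accented byte is not valid UTF-8 here,
        // so the result no longer matches the expected bytes.
        System.out.println(Arrays.equals(clean(raw, StandardCharsets.UTF_8), expected));
    }
}
```

The point is the same as with the Boomi setting: the replacement is harmless as long as the decode and encode steps agree with the file's real encoding.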
