Remove characters unsuitable for UTF-8 encoding from String

I have a text area on the site where the user can write anything. The problem occurs when a user pastes text that contains non-UTF-8 characters and submits it to the server.

Java handles it fine because it uses UTF-16 internally, but my MySQL table uses UTF-8, so the insert fails.

I tried to handle this in the business logic itself by removing any characters that are not suitable for UTF-8 encoding.

I am currently using this code:

new String(java.nio.charset.Charset.forName("UTF-8").encode(myString).array()); 

But it replaces the characters that are not suitable for UTF-8 with other obscure characters, which is no good for the end user either. Can anyone suggest a possible solution to this problem in Java?

EDIT: For example, the exception I received while inserting such values

 java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x8A\x0D\x0A...' for column
 java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x80\xF0\x9F...' for column
+5
4 answers

UTF-8 is not a character set; it is a character encoding like UTF-16.

UTF-8 is able to encode any Unicode character and any Unicode text into a sequence of bytes, so there are no characters that are not suitable for UTF-8.

You use the String constructor that takes only a byte array, String(byte[] bytes), which according to the Javadoc:

Creates a new String by decoding the specified byte array using the default platform encoding.

It uses the platform's default encoding to interpret the bytes (to convert bytes to characters). Do not use this. Instead, when converting a byte array to a String, specify the encoding explicitly with the String(byte[] bytes, Charset charset) constructor.
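
As a minimal sketch of that advice, the round trip below names the charset on both the encoding and the decoding side. The class and method names are hypothetical and chosen only for illustration. Note also that the snippet in the question calls array() on the ByteBuffer, which returns the whole backing array and may include unused trailing bytes; getBytes avoids that.

```java
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {
    // Encodes the text to UTF-8 bytes and decodes them again,
    // naming the charset explicitly on both sides.
    static String roundTrip(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Umlaut, sharp s, and an emoji written as a surrogate pair
        String original = "Gr\u00FC\u00DFe \uD83D\uDE0A";
        System.out.println(original.equals(roundTrip(original))); // prints "true"
    }
}
```

Because no platform default is involved, the result should be the same on every machine.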

If you have problems with certain characters, this is most likely due to different character sets or encodings being used on the server side and the client side (browser + HTML). Make sure you use UTF-8 everywhere, do not mix encodings, and do not rely on the platform's default encoding.

Some pointers on how to achieve this:

How to get UTF-8 to work in Java Webapps?

+7

Perhaps the CharsetDecoder answer to this question helps. You can set the CodingErrorAction to REPLACE and choose the replacement string, "?" in my example. Invalid byte sequences are then replaced with that string. In this example, the "UTF-8 decoder capability and stress test" file is read and decoded:

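    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Decoder that substitutes "?" for invalid input instead of throwing
    CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
    utf8Decoder.onMalformedInput(CodingErrorAction.REPLACE);
    utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    utf8Decoder.replaceWith("?");

    // Read stress file
    Path path = Paths.get("<path>/UTF-8-test.txt");
    byte[] data = Files.readAllBytes(path);
    ByteBuffer input = ByteBuffer.wrap(data);

    // UTF-8 decoding
    CharBuffer output = utf8Decoder.decode(input);

    // Char buffer to string
    String outputString = output.toString();
    System.out.println(outputString);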
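
The same decoder setup can be tried without the test file. The sketch below wraps it in a hypothetical helper (the class and method names are not from the answer) and feeds it a small in-memory array containing one byte, 0xFF, that can never occur in valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class LenientUtf8Decoder {
    // Decodes bytes as UTF-8, substituting "?" for every invalid sequence.
    static String decodeReplacing(byte[] data) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("?");
        return decoder.decode(ByteBuffer.wrap(data)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] data = {'a', (byte) 0xFF, 'b'}; // 0xFF is never valid in UTF-8
        System.out.println(decodeReplacing(data)); // prints "a?b"
    }
}
```

This only helps when the input is a byte sequence of doubtful validity. A Java String that already holds valid text, emoji included, passes through it unchanged.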
+2

The problem in your code is that you are calling new String on a byte[]. The result of encode is a ByteBuffer, and the result of array on a ByteBuffer is a byte[]. The new String(byte[]) constructor uses your computer's default platform encoding, which can differ on every machine you run on, so that is not what you want. You should at least pass the charset as the second argument to the String constructor, although I'm not sure which charset you have in mind.

I'm not sure why you are doing this at all: if your database uses UTF-8, it will do the encoding for you. You just need to pass it ordinary, unencoded strings.

Both UTF-8 and UTF-16 can encode the entire Unicode 6 character set; there are no characters that can be encoded in UTF-16 but not in UTF-8. So that part of your question, unfortunately, cannot be answered as asked.
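
The bytes in the exception from the question, \xF0\x9F\x98\x8A, are the valid four-byte UTF-8 encoding of U+1F60A, an emoji outside the Basic Multilingual Plane. MySQL's legacy utf8 character set (utf8mb3) stores at most three bytes per character, while utf8mb4 covers the full range, so the likely cause is the column's character set. That is an assumption, since the question does not show the table definition. Switching the column and connection to utf8mb4 is usually the better fix. If that is not possible, a sketch like the following drops the characters such a column cannot store; the class and method names are hypothetical.

```java
final class SupplementaryFilter {
    // Keeps only code points that fit in three UTF-8 bytes or fewer:
    // everything in the Basic Multilingual Plane except unpaired surrogates.
    static String stripSupplementary(String text) {
        StringBuilder kept = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> Character.isBmpCodePoint(cp) && !Character.isSurrogate((char) cp))
            .forEach(kept::appendCodePoint);
        return kept.toString();
    }

    public static void main(String[] args) {
        // "\uD83D\uDE0A" is the surrogate pair for U+1F60A
        System.out.println(stripSupplementary("hi\uD83D\uDE0A!")); // prints "hi!"
    }
}
```

Iterating by code point rather than by char matters here: it removes both halves of a surrogate pair together instead of leaving one behind.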

For background:

+1

I think this may be useful for you: Easy way to remove UTF-8 accents from a string?

Try using Normalizer, like this:

 s = Normalizer.normalize(s, Normalizer.Form.NFD); 
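
On its own, NFD normalization removes nothing: it only splits an accented letter into a base letter followed by combining marks. A common follow-up step, shown here as a sketch with hypothetical names, is to delete those marks with a regular expression.

```java
import java.text.Normalizer;

final class AccentStripper {
    // Decomposes accented letters, then removes the combining marks,
    // leaving only the base letters.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("caf\u00E9")); // prints "cafe"
    }
}
```

This affects accented letters only. It does not remove emoji or other supplementary characters such as the ones in the exception from the question.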
0

Source: https://habr.com/ru/post/1210455/

