Tell SAX Parser to ignore invalid characters?

SAX keeps dying with the following exception:

Invalid byte 2 of 3-byte UTF-8 sequence

The problem is that the file is mostly valid UTF-8, but it contains a few invalid byte sequences. We cannot get a new version of the file; we must use this one.

So, how can we tell SAX to ignore invalid character sequences, or clean the file so that it no longer contains invalid UTF-8 sequences?

+3
5 answers

I would suggest that you clean the file as a completely separate step from parsing it as XML.

UTF-8 - ; - , UTF-8. , . , , , . , "UTF8ERROR" - , . .

, , ... .

, , , - , , . , ... .
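A minimal sketch of such a separate cleaning step, assuming the file is small enough to read into memory. The names `Utf8Cleaner`, `decodeLeniently` and `clean` are hypothetical; the decoder calls are standard `java.nio.charset` API. Each malformed byte sequence is replaced with U+FFFD, so the output is valid UTF-8 that can then be handed to the parser.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: rewrites a mostly-UTF-8 file as strictly valid UTF-8.
class Utf8Cleaner {

    // Decodes the bytes as UTF-8, substituting U+FFFD for every malformed sequence.
    public static String decodeLeniently(byte[] raw) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return decoder.decode(ByteBuffer.wrap(raw)).toString();
    }

    // Reads the damaged file, cleans it, and writes the result to a new file.
    public static void clean(Path in, Path out) throws IOException {
        String text = decodeLeniently(Files.readAllBytes(in));
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Searching the cleaned file for U+FFFD afterwards shows where the damage was, which makes it possible to review each spot by hand.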

+2

, SAX . InputStream, .

+3

SAX ( XML) ( ) XML. , , . .

( SAX HTML, XML, , ).

+1

, , , , :

XML UTF-8, ISO-8859-1. , UTF-8 String.getBytes(charset):

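public Document parseRequest(HttpServletRequest request)
      throws IOException, SAXException, ParserConfigurationException {
   DocumentBuilderFactory builder = DocumentBuilderFactory.newInstance();

   // Note: readUTF() expects Java's modified UTF-8 with a two-byte length prefix,
   // so this only works if the sender wrote the body with DataOutputStream.writeUTF().
   DataInputStream dataStream = new DataInputStream(request.getInputStream());
   String xml = dataStream.readUTF();
   // Re-encode the string as UTF-8 bytes for the parser.
   ByteArrayInputStream byteStream = new ByteArrayInputStream(xml.getBytes("UTF-8"));
   return builder.newDocumentBuilder().parse(byteStream);
}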

:.. :

public Document parseRequest(HttpServletRequest request)
      throws IOException, SAXException, ParserConfigurationException {
   DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();

   // Decode the bytes explicitly as UTF-8 and hand the parser a character stream.
   Reader reader = new InputStreamReader(request.getInputStream(), "UTF-8");
   InputSource source = new InputSource(reader);
   return domFactory.newDocumentBuilder().parse(source);
}
0

Could you use a java.nio.charset.CharsetDecoder with InputStreamReader(InputStream in, CharsetDecoder dec)?

How a decoding error is handled depends upon the action requested for that type of error, which is described by an instance of the CodingErrorAction class. The possible error actions are to ignore the erroneous input, report the error to the invoker via the returned CoderResult object, or replace the erroneous input with the current value of the replacement string. The replacement has the initial value "\uFFFD"; its value may be changed via the replaceWith method.

(from the CharsetDecoder Javadoc)
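A minimal sketch of that idea, assuming a standard JAXP SAX parser. The class name `LenientSaxParse` and its `parse` method are hypothetical; the decoder and `InputStreamReader` calls are the ones quoted above. The decoder replaces malformed input with U+FFFD, and because the parser receives a character stream it no longer decodes the bytes itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: feeds SAX through a decoder that tolerates bad UTF-8.
class LenientSaxParse {

    public static void parse(InputStream in, DefaultHandler handler)
            throws IOException, SAXException, ParserConfigurationException {
        // REPLACE substitutes U+FFFD for bad bytes; IGNORE would drop them instead.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        Reader reader = new InputStreamReader(in, decoder);

        // With a character-stream InputSource the parser skips its own byte decoding.
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(reader), handler);
    }
}
```

This only repairs the character encoding. If a damaged byte sequence happened to swallow part of the markup, the parser may still reject the document as not well-formed.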

0

Source: https://habr.com/ru/post/1720425/

