SAX XML Java Entities Problem

I have a problem with SAX in Java.

I am parsing the dblp digital library database XML file (which lists journals, conferences, and documents). The XML file is very large (> 700 MB).

My problem is that when the characters() callback is invoked and the text contains entities, the method returns only part of the string: the part that starts at the last entity found.

For example, R&uuml;diger Mecke is the original name of the author, enclosed in <author> tags.

üdiger Mecke - result

(The string is what characters(char[] ch, int start, int length) delivers.)

I'd like to know:

  • How can I prevent the parser from automatically resolving entities?
  • How can I solve the truncated-text problem described above?
2 answers

characters() is not guaranteed to deliver all of an element's character data in a single call. From the Javadoc:

"The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks."

You need to accumulate the characters delivered across all of the calls, for example:

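private final StringBuilder tempValue = new StringBuilder();

@Override
public void startElement(String uri, String localName, String qName, Attributes attributes)
{
    tempValue.setLength(0); // clear the buffer at the start of each element
}

@Override
public void characters(char[] ch, int start, int length)
{
    tempValue.append(ch, start, length); // append this chunk to the buffer
}

@Override
public void endElement(String uri, String localName, String qName)
{
    String value = tempValue.toString(); // complete text of the element
}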
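
For something that can be compiled and run as-is, here is a minimal sketch of the same buffering idea. The class name AuthorTextDemo, the method parseAuthor, and the tiny inline document are made up for illustration. The real dblp.xml declares &uuml; in its external DTD, whereas this sample declares it in an internal subset so that the snippet is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class AuthorTextDemo {
    // Hypothetical sample document (not taken from dblp.xml).
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE dblp [<!ENTITY uuml \"&#252;\">]>\n"
        + "<dblp><author>R&uuml;diger Mecke</author></dblp>";

    // Returns the full text of the last <author> element in the document.
    static String parseAuthor(String xml) {
        final StringBuilder buffer = new StringBuilder();
        final String[] result = new String[1];

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                buffer.setLength(0); // new element: start with an empty buffer
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, e.g. once
                // before, once for, and once after an entity reference.
                buffer.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("author".equals(qName)) {
                    result[0] = buffer.toString(); // text is complete only here
                }
            }
        };

        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return result[0];
    }
}
```

Calling AuthorTextDemo.parseAuthor(AuthorTextDemo.SAMPLE) yields the whole name, including the leading "R" that goes missing when only the last characters() chunk is kept.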
  • I do not think you can disable entity resolution.

  • The characters() method can be called several times for a single element, so you need to collect the characters across those calls rather than expect them to arrive all at once.
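
Even if resolution itself stays on, a possible workaround for the first question is to watch the entity boundaries. SAX2 defines org.xml.sax.ext.LexicalHandler, whose startEntity() and endEntity() callbacks bracket the expanded text. The sketch below assumes the parser reports those events for declared general entities (the parser bundled with the JDK is expected to, but this is parser-dependent), and uses that to put the original reference back into the collected text. RawEntityDemo, parseAuthorRaw, and the inline sample are hypothetical names for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class RawEntityDemo {
    // Hypothetical sample document (not taken from dblp.xml).
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE dblp [<!ENTITY uuml \"&#252;\">]>\n"
        + "<dblp><author>R&uuml;diger Mecke</author></dblp>";

    // Returns the <author> text with entity references left unexpanded.
    static String parseAuthorRaw(String xml) {
        final StringBuilder buffer = new StringBuilder();
        final String[] result = new String[1];

        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            private int entityDepth = 0; // > 0 while inside an expanded entity

            private boolean isGeneralEntity(String name) {
                // Skip the external DTD subset and parameter entities.
                return !name.equals("[dtd]") && !name.startsWith("%");
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                buffer.setLength(0);
            }

            @Override
            public void startEntity(String name) {
                if (!isGeneralEntity(name)) return;
                if (entityDepth++ == 0) {
                    buffer.append('&').append(name).append(';'); // keep the reference
                }
            }

            @Override
            public void endEntity(String name) {
                if (isGeneralEntity(name)) entityDepth--;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (entityDepth == 0) {
                    buffer.append(ch, start, length); // drop the expanded text
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("author".equals(qName)) result[0] = buffer.toString();
            }
        };

        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return result[0];
    }
}
```

The parser still resolves the entity here; the handler only ignores the replacement text and writes the reference back, which may be enough if the goal is to keep names as they appear in the source file.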


Source: https://habr.com/ru/post/1782625/

