Why does SAXParser read so much before throwing events?

Scenario: I receive a huge XML file over an extremely slow network, so I want to start processing it as early as possible. That is why I decided to use SAXParser.

I expected to receive an event as soon as a tag has been read completely.

The following test shows what I mean:

    @Test
    public void sax_parser_read_much_things_before_returning_events() throws Exception {
        String xml = "<a>"
                   + " <b>..</b>"
                   + " <c>..</c>"
                   // much more ...
                   + "</a>";

        // wrapper to show what is read
        InputStream is = new InputStream() {
            InputStream is = new ByteArrayInputStream(xml.getBytes());

            @Override
            public int read() throws IOException {
                int val = is.read();
                System.out.print((char) val);
                return val;
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(is, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
                System.out.print("\nHandler start: " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                System.out.print("\nHandler end: " + qName);
            }
        });
    }

I wrapped the input stream to see what is being read and when the events occur.

I was expecting something like this:

    <a>                 <- output from read()
    Handler start: a
    <b>                 <- output from read()
    Handler start: b
    </b>                <- output from read()
    Handler end: b
    ...

Unfortunately, the result was as follows:

    <a> <b>..</b> <c>..</c></a>   <- output from read()
    Handler start: a
    Handler start: b
    Handler end: b
    Handler start: c
    Handler end: c
    Handler end: a

Where is my mistake and how can I get the expected result?

Edit:

  • First of all, the parser tries to detect the version of the document, which causes it to scan everything. When the document declares a version, the reading is split up (but not where I expect).
  • It is not acceptable for the parser to "want" to read, say, 1000 bytes and block until it has them, because the stream may not contain that much data at that moment.
  • I found buffer sizes in XMLEntityManager:
    • public static final int DEFAULT_BUFFER_SIZE = 8192;
    • public static final int DEFAULT_XMLDECL_BUFFER_SIZE = 64;
    • public static final int DEFAULT_INTERNAL_BUFFER_SIZE = 1024;
2 answers

You seem to be making wrong assumptions about how I/O works. An XML parser, like most programs, requests data in chunks, because requesting single bytes from a stream is a recipe for poor performance.

This does not mean that the buffer must be completely filled before a read call returns. It is just that a ByteArrayInputStream cannot emulate the behavior of a network InputStream. You can easily fix this by overriding read(byte[], int, int) so that it does not return a full buffer but, for example, a single byte per request:

    @Test
    public void sax_parser_read_much_things_before_returning_events() throws Exception {
        final String xml = "<a>"
                         + " <b>..</b>"
                         + " <c>..</c>"
                         // much more ...
                         + "</a>";

        // wrapper to show what is read
        InputStream is = new InputStream() {
            InputStream is = new ByteArrayInputStream(xml.getBytes());

            @Override
            public int read() throws IOException {
                int val = is.read();
                System.out.print((char) val);
                return val;
            }

            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                return super.read(b, off, 1);
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(is, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
                System.out.print("\nHandler start: " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                System.out.print("\nHandler end: " + qName);
            }
        });
    }

This prints:

    <a>
    Handler start: a<b>
    Handler start: b..</b>
    Handler end: b <c>
    Handler start: c..</c>
    Handler end: c</a>
    Handler end: a?

The output shows how the XML parser adapts to the availability of data from the InputStream.
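The same experiment can be packaged as a small standalone program. The following is only a sketch: `TrickleDemo`, `TrickleInputStream`, and the `maxChunk` parameter are hypothetical names introduced here for illustration, and the demo assumes ASCII-only input so that chunks never split a multi-byte character. It caps how many bytes each read call may return and records reads and SAX events in one list, so their interleaving can be inspected.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class TrickleDemo {

    /** Hypothetical helper: returns at most maxChunk bytes per read call and logs them. */
    static final class TrickleInputStream extends FilterInputStream {
        private final int maxChunk;
        private final List<String> log;

        TrickleInputStream(InputStream in, int maxChunk, List<String> log) {
            super(in);
            this.maxChunk = maxChunk;
            this.log = log;
        }

        @Override
        public int read() throws IOException {
            int val = super.read();
            log.add(val < 0 ? "read: EOF" : "read: " + (char) val);
            return val;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Deliver a short read, no matter how large a buffer the caller offers.
            int n = in.read(b, off, Math.min(len, maxChunk));
            log.add(n < 0 ? "read: EOF" : "read: " + new String(b, off, n, StandardCharsets.UTF_8));
            return n;
        }
    }

    /** Parses xml through the trickling stream and returns reads and events in order. */
    static List<String> run(String xml, int maxChunk) {
        List<String> log = new ArrayList<>();
        InputStream in = new TrickleInputStream(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), maxChunk, log);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attributes) {
                    log.add("start: " + qName);
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    log.add("end: " + qName);
                }
            });
        } catch (Exception e) {
            // Wrapped only to keep the sketch short; real code should handle these properly.
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        run("<a> <b>..</b> <c>..</c></a>", 1).forEach(System.out::println);
    }
}
```

With `maxChunk` set to 1, the `start:` and `end:` entries should appear between the `read:` entries rather than after all of them. Raising `maxChunk` above the document length should reproduce the behavior from the question, where everything is read first.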


Internally, the SAX parser has most likely wrapped your InputStream in a BufferedReader or uses some other kind of buffering. Otherwise it would have to read single bytes from the input, which would really hurt performance.

So what you see is that the parser reads a chunk from the input, then processes that chunk, generates the SAX events for it, and so on.
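The buffering effect can be illustrated without a parser at all. The sketch below is an analogy, not the parser's actual internals: it uses the standard `BufferedInputStream` with an explicit 8192-byte buffer, and `BufferingDemo` and `RecordingStream` are hypothetical names. The consumer asks for a single byte, yet the underlying stream is asked for a whole buffer and hands over whatever it has.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BufferingDemo {

    /** Hypothetical helper: remembers the first bulk read request and its result. */
    static final class RecordingStream extends FilterInputStream {
        int requested = -1; // len passed to the first bulk read
        int returned = -1;  // number of bytes that read actually delivered
        private boolean seen = false;

        RecordingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int n = super.read(b, off, len);
            if (!seen) {
                seen = true;
                requested = len;
                returned = n;
            }
            return n;
        }
    }

    /** Returns {requested, returned} for the first read issued by the buffering layer. */
    static int[] firstRead(String xml) {
        RecordingStream raw = new RecordingStream(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        try (InputStream buffered = new BufferedInputStream(raw, 8192)) {
            buffered.read(); // the consumer only wants one byte
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new int[] { raw.requested, raw.returned };
    }

    public static void main(String[] args) {
        int[] r = firstRead("<a> <b>..</b> <c>..</c></a>");
        System.out.println("requested=" + r[0] + " returned=" + r[1]);
    }
}
```

Because a ByteArrayInputStream always has its whole content available, a short document is handed over in one go. That is why the question's test sees the complete input read before the first event.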


Source: https://habr.com/ru/post/1234109/

