Which XML parser can I use here?

I receive an XML file as input, whose size can vary from a few KB to much larger. The file arrives over the network. I need to extract only a small number of nodes, so most of the document is of no use to me. I have no memory constraints; I just need speed.

Given all this, I concluded:

  • Not DOM (because the document may be huge, there is no CRUD requirement, and the source is the network).

  • Not SAX, since I only need a small subset of the data.

  • StAX might be the way to go, but I'm not sure whether it is the fastest option.

  • JAXB came up as another option, but which kind of parser does it use? I read that it uses Xerces by default (is that a push or a pull parser?), although I can configure it to use StAX or Woodstox according to this link.

I have read a lot but am still confused by the number of options. Any help would be appreciated.

Thanks!

Edit: here I want to add another question: what's wrong with using JAXB here?

+6
5 answers

The fastest solution by far is a StAX parser, especially since you only need a specific subset of the XML file. With StAX you can easily skip whatever you don't need, whereas with a SAX parser you would receive an event for it anyway.

But it is also a bit more complicated than using SAX or DOM. The other day, I had to write a StAX parser for the following XML:

    <?xml version="1.0"?>
    <table>
      <row>
        <column>1</column>
        <column>Nome</column>
        <column>Sobrenome</column>
        <column>email@gmail.com</column>
        <column></column>
        <column>2011-06-22 03:02:14.915</column>
        <column>2011-06-22 03:02:25.953</column>
        <column></column>
        <column></column>
      </row>
    </table>

Here's what the final parser code looks like:

    import java.io.File;
    import java.io.StringReader;
    import java.util.*;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamReader;
    // Apache Commons IO and Commons Lang; the package names below assume Commons Lang 2.x.
    import org.apache.commons.io.FileUtils;
    import org.apache.commons.lang.StringEscapeUtils;
    import org.apache.commons.lang.WordUtils;

    // Inscrito is the author's own bean from the linked repo (not shown here).
    public class Parser {

        private final String[] files;

        public Parser(String... files) {
            this.files = files;
        }

        private List<Inscrito> process() {
            List<Inscrito> inscritos = new ArrayList<Inscrito>();
            for (String file : files) {
                XMLInputFactory factory = XMLInputFactory.newFactory();
                try {
                    String content = StringEscapeUtils.unescapeXml(
                            FileUtils.readFileToString(new File(file)));
                    // A Reader avoids re-encoding the text with the platform default charset.
                    XMLStreamReader parser = factory.createXMLStreamReader(new StringReader(content));
                    String currentTag = null;
                    int columnCount = 0;
                    Inscrito inscrito = null;
                    while (parser.hasNext()) {
                        switch (parser.next()) {
                            case XMLStreamReader.START_ELEMENT:
                                currentTag = parser.getLocalName();
                                if ("row".equals(currentTag)) {
                                    columnCount = 0;
                                    inscrito = new Inscrito();
                                }
                                break;
                            case XMLStreamReader.END_ELEMENT:
                                currentTag = parser.getLocalName();
                                if ("row".equals(currentTag)) {
                                    inscritos.add(inscrito);
                                }
                                if ("column".equals(currentTag)) {
                                    columnCount++;
                                }
                                // Reset so whitespace between elements is not read as column text.
                                currentTag = null;
                                break;
                            case XMLStreamReader.CHARACTERS:
                                if ("column".equals(currentTag)) {
                                    String text = parser.getText().trim().replaceAll("\n", " ");
                                    // Only the first four columns are mapped; the rest are skipped.
                                    switch (columnCount) {
                                        case 0: inscrito.setId(Integer.valueOf(text)); break;
                                        case 1: inscrito.setFirstName(WordUtils.capitalizeFully(text)); break;
                                        case 2: inscrito.setLastName(WordUtils.capitalizeFully(text)); break;
                                        case 3: inscrito.setEmail(text); break;
                                    }
                                }
                                break;
                        }
                    }
                    parser.close();
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                }
            }
            Collections.sort(inscritos);
            return inscritos;
        }

        // Groups the parsed entries by initial, keeping the sorted order.
        public Map<String, List<Inscrito>> parse() {
            List<Inscrito> inscritos = this.process();
            Map<String, List<Inscrito>> resultado = new LinkedHashMap<String, List<Inscrito>>();
            for (Inscrito i : inscritos) {
                List<Inscrito> lista = resultado.get(i.getInicial());
                if (lista == null) {
                    lista = new ArrayList<Inscrito>();
                    resultado.put(i.getInicial(), lista);
                }
                lista.add(i);
            }
            return resultado;
        }
    }

The identifiers are in Portuguese, but the code should still be easy to follow; here is the GitHub repo.
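For the narrower case in the question (pulling a few nodes out of a large document and ignoring the rest), a much smaller loop may be enough. The following is only a sketch using the JDK's built-in StAX API; the class `StaxExtract`, the method `textOf`, and the sample element names are made up for illustration, and it assumes the target elements contain text only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: collects the text of every element with the given local name.
class StaxExtract {

    static List<String> textOf(String xml, String elementName) {
        List<String> result = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newFactory();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // Every event except the start tags we care about is simply skipped.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        // getElementText() reads up to the matching end tag and
                        // throws if the element has child elements (assumed not to here).
                        result.add(reader.getElementText().trim());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<order><id>42</id><items><item>a</item><item>b</item></items>"
                   + "<note>ignored</note></order>";
        System.out.println(textOf(xml, "item"));
    }
}
```

When the document arrives over the network, the same loop should work with `createXMLStreamReader(InputStream)` on the connection's stream, so the whole file never has to be buffered first.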

+6

If you are only extracting a small amount of data, consider using XPath, as it is a fair bit simpler than processing the whole document yourself.
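As a rough illustration of how little code that takes, here is a sketch with the JDK's `javax.xml.xpath` API. The class `XPathExtract`, the method `evaluate`, and the sample expression are assumptions for the example. As far as I know, the built-in implementation parses the whole input into an in-memory tree before evaluating, so this trades speed and memory for simplicity.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates one XPath expression and returns its string value.
class XPathExtract {

    static String evaluate(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // This overload parses the InputSource and returns the result as a String.
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<table><row><column>1</column><column>Nome</column></row></table>";
        // Second <column> of the first <row>.
        System.out.println(evaluate(xml, "/table/row[1]/column[2]"));
    }
}
```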

+4

Note: I'm the EclipseLink JAXB (MOXy) lead and a member of the JAXB 2 (JSR-222) expert group.

StAX (JSR-173) is generally the fastest way to parse XML, and Woodstox is known for being a fast StAX parser. In addition to parsing, you need to collect the XML data. This is where the combination of StAX and JAXB comes in handy.

To have your JAXB implementation use the Woodstox StAX implementation, set up your environment to use Woodstox (this is as easy as adding Woodstox to your classpath). Then create an instance of XMLStreamReader and pass it as the source that JAXB should unmarshal.

+2

Either SAX or StAX can handle this, with some intricate work to figure out when you are inside something you want. But for extracting a small set of things by explicit path, you might be best served by XPath.

Another potential tactic is to first filter out just the parts you want using XSLT, and then parse the result with whatever you like. The output of the filter will be a much smaller document.
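The XSLT filtering idea could look roughly like the sketch below, which uses the JDK's `javax.xml.transform` API. The stylesheet, the `email` and `emails` element names, and the class `XsltFilter` are all invented for the example. Note that an XSLT 1.0 processor typically reads the whole input before producing output, so this reduces the size of what you parse afterwards rather than the cost of the first pass.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical filter: keeps only <email> elements, wrapped in a new <emails> root.
class XsltFilter {

    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/'>"
        + "<emails><xsl:copy-of select='//email'/></emails>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String filter(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<people>"
                   + "<person><name>A</name><email>a@example.org</email></person>"
                   + "<person><name>B</name><email>b@example.org</email></person>"
                   + "</people>";
        // The filtered document can then be handed to any parser.
        System.out.println(filter(xml));
    }
}
```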

+1

I think you should use SAX or a SAX-based parser. I would recommend Apache Digester. SAX is event-driven and stateless, which is what you need here because you only have to extract a small part of the document (a single tag, I guess).
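Because SAX itself keeps no state, the handler has to remember whether it is inside the element of interest. Here is a minimal sketch with the JDK's SAX API (not Apache Digester); the class names and the trick of throwing a `SAXException` to stop parsing once the value is found are assumptions of this example, not something the API prescribes.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: returns the text of the first element with the given name.
class SaxFirstMatch {

    static final class FirstTextHandler extends DefaultHandler {
        private final String elementName;
        private StringBuilder buffer; // non-null only while inside the target element
        String result;

        FirstTextHandler(String elementName) {
            this.elementName = elementName;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (elementName.equals(qName)) {
                buffer = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (buffer != null) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (buffer != null && elementName.equals(qName)) {
                result = buffer.toString().trim();
                // Abort parsing: the rest of the document is not needed.
                throw new SAXException("stop: value found");
            }
        }
    }

    static String firstText(String xml, String elementName) {
        FirstTextHandler handler = new FirstTextHandler(elementName);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            if (handler.result == null) {
                throw new IllegalStateException(e); // a genuine parse error
            }
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return handler.result; // null if the element never appeared
    }

    public static void main(String[] args) {
        System.out.println(firstText("<row><id>1</id><name>Nome</name></row>", "name"));
    }
}
```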

0

Source: https://habr.com/ru/post/895020/

