Java: collecting byte offsets of XML tags using XMLStreamReader

Is there a way to accurately collect the byte offsets of XML tags using XMLStreamReader?

I have a large XML file that I need random access to. Instead of writing it all to a database, I would like to run through it once with XMLStreamReader to collect the byte offsets of the meaningful tags, and then use RandomAccessFile to retrieve the contents of a tag later.
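As a rough sketch of the retrieval half of that plan, assuming the start and end byte offsets of a tag have already been collected somehow, something like the following could pull the tag back out with RandomAccessFile. The class name TagSliceReader, the method readSlice and the sample offsets are made up for illustration, and the file is assumed to be UTF-8.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any library.
final class TagSliceReader {

    // Reads the bytes in [startOffset, endOffset) and decodes them as UTF-8.
    static String readSlice(Path file, long startOffset, long endOffset) throws IOException {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException("invalid byte range");
        }
        byte[] buffer = new byte[Math.toIntExact(endOffset - startOffset)];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(startOffset);
            raf.readFully(buffer); // throws EOFException if the range runs past the end of the file
        }
        return new String(buffer, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String xml = "<root><item>h\u00e9llo</item><item>world</item></root>";
        Path file = Files.createTempFile("sample", ".xml");
        Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
        // Offsets are hard-coded for the demo: "<root>" is 6 bytes, and the first
        // <item> element is 6 + 6 + 7 = 19 bytes because the accented letter takes 2 bytes.
        System.out.println(readSlice(file, 6, 25));
        Files.delete(file);
    }
}
```

The offsets have to be byte offsets, not character offsets: as soon as the file contains multi-byte characters the two drift apart, and seek() only understands bytes.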

XMLStreamReader does not seem to be able to track character offsets itself. Instead, people recommend attaching the XMLStreamReader to an input stream that keeps track of how many bytes have been read (e.g. the CountingInputStream provided by Apache Commons IO).

For example:

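import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.io.input.CountingInputStream;

XMLInputFactory xmlStreamFactory = XMLInputFactory.newInstance();
// Wrap the file stream so the number of bytes pulled from it can be queried
CountingInputStream countingReader = new CountingInputStream(new FileInputStream(xmlFile));
XMLStreamReader xmlStreamReader = xmlStreamFactory.createXMLStreamReader(countingReader, "UTF-8");

while (xmlStreamReader.hasNext()) {
    int eventCode = xmlStreamReader.next();

    switch (eventCode) {
        case XMLStreamReader.END_ELEMENT:
            // getByteCount() is how many bytes the parser has read from the stream so far
            System.out.println(xmlStreamReader.getLocalName() + " @" + countingReader.getByteCount());
            break;
    }
}
xmlStreamReader.close();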

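However, the counts this produces do not line up with the tags, because the parser reads ahead of the event it is reporting. Is there a way to get accurate xml offsets (ideally, without writing my own XML parser)?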

+3
5

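You could use getLocation() on XMLStreamReader (or XMLEvent.getLocation() if you use XMLEventReader), but as far as I remember it is not guaranteed to be reliable or precise. It also appears to report the end of the tag rather than the start.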

+2

Aalto has LocationInfo.

You can use ximpleware's Java VTD-XML, version 2.11 (http://sourceforge.net/projects/vtd-xml/files/vtd-xml/), and hook into the getChar() method of its IReader implementations.

The IReader implementation for each character encoding is in VTDGen.java and VTDGenHuge.java.

IReader is implemented for the following encodings:

ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258

For each IReader you can add a getCharOffset() method backed by a charCount field in VTDGen and VTDGenHuge, updated inside that IReader's getChar() and skipChar().

+1

Just change your switch statement as follows:

switch (eventCode) {
    case XMLStreamReader.END_ELEMENT:
        System.out.println(xmlStreamReader.getLocalName() + " end@" + xmlStreamReader.getLocation().getCharacterOffset());
        break;
}
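For anyone who wants to try this without an existing file, here is a minimal self-contained sketch of the same idea using only the StAX classes that ship with the JDK. EndOffsetIndexer and Entry are hypothetical names invented for the example. What the reported number means (byte or character offset, and how close it sits to the end of the tag) depends on the StAX implementation in use, so treat the values as something to verify against your own parser rather than a guarantee.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class EndOffsetIndexer {

    // One closing tag plus whatever offset the parser reported right after reading it.
    static final class Entry {
        final String name;
        final int offset;

        Entry(String name, int offset) {
            this.name = name;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return name + " end@" + offset;
        }
    }

    // Walks the document once and records getLocation().getCharacterOffset() at every END_ELEMENT.
    static List<Entry> index(InputStream in) throws XMLStreamException {
        List<Entry> entries = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in, "UTF-8");
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.END_ELEMENT) {
                    entries.add(new Entry(reader.getLocalName(),
                            reader.getLocation().getCharacterOffset()));
                }
            }
        } finally {
            reader.close();
        }
        return entries;
    }

    public static void main(String[] args) throws XMLStreamException {
        byte[] xml = "<root><item>a</item><item>b</item></root>".getBytes(StandardCharsets.UTF_8);
        for (Entry e : index(new ByteArrayInputStream(xml))) {
            System.out.println(e);
        }
    }
}
```

Running it on a small document with known tag positions, including one with non-ASCII text, is a quick way to see whether the numbers from your parser can be fed to RandomAccessFile as they are.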

, , JAR .

( , , XMLStreamReader), , .

0

XML java?. , ANTLR XML-Parser.

0

Source: https://habr.com/ru/post/1753088/

