XML / Java: Exact line and character position when parsing tags and attributes?

I am trying to find a way to accurately determine the line number and character position of both tags and attributes when parsing an XML document. I want to do this so that I can accurately tell the author of the XML document (via the web interface) where the document is invalid.

Ultimately, I want to place the caret inside the invalid tag, or just inside the opening quote of an invalid attribute. (I am not using an XML schema at this point, because the exact format of the attribute values cannot be verified by a schema alone. I might even want to report an attribute as invalid part-way through its value, or similarly part-way through the text between a start and end tag.)

I've tried using SAX (org.xml.sax) and the Locator interface. This works up to a point, but not well enough. It only reports the read position after the event; for example, for startElement() it reports the character immediately after the end of the open tag. I can't just subtract the length of the tag name, because attributes, self-closing tags and/or newlines inside the open tag would throw the calculation off. (And the Locator provides no information about the position of attributes at all.)
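To illustrate the behaviour, here is a minimal sketch. The class name LocatorAfterEventDemo, the helper startElementLines and the sample document are made up for illustration; only the standard JAXP/SAX classes are real. It records the line the Locator reports at each startElement() call for a tag that is spread over several lines.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class LocatorAfterEventDemo {

    /** Returns one "qName@line" entry per startElement event (hypothetical helper). */
    static List<String> startElementLines(String xml) throws Exception {
        final List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // The locator describes where the event ends, not where the tag starts.
                seen.add(qName + "@" + locator.getLineNumber());
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // <item opens on line 2, but its closing '>' is on line 4.
        String xml = "<root>\n<item\n  id=\"1\"\n  name=\"x\">\n</item>\n</root>";
        System.out.println(startElementLines(xml));
    }
}
```

With the parser bundled in the JDK, item is reported on line 4 (where the closing '>' sits) rather than on line 2 where the tag opens, and nothing in the callback says where id or name begin.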

Ideally, I would like an event-based approach, as I already have a SAX handler that builds my own DOM-like representation for further processing. However, I would be interested to hear about any DOM or DOM-like library that keeps accurate position information for its model elements.

Has anyone solved this problem, or a similar one, with the required level of accuracy?

+5
2 answers

XML parsers will (and should) normalize certain things, such as extra whitespace, so an exact mapping back to the character stream is not possible.

You would do better to look into a lexer or tokenizer for this; in other words, drop to a level of detail below the XML parser.

There are several general frameworks for writing lexers in Java. The ANTLR 3 page has a good overview of lexer versus parser, and its section 1 includes a rudimentary XML example.
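To make the idea concrete without pulling in a lexer framework, here is a hand-rolled sketch. The names PositionScanner, Hit and scan are invented for this example. It walks the raw text, keeps 1-based line and column counters, and records where each start tag's '<' is and where each attribute value begins (just inside the opening quote). It is deliberately naive: comments, CDATA sections, processing instructions and DOCTYPE are skipped only up to the next '>', and entities are not handled at all.

```java
import java.util.ArrayList;
import java.util.List;

class PositionScanner {

    /** A located item: kind is "tag" or "attr"; line and col are 1-based. */
    static final class Hit {
        final String kind, name;
        final int line, col;
        Hit(String kind, String name, int line, int col) {
            this.kind = kind; this.name = name; this.line = line; this.col = col;
        }
        @Override public String toString() {
            return kind + " " + name + " @" + line + ":" + col;
        }
    }

    private final String src;
    private int pos = 0, line = 1, col = 1;
    private final List<Hit> hits = new ArrayList<>();

    private PositionScanner(String src) { this.src = src; }

    static List<Hit> scan(String xml) {
        PositionScanner s = new PositionScanner(xml);
        s.run();
        return s.hits;
    }

    private boolean atEnd() { return pos >= src.length(); }
    private char peek() { return src.charAt(pos); }

    /** Consumes one character, keeping line and column in step. */
    private void advance() {
        if (src.charAt(pos++) == '\n') { line++; col = 1; } else { col++; }
    }

    private void skipWhitespace() {
        while (!atEnd() && Character.isWhitespace(peek())) advance();
    }

    /** Reads up to the next whitespace, '=', '/' or '>'. */
    private String readName() {
        int start = pos;
        while (!atEnd() && " \t\r\n=/>".indexOf(peek()) < 0) advance();
        return src.substring(start, pos);
    }

    private void run() {
        while (!atEnd()) {
            if (peek() != '<') { advance(); continue; }  // character data
            int tagLine = line, tagCol = col;             // position of '<'
            advance();
            if (atEnd()) break;
            char c = peek();
            if (c == '/' || c == '?' || c == '!') {       // end tag, PI, comment, DOCTYPE
                while (!atEnd() && peek() != '>') advance();
                continue;
            }
            hits.add(new Hit("tag", readName(), tagLine, tagCol));
            scanAttributes();
        }
    }

    private void scanAttributes() {
        while (true) {
            skipWhitespace();
            if (atEnd() || peek() == '>' || peek() == '/') return;
            String name = readName();
            skipWhitespace();
            if (atEnd() || peek() != '=') return;         // malformed: give up on this tag
            advance();
            skipWhitespace();
            if (atEnd() || (peek() != '"' && peek() != '\'')) return;
            char quote = peek();
            advance();                                    // now just inside the opening quote
            hits.add(new Hit("attr", name, line, col));
            while (!atEnd() && peek() != quote) advance();
            if (!atEnd()) advance();                      // closing quote
        }
    }
}
```

Calling PositionScanner.scan(xml) on the raw document text returns the hits in document order; these positions can then be matched up with whatever the real XML parser reports, for example by element order.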

I would also add that, for a web interface, you might consider a pure client-side solution (e.g. JavaScript).

+2

I wrote a quick XML reader that gets line numbers, throws an exception when it finds an unwanted attribute, and reports the text near which the error occurred.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.apache.log4j.Logger;
    import org.w3c.dom.Document;
    import org.xml.sax.Attributes;
    import org.xml.sax.Locator;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class LocatorTestSAXReader {

        private static final Logger logger = Logger.getLogger(LocatorTestSAXReader.class);
        private static final String XML_FILE_PATH = "lib/xml/test-instance1.xml";

        public Document readXMLFile() {
            Document doc = null;
            SAXParser parser = null;
            SAXParserFactory saxFactory = SAXParserFactory.newInstance();
            try {
                parser = saxFactory.newSAXParser();
                DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
                DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
                doc = docBuilder.newDocument();
            } catch (ParserConfigurationException e) {
                e.printStackTrace();
            } catch (SAXException e) {
                e.printStackTrace();
            }

            // Element text seen so far; quoted in the error message for context.
            final StringBuilder text = new StringBuilder();

            DefaultHandler eleHandler = new DefaultHandler() {
                private Locator locator;

                @Override
                public void characters(char[] ch, int start, int length) {
                    String thisText = new String(ch, start, length);
                    // Keep only chunks that contain at least one letter
                    // (the original range [a-zA-z] also matched some punctuation).
                    if (thisText.matches(".*[a-zA-Z]+.*")) {
                        text.append(thisText);
                        logger.debug("element text: " + thisText);
                    }
                }

                @Override
                public void setDocumentLocator(Locator locator) {
                    // Called by the parser before any other event.
                    this.locator = locator;
                }

                @Override
                public void startElement(final String uri, final String localName,
                        final String qName, final Attributes attributes) throws SAXException {
                    int lineNum = locator.getLineNumber();
                    logger.debug("I am now on line " + lineNum + " at element " + qName);
                    int len = attributes.getLength();
                    for (int i = 0; i < len; i++) {
                        String attVal = attributes.getValue(i);
                        String attName = attributes.getQName(i);
                        logger.debug("att " + attName + "=" + attVal);
                        if (attName.startsWith("bad")) {
                            throw new SAXException("found attr : " + attName + "=" + attVal
                                    + " that starts with bad! at line : " + locator.getLineNumber()
                                    + " at element " + qName
                                    + "\nelement occurs below text : " + text);
                        }
                    }
                }
            };

            try {
                parser.parse(new FileInputStream(new File(XML_FILE_PATH)), eleHandler);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (SAXException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return doc;
        }
    }

As for the text: depending on where the error occurs in the XML file, there may be no text at all. So with this XML:

    <?xml version="1.0"?>
    <root>
    <section>
    <para>This is a quick doc to test the ability to get line numbers via the Locator object. </para>
    </section>
    <section bad:attr="ok">
    <para>another para.</para>
    </section>
    </root>

If the bad attribute is on the first element, the text will be empty. For the XML above, the exception was:

    org.xml.sax.SAXException: found attr : bad:attr=ok that starts with bad! at line : 6 at element section
    element occurs below text : This is a quick doc to test the ability to get line numbers via the Locator object. 

When you say you tried to use the Locator object, what exactly was the problem?

0

Source: https://habr.com/ru/post/1263641/

