Check if the document is correctly formed before parsing

I need to parse several thousand XML documents to find out whether any of them contain a specific construct. The problem is that some of the documents are not well-formed XML.

My main idea was to use fn:collection() and search inside the returned nodes. But this only works if all the documents in the collection are well-formed.

Is it possible to do something similar, but only parse well-formed documents?

This is my XSLT, simplified, which works if all the documents in $dir are well-formed:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>
  <xsl:variable name="dir" as="xs:string">file:/c:/path/to/files/</xsl:variable>
  <xsl:variable name="files" select="concat($dir, '?select=*.xml')" as="xs:string"/>

  <xsl:template match="/">
    <xsl:variable name="docs" select="collection($files)"/>
    <xsl:variable name="names" select="
      for $i in $docs return
        distinct-values($i//*[exists(@an-attribute-to-find)]/local-name())"/>
    <xsl:value-of select="distinct-values($names)" separator="&#x0a;"/>
  </xsl:template>

</xsl:stylesheet>

+3
3

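Within XSLT, you can pass the document URIs in through an external parameter (<xsl:param>) and then, for each one, use the XPath 2.0 function doc-available() to check whether it can be loaded before you read it.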
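
A minimal sketch of the parameter-plus-doc-available() approach, reusing the attribute test from the question. The parameter name `uris` and its whitespace-separated format are assumptions made for illustration; the caller would have to build that list outside the stylesheet, and URIs containing spaces would need to be escaped (for example as %20).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>

  <!-- Hypothetical parameter: a whitespace-separated list of document URIs
       supplied by the caller. -->
  <xsl:param name="uris" as="xs:string" select="''"/>

  <xsl:template match="/">
    <!-- Keep only the documents that doc() could load without an error;
         doc-available() returns false for anything that fails to parse. -->
    <xsl:variable name="docs" select="
      for $u in tokenize(normalize-space($uris), ' ')[. ne '']
      return if (doc-available($u)) then doc($u) else ()"/>
    <xsl:value-of select="distinct-values(
      $docs//*[exists(@an-attribute-to-find)]/local-name())" separator="&#x0a;"/>
  </xsl:template>

</xsl:stylesheet>
```

Documents that are missing or not well-formed are silently skipped here. If you also want a list of the rejected files, the same `for` expression can return `$u` whenever doc-available($u) is false.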

+3

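You could try TagSoup.

If you are using Saxon, TagSoup can be plugged in as the parser like this:

... run Saxon with the option -x org.ccil.cowan.tagsoup.Parser, provided that TagSoup is on the Java classpath.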
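
As a variation on the same idea, the parser can be named in the collection URI instead of on the command line. This is only a sketch: it assumes that the Saxon release in use accepts a `parser` query parameter on directory collection URIs (check the collection documentation for your version) and that TagSoup is on the classpath.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>
  <xsl:variable name="dir" as="xs:string">file:/c:/path/to/files/</xsl:variable>

  <!-- Assumption: this Saxon release honours 'parser=' in the collection URI,
       so every file in the directory is read through TagSoup. -->
  <xsl:variable name="files" as="xs:string"
    select="concat($dir, '?select=*.xml;parser=org.ccil.cowan.tagsoup.Parser')"/>

  <xsl:template match="/">
    <xsl:value-of select="distinct-values(
      collection($files)//*[exists(@an-attribute-to-find)]/local-name())" separator="&#x0a;"/>
  </xsl:template>

</xsl:stylesheet>
```

Note the difference from the doc-available() approach: TagSoup repairs malformed input rather than rejecting it, so the broken documents are included in the search instead of being skipped.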

+2

You can use the doc-available function to tell you if the document is correctly formed.

+2

Source: https://habr.com/ru/post/1764488/
