Parsing XHTML5 into an XDocument

I need to parse XHTML5 files into XDocument instances. My files will always be well-formed XML, so I want to avoid HtmlAgilityPack because of its tolerance for malformed XHTML. The XDocument.Load method works for simple cases, but fails when the document contains named character references (entities):

var xhtml = XDocument.Load(reader);
// XmlException: Reference to undeclared entity 'nbsp'. 

For XHTML 1.0, this problem can be solved with XmlPreloadedResolver, which preloads the well-known DTDs defined for XHTML 1.0. This approach can be extended to support XHTML 1.1 by supplying the DTDs manually, as shown in this answer.
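For reference, the XHTML 1.0 case might look roughly like the sketch below. The class and method names (`Xhtml10Loader`, `Parse`) are made up for illustration; the sketch assumes the input carries one of the standard XHTML 1.0 document type declarations.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Resolvers;

// Hypothetical helper, not part of any library.
public static class Xhtml10Loader
{
    // Parses XHTML 1.0 text. The DTD and its entity sets are served from the
    // copies embedded in XmlPreloadedResolver, so nothing is fetched over the network.
    public static XDocument Parse(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10)
        };

        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            return XDocument.Load(reader);
        }
    }
}
```

With this in place, a reference such as `&nbsp;` in an XHTML 1.0 Strict document should resolve through the preloaded DTD instead of throwing.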

However, XHTML5 does not have a DTD, as discussed in this other answer. Its entity definitions are published for informational purposes only, as JSON. The document type declaration therefore references nothing that could be resolved:

<!DOCTYPE html>

As a result, the XmlResolver methods are never called when parsing entities in XHTML5. There is a discussion of attempts to supply XmlReader with a list of entity declarations, but no approach works out of the box.

There are currently two approaches that I'm considering. The first is to supply an internal subset with entity declarations in the document type declaration, either through string manipulation of the original XHTML or through XmlParserContext.InternalSubset. This results in a document type declaration similar to:

<!DOCTYPE html [
  <!ENTITY ndash "&#8211;">
  <!ENTITY nbsp "&#160;">
  ...
]>
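A minimal sketch of the string-manipulation variant follows. It assumes the input contains a plain `<!DOCTYPE html>` declaration and declares only two entities for brevity; `Xhtml5Loader` and its members are hypothetical names, and a real list would be generated from the HTML5 entity JSON.

```csharp
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class Xhtml5Loader
{
    // Illustrative subset of the entity list; extend as needed.
    private const string DoctypeWithEntities =
        "<!DOCTYPE html [\n" +
        "  <!ENTITY ndash \"&#8211;\">\n" +
        "  <!ENTITY nbsp \"&#160;\">\n" +
        "]>";

    // Matches the bare XHTML5 document type declaration.
    private static readonly Regex Html5Doctype = new Regex(@"<!DOCTYPE\s+html\s*>");

    public static XDocument Parse(string xhtml)
    {
        // Swap only the first occurrence for the version with an internal subset.
        // Assumption: that first occurrence is the real declaration, not text in a comment.
        string patched = Html5Doctype.Replace(xhtml, match => DoctypeWithEntities, 1);

        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // needed for the internal subset
            XmlResolver = null                   // nothing external to resolve
        };

        using (var reader = XmlReader.Create(new StringReader(patched), settings))
        {
            return XDocument.Load(reader);
        }
    }
}
```

Note that the loaded XDocument keeps the document type declaration, internal subset included, so it may need to be removed or reset before saving.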

This is still XHTML5; however, the full set of declarations has to be fed to XDocument for every document (over 2000 entities).

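The second is to run a regex over the XHTML and replace the named entities, leaving alone the ones predefined in XML (" & ' < >). CDATA sections complicate this, because entity-like text inside them must stay as it is.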
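The regex idea could be sketched as below. The lookup table holds just two entries for illustration, and `NamedEntityReplacer` is a hypothetical name. The sketch makes no attempt to skip CDATA sections, so it only suits input without them.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class NamedEntityReplacer
{
    // Illustrative subset: entity name -> numeric character reference(s).
    private static readonly Dictionary<string, string> Replacements =
        new Dictionary<string, string>
        {
            { "nbsp", "&#160;" },
            { "ndash", "&#8211;" }
        };

    // Matches named references such as &nbsp; but not numeric ones such as &#160;.
    private static readonly Regex NamedReference = new Regex(@"&([A-Za-z][A-Za-z0-9]*);");

    public static string Replace(string xhtml)
    {
        return NamedReference.Replace(xhtml, match =>
        {
            string replacement;
            // Names missing from the table, including the predefined
            // amp, lt, gt, quot and apos, are passed through unchanged.
            return Replacements.TryGetValue(match.Groups[1].Value, out replacement)
                ? replacement
                : match.Value;
        });
    }
}
```

Replacing with numeric character references rather than literal characters keeps the result independent of the file encoding.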

Is there a better approach? Ideally, something that hooks into XmlReader.

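This can be done with XSLT. The parser expands the declared entities while reading the input, so an identity transform (one that copies every node unchanged) writes out the actual characters.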

Input:

<!DOCTYPE foo [
 <!ENTITY ndash "&#8211;">
 <!ENTITY nbsp "&#160;">
]>
<foo>
  <p>I am &ndash; and I am&nbsp;non-breaking space.</p>
</foo>

XSLT (identity transform):

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

Output:

<foo>
   <p>I am – and I am non-breaking space.</p>
</foo>
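In .NET, a transform like this could be run roughly as follows. This is only a sketch: `IdentityTransformDemo` and `ExpandEntities` are hypothetical names, and the input is assumed to declare its entities in an internal subset, as in the example above.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper, not part of any library.
public static class IdentityTransformDemo
{
    private const string IdentityStylesheet =
        @"<xsl:stylesheet xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" version=""1.0"">
            <xsl:template match=""@*|node()"">
              <xsl:copy><xsl:apply-templates select=""@*|node()""/></xsl:copy>
            </xsl:template>
          </xsl:stylesheet>";

    // Runs the identity transform over the input and returns the serialized result.
    public static string ExpandEntities(string xml)
    {
        var transform = new XslCompiledTransform();
        using (var stylesheetReader = XmlReader.Create(new StringReader(IdentityStylesheet)))
        {
            transform.Load(stylesheetReader);
        }

        // The internal subset must be parsed for the entities to be known.
        var inputSettings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            XmlResolver = null
        };

        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(xml), inputSettings))
        using (var writer = XmlWriter.Create(output, transform.OutputSettings))
        {
            transform.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

The document type declaration is not part of the data model that XSLT operates on, so it does not appear in the result.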

The entity declarations can also be kept in an external file and pulled in through a parameter entity:

<!ENTITY % winansi SYSTEM "path/to/my/map/winansi.xml">  %winansi;]>

Source: https://habr.com/ru/post/1629670/

