Is it possible to ignore an invalid XML character with Scala's built-in xml handlers?

I have an XML file (from the federal government's data.gov) that I am trying to read using Scala's XML handlers.

val loadnode = scala.xml.XML.loadFile(filename) 

There appears to be an invalid XML character in it. Is it possible to simply ignore invalid characters, or is my only option to clean the file first?

 org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x12) was found in the element content of the document. 

Ruby's Nokogiri was able to parse the file despite the invalid character.

+4
3 answers

I actually wonder whether a literal 0x12 is valid even in XML 1.1. See this summary of the differences between XML 1.0 and 1.1. In particular:

In addition, XML 1.1 allows you to have control characters in your documents through the use of character references. This affects the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. This means that your documents can now include the bell character, for example: &#x7;. However, you still cannot have these characters appear directly in your documents; this violates the definition of the MIME type used for XML (text/xml).

Xerces can parse XML 1.1, but it seems to want the character reference &#18; rather than the literal 0x12 character:

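    import scala.xml.{Source, XML}

    // Literal 0x12 in the content, even though the document declares version 1.1
    val s = "<?xml version='1.1'?><root>\u0012</root>"
    // causes: An invalid XML character (Unicode: 0x12)
    // XML.loadXML(Source.fromString(s), XML.parser)

    // Same character written as a character reference
    val u = "<?xml version='1.1'?><root>&#18;</root>"
    val v = XML.loadXML(Source.fromString(u), XML.parser)
    println(v) // works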

As suggested by lavinio, you can filter out invalid characters. This does not take up too many lines in Scala:

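    import java.io.{FileInputStream, InputStream}
    import scala.xml.XML

    // Wrap the file stream and skip every 0x12 byte before the parser sees it.
    val in = new InputStream {
      val in0 = new FileInputStream("invalid.xml")
      override def read(): Int = in0.read() match {
        case 0x12 => read()
        case x    => x
      }
    }
    val x = XML.load(in)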
+5

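To extend @huynhjl's answer: a byte-level InputStream filter is dangerous with multibyte encodings such as UTF-16, where a 0x12 byte can be one half of an unrelated character. (In UTF-8 every byte of a multibyte sequence is 0x80 or higher, so this particular byte is safe, but a byte filter still cannot apply rules that are defined in terms of characters.) Instead, use a character-oriented filter: FilterReader. Or, if the file is small enough, load it into a String and replace the characters there.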

    scala> val origXml = "<?xml version='1.1'?><root>\u0012</root>"
    origXml: java.lang.String = <?xml version='1.1'?><root></root>

    scala> val cleanXml = origXml flatMap {
         |   case x if Character.isISOControl(x) => "&#x" + Integer.toHexString(x) + ";"
         |   case x => x.toString
         | }
    cleanXml: String = <?xml version='1.1'?><root>&#x12;</root>

    scala> scala.xml.XML.loadString(cleanXml)
    res14: scala.xml.Elem = <root></root>
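
For files too large to hold in a String, the FilterReader variant mentioned above could look roughly like the sketch below. The class name XmlControlCharFilter and its clean helper are made up for illustration and are not part of scala.xml. As a simplification, it only drops the C0 control characters that XML 1.0 forbids (everything below 0x20 except tab, LF and CR) and leaves all other characters alone.

```scala
import java.io.{FilterReader, Reader, StringReader}

// Hypothetical helper (not part of scala.xml): drops control characters
// that XML 1.0 forbids, working on decoded chars rather than raw bytes.
class XmlControlCharFilter(underlying: Reader) extends FilterReader(underlying) {

  // Tab, LF and CR are the only characters below 0x20 that XML 1.0 allows.
  private def keep(c: Char): Boolean =
    c >= 0x20 || c == '\t' || c == '\n' || c == '\r'

  // Single-char read: skip forbidden characters until a valid one or EOF.
  override def read(): Int = {
    var c = super.read()
    while (c != -1 && !keep(c.toChar)) c = super.read()
    c
  }

  // Bulk read: compact the kept characters to the front of the slice.
  // Keeps reading when a whole chunk was dropped, so that callers never
  // see a 0-length read before the real end of the stream.
  override def read(buf: Array[Char], off: Int, len: Int): Int = {
    if (len == 0) 0
    else {
      var kept = 0
      var eof = false
      while (kept == 0 && !eof) {
        val n = super.read(buf, off, len)
        if (n == -1) eof = true
        else {
          var i = off
          while (i < off + n) {
            if (keep(buf(i))) { buf(off + kept) = buf(i); kept += 1 }
            i += 1
          }
        }
      }
      if (kept == 0) -1 else kept
    }
  }
}

object XmlControlCharFilter {
  // Convenience for small inputs: run a String through the filter.
  def clean(s: String): String = {
    val reader = new XmlControlCharFilter(new StringReader(s))
    val out = new StringBuilder
    val buf = new Array[Char](1024)
    var n = reader.read(buf, 0, buf.length)
    while (n != -1) {
      out.append(new String(buf, 0, n))
      n = reader.read(buf, 0, buf.length)
    }
    out.toString
  }
}
```

In real use the filter would wrap an InputStreamReader opened with the file's actual charset, and the result would be passed to scala.xml.XML.load, which accepts a java.io.Reader. Unlike the character-reference replacement above, this version discards the offending characters instead of escaping them.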
+10

0x12 is valid only in XML 1.1. If your XML file claims this version, you can enable 1.1 processing support in your SAX parser.

Otherwise, the underlying parser is probably Xerces, which, as a conforming XML parser, is right to complain.

If you must handle these streams, I would wrap the input file in an InputStream or Reader that filters out characters with invalid Unicode values and passes the rest through.
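
Which values count as invalid is fixed by the Char production of the XML 1.0 specification. A minimal sketch of that check, applied per code point to an in-memory String, might look like this; the object name Xml10Chars and its methods are illustrative only, not an existing API.

```scala
// Hypothetical helper, not part of any library.
object Xml10Chars {

  // Char production from the XML 1.0 specification:
  // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
  def isValid(cp: Int): Boolean =
    cp == 0x9 || cp == 0xA || cp == 0xD ||
      (cp >= 0x20 && cp <= 0xD7FF) ||
      (cp >= 0xE000 && cp <= 0xFFFD) ||
      (cp >= 0x10000 && cp <= 0x10FFFF)

  // Removes every code point that the Char production rejects.
  // Walks by code point so that surrogate pairs are judged as one character;
  // an unpaired surrogate falls in the excluded D800-DFFF range and is dropped.
  def strip(s: String): String = {
    val out = new java.lang.StringBuilder(s.length)
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)
      if (isValid(cp)) out.appendCodePoint(cp)
      i += Character.charCount(cp)
    }
    out.toString
  }
}
```

The same isValid predicate could drive a Reader wrapper instead, provided the wrapper takes care of surrogate pairs that straddle two reads.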

+3

Source: https://habr.com/ru/post/1303623/

