I'm struggling a bit with this. I have a multilingual web application that outputs XML at some point. The XML can contain text in any language, so my approach to sanitization is to reject, at input time, only the characters that would break the XML. I'm wrapping as much as possible in CDATA, but I have a ton of content in attributes. I don't want to ban special characters outright, because perfectly valid characters such as brackets, periods, dashes, ticks, and apostrophes are used all the time and work fine.
What is the best way to strip or escape every character that would break an XML attribute while leaving the text of every language intact?
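To make the question concrete, here is a minimal sketch of the kind of helper I have in mind. It assumes Python and XML 1.0; `sanitize_attribute` is a hypothetical name, not something from my application. It drops only the code points that XML 1.0 forbids and lets the standard library's `quoteattr` escape the markup characters, so non-Latin text passes through untouched.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Anything outside the XML 1.0 "Char" production (e.g. NUL, most control
# characters, lone surrogates) can never appear in a document, so drop it.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def sanitize_attribute(value: str) -> str:
    """Hypothetical helper: return `value` as a quoted, escaped XML attribute."""
    cleaned = _INVALID_XML_CHARS.sub("", value)
    # quoteattr escapes &, <, > and picks a quote style that fits the content.
    return quoteattr(cleaned)

raw = 'a & b < "c" \x00\u2022 \u65e5\u672c\u8a9e'
xml_text = "<foo a=%s/>" % sanitize_attribute(raw)

# Round trip through a parser to check that the value survives intact
# (minus the forbidden NUL character).
parsed = ET.fromstring(xml_text).get("a")
```

The round trip at the end is only a sanity check: the parsed attribute should equal the input with the forbidden character removed.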
UPDATE:
I found http://en.wikipedia.org/wiki/CDATA#CDATA-type_attribute_value, which says I can declare the attribute as CDATA type using a DTD; however, it does not behave the way it sounds.
<?xml version="1.0" ?>
<!DOCTYPE foo [
  <!ELEMENT foo EMPTY>
  <!ATTLIST foo a CDATA #REQUIRED>
]>
<foo a="&bull;"><![CDATA[ &bull; ]]>
</foo>
Any validator will complain that bull is not a defined entity in the attribute. If you remove the attribute, the document validates. Also, I've heard that schemas are the way to go, so an answer that does something like the above with an XML Schema instead of a DTD would be awesome.
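The complaint can be reproduced without a validator. This is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser, which is non-validating and so only checks well-formedness; `attribute_value` is a hypothetical helper name. A named reference that was never declared is rejected inside an attribute, while a numeric character reference or the raw character is accepted.

```python
import xml.etree.ElementTree as ET

def attribute_value(xml_text):
    """Hypothetical helper: return attribute 'a' of the root element,
    or None if the document is not well-formed."""
    try:
        return ET.fromstring(xml_text).get("a")
    except ET.ParseError:
        return None

# Named reference with no entity declaration: the parser rejects it.
named = attribute_value('<foo a="&bull;"/>')

# Numeric character reference for U+2022 BULLET: accepted.
numeric = attribute_value('<foo a="&#8226;"/>')

# The raw character itself: also accepted.
literal = attribute_value('<foo a="\u2022"/>')
```

So, at least with this parser, the attribute's declared type is not what matters; what matters is whether the reference is one the parser already knows.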
Thanks!