How to parse XML with German umlauts in tag names?

I am trying to parse XML in Java with:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new ByteArrayInputStream(msg.getBytes("UTF-8")));

Everything seems to work fine, and

<data>äöü</data>

is parsed correctly (in particular the German umlauts).

But when I try to parse

<däta>xxx</däta>

the parser throws an exception; names with umlauts in them don't seem to work:

org.w3c.dom.DOMException: WFä
    at org.apache.harmony.xml.dom.NodeImpl.setName(NodeImpl.java:286)
    at org.apache.harmony.xml.dom.AttrImpl.<init>(AttrImpl.java:55)
    at org.apache.harmony.xml.dom.DocumentImpl.createAttribute(DocumentImpl.java:324)
    at org.apache.harmony.xml.parsers.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:314)
    at org.apache.harmony.xml.parsers.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:321)
    at org.apache.harmony.xml.parsers.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:128)
1 answer

According to the XML specification, a tag name must start with a NameStartChar and may continue with any NameChar:

NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
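The character ranges listed above (the NameStartChar and NameChar productions of the XML specification) can be checked mechanically. The following is only a minimal sketch: the class `XmlNameChars` and its methods are hypothetical names made up for illustration, not part of any XML library, and the tables are copied directly from the ranges quoted above.

```java
final class XmlNameChars {
    // Inclusive code point ranges allowed as the FIRST character of a name.
    private static final int[][] NAME_START_RANGES = {
        {':', ':'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF}, {0x370, 0x37D},
        {0x37F, 0x1FFF}, {0x200C, 0x200D}, {0x2070, 0x218F}, {0x2C00, 0x2FEF},
        {0x3001, 0xD7FF}, {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };

    // Additional ranges allowed only AFTER the first character.
    private static final int[][] NAME_EXTRA_RANGES = {
        {'-', '-'}, {'.', '.'}, {'0', '9'}, {0xB7, 0xB7},
        {0x0300, 0x036F}, {0x203F, 0x2040}
    };

    private static boolean inRanges(int codePoint, int[][] ranges) {
        for (int[] range : ranges) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    static boolean isNameStartChar(int codePoint) {
        return inRanges(codePoint, NAME_START_RANGES);
    }

    static boolean isNameChar(int codePoint) {
        return isNameStartChar(codePoint) || inRanges(codePoint, NAME_EXTRA_RANGES);
    }

    // Walks the string by code point so characters above U+FFFF are handled too.
    static boolean isValidName(String name) {
        if (name.isEmpty()) {
            return false;
        }
        int first = name.codePointAt(0);
        if (!isNameStartChar(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < name.length(); ) {
            int codePoint = name.codePointAt(i);
            if (!isNameChar(codePoint)) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return true;
    }

    public static void main(String[] args) {
        // \u00E4 is the umlaut a-diaeresis, written as an escape to avoid source-encoding issues.
        System.out.println(isValidName("d\u00E4ta")); // true: U+00E4 lies in [#xD8-#xF6]
        System.out.println(isValidName("1data"));     // false: a digit cannot start a name
    }
}
```

Such a check only says whether a name is allowed by the grammar; it does not explain why a particular parser rejects it.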

The character 'ä' is U+00E4, so it falls in the range [#xD8-#xF6] and is valid in tag names. Throw away your XML parser ;-)
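One way to confirm that the input itself is well-formed is to run the same parsing code against a different parser. The sketch below assumes a desktop JVM, where `DocumentBuilderFactory.newInstance()` returns the JDK's bundled parser rather than the `org.apache.harmony` implementation from the stack trace; the class `UmlautTagDemo` and the helper `rootName` are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class UmlautTagDemo {
    // Parses the given XML string as UTF-8 and returns the root element's tag name.
    static String rootName(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        // Same document as in the question; \u00E4 is the umlaut a-diaeresis.
        String msg = "<d\u00E4ta>xxx</d\u00E4ta>";
        System.out.println(rootName(msg));
    }
}
```

If this parses without an exception while the same string fails on the original platform, the problem lies in that platform's parser and not in the document.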


Source: https://habr.com/ru/post/1569087/

