Java, JavaCC: how to parse characters outside BMP?

I am referring to the XML 1.1 specification.

Take a look at the definition of NameStartChar:

NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

If I interpret this correctly, the last range (#x10000-#xEFFFF) lies outside what a single UTF-16 code unit, such as a Java char, can hold. So it must be UTF-32, right? And so I need to check pairs of chars for this range instead of a single char, right?

My questions:

  • How to check for such character ranges using standard Java methods?
  • How can you define such ranges in JavaCC?
    • JavaCC complains about \u10000 and \uEFFFF

Thanks!

NOTE: Don't worry, I'm not trying to write my own XML parser.

EDIT: I am writing a parser that checks whether text imported from various text formats (not XML) forms valid XML names.

2 answers

Take a look at Character.toCodePoint(char, char), which converts a surrogate pair to a full-range code point. String.codePointAt may also be useful to you.

There is plenty of other surrogate support in Character and String. To say exactly which methods to call, we would need to know more details of your situation.
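As a rough illustration of the code point approach, here is a minimal sketch that tests whether a string begins with a NameStartChar, using the ranges quoted in the question. The class and method names (XmlNameStart, isNameStartChar, startsWithNameStartChar) are made up for this example; only String.codePointAt and Character.toCodePoint are standard library calls.

```java
class XmlNameStart {
    // Inclusive code point ranges copied from the NameStartChar production in the question.
    private static final int[][] RANGES = {
        {':', ':'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF}, {0x370, 0x37D},
        {0x37F, 0x1FFF}, {0x200C, 0x200D}, {0x2070, 0x218F},
        {0x2C00, 0x2FEF}, {0x3001, 0xD7FF}, {0xF900, 0xFDCF},
        {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };

    // Hypothetical helper: true if the code point falls in any of the ranges above.
    static boolean isNameStartChar(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical helper: codePointAt(0) combines a leading surrogate pair
    // into one code point, so supplementary characters are handled too.
    static boolean startsWithNameStartChar(String s) {
        return !s.isEmpty() && isNameStartChar(s.codePointAt(0));
    }

    public static void main(String[] args) {
        // A surrogate pair encoding U+10000, followed by ASCII letters.
        System.out.println(startsWithNameStartChar("\uD800\uDC00abc"));
        // A digit is not a NameStartChar.
        System.out.println(startsWithNameStartChar("1abc"));
        // Character.toCodePoint does the same combination explicitly.
        System.out.println(isNameStartChar(Character.toCodePoint('\uD800', '\uDC00')));
    }
}
```

Note that a lone (unpaired) surrogate is returned by codePointAt as its own value in the D800-DFFF block, which none of the ranges cover, so it is rejected.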


I found http://www.fileformat.info/info/unicode/char/10000/index.htm to be a convenient site for learning Unicode characters.

For example, U+10000 and U+10FFFF are:

 String first = "\uD800\uDC00"; // U+10000
 String last = "\uDBFF\uDFFF"; // U+10FFFF
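If you would rather compute such pairs than look them up, the following sketch prints the UTF-16 escape sequence for any code point using Character.toChars. The class and method names (SurrogateBounds, escape) are invented for the example.

```java
class SurrogateBounds {
    // Hypothetical helper: renders a code point as Java-style escapes,
    // one per UTF-16 code unit (two for supplementary characters).
    static String escape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape(0x10000));  // first code point of the last NameStartChar range
        System.out.println(escape(0xEFFFF));  // last code point of that range
        System.out.println(escape(0x10FFFF)); // highest Unicode code point
    }
}
```

For U+EFFFF this gives the pair DB7F, DFFF. Assuming JavaCC is matching UTF-16 chars one at a time (an assumption about your setup, and untested), the range #x10000-#xEFFFF could then probably be written as a high-surrogate class followed by a low-surrogate class, something like ["\uD800"-"\uDB7F"] ["\uDC00"-"\uDFFF"], since every pair in that product falls inside the range.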

Source: https://habr.com/ru/post/1310353/

