I am working on an RTF parser and am having difficulty handling Unicode.
The RTF specification states that "Unicode values greater than 32767 must be expressed as negative numbers" ( http://www.biblioscape.com/rtf15_spec.htm#Heading9 ), so to recover the Unicode numeric value you add 65536 to these negative numbers.
I tested this scenario by creating a document containing the Unicode characters 32767 and 32768. Word (v2011 on Mac) generates the following RTF for these two characters:
\u32767\'5f\loch\af556\hich\af31506\dbch\f556 \uc2\u-32768\'97\'73
For the second character, -32768 + 65536 = 32768, as expected. So the \uNNNN commands make sense.
My problem is with the text escape sequences, such as the \'97\'73 at the end. I do not understand why they are there. I could code my parser to ignore the commands that trail the \uNNNN command. But I compared this with TextEdit's RTF output, and for the second character it emits only the text escape sequences:
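The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the "add 65536 to negative values" rule quoted from the specification; the function name `rtf_u_to_codepoint` is made up for this example and is not part of any RTF library.

```python
def rtf_u_to_codepoint(n: int) -> int:
    # The argument of \uN is a signed 16-bit decimal number.
    # Negative values stand for code units above 32767, so add 65536.
    if n < 0:
        return n + 65536
    return n


# The two values from the Word output above.
first = rtf_u_to_codepoint(32767)
second = rtf_u_to_codepoint(-32768)
print(first, second)
```

Both values from the Word sample come out as the characters I put in the document (32767 and 32768).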
\uc0\u32767 \'97\'73
This looks like an attempt at a double-byte Unicode escape sequence, with each byte written as a hexadecimal text escape. But 0x9773 is 38771, not 32768, so I don't understand how to extract the desired Unicode value from this data. Any ideas?
Update: I ran some additional tests to see how TextEdit handles character codes 32767 through 32777. The resulting RTF looks like this:
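For reference, here is roughly how I checked that number. It is a small sketch that pulls the hex bytes out of the \'hh escapes and reads them as a plain 16-bit integer in both byte orders; `hex_escapes_to_bytes` is a helper name I made up, and treating the pair as a raw integer is my assumption, not something the specification says.

```python
import re


def hex_escapes_to_bytes(rtf: str) -> bytes:
    # Collect every \'hh escape in the fragment as one raw byte.
    return bytes(int(h, 16) for h in re.findall(r"\\'([0-9a-fA-F]{2})", rtf))


raw = hex_escapes_to_bytes(r"\'97\'73")

# Interpret the two bytes as one integer, trying both byte orders.
big_endian = int.from_bytes(raw, "big")        # 0x9773
little_endian = int.from_bytes(raw, "little")  # 0x7397
print(big_endian, little_endian)
```

Neither byte order gives 32768, which is why I doubt these bytes are the code point itself.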
\u32767
\'97\'73
\'98\'56
\u32770
\'8d\'6c
\'e3\'cc
\'8e\'d2
\'e3\'cb
\u32775
\u32776
\'c2\'56
RTF TextEdit, Word, . .