Parsing Unicode values > 32767 from RTF files

I am working on an RTF parser and am having difficulty handling Unicode.

The RTF specification states that "Unicode values greater than 32767 must be expressed as negative numbers" ( http://www.biblioscape.com/rtf15_spec.htm#Heading9 ), so to get the Unicode numerical value you add 65536 to these negative numbers.

I tested this scenario by creating a document containing the Unicode characters 32767 and 32768. Word (v2011 on Mac) produces the following RTF syntax for these two characters:

\u32767\'5f\loch\af556\hich\af31506\dbch\f556 \uc2\u-32768\'97\'73

For the second, -32768 + 65536 = 32768, as expected. So the \uNNNN commands make sense.
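The arithmetic above can be sketched in Python. This is only an illustration: the function name `code_unit_from_rtf_arg` is made up for this example, and it assumes the argument of \uNNNN has already been parsed into an integer.

```python
def code_unit_from_rtf_arg(n: int) -> int:
    # The argument of \uN is a signed 16-bit number. Taking it modulo 65536
    # adds 65536 to negative arguments and leaves non-negative ones unchanged.
    return n % 65536


print(code_unit_from_rtf_arg(32767))   # 32767
print(code_unit_from_rtf_arg(-32768))  # 32768
```

Using the modulo instead of an explicit `if n < 0` branch also tolerates writers that emit values above 32767 as positive numbers.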

My problem is with the text escape sequences like \'97\'73 at the end. I do not understand why they are there. I could code my parser to ignore the escapes that trail a \uNNNN command. But I compared this with the RTF output of TextEdit, and for the second character it emits only the text escape sequences:

\uc0\u32767 \'97\'73

This looks like an attempt at a double-byte Unicode escape sequence, and this kind of text escape is in hexadecimal. But 0x9773 is 38771, not 32768, so I don't understand how I am supposed to extract the desired Unicode value from this data. Any ideas?

Update: I ran some additional tests to see how TextEdit handles character codes 32767 through 32777. The resulting RTF looks like this:

\u32767 
\'97\'73
\'98\'56
\u32770 
\'8d\'6c
\'e3\'cc
\'8e\'d2
\'e3\'cb
\u32775 
\u32776 
\'c2\'56

RTF TextEdit, Word, . .


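The \u command in RTF is followed by the Unicode character code and then by an ASCII fallback representation. RTF readers that do not support Unicode ignore \u and display the fallback; readers that do support it must skip the fallback. The number of fallback bytes that follow each Unicode character is set by \uc, so after \uc2 a parser skips two bytes after every \u.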
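The interaction between \u and \uc can be sketched as a small Python scanner. This is a simplified illustration, not a full RTF parser. The name `scan_fragment` and the token tuples are invented for this example. It ignores group scoping with { and } and treats every skipped token as one fallback character. Hex escapes that are not skipped are returned as raw bytes, because decoding them requires knowing the code page of the current font, which this sketch does not track.

```python
import re

# One alternative per token kind: hex escape, control word, plain character.
TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})"        # hex escape such as \'97
    r"|\\([a-zA-Z]+)(-?\d+)? ?"   # control word, optional argument, optional delimiter space
    r"|([^\\{}\r\n])"             # any other plain text character
)


def scan_fragment(rtf):
    uc = 1      # RTF default: one fallback character follows each \u
    skip = 0    # fallback characters still to be skipped
    out = []
    for m in TOKEN.finditer(rtf):
        hex_byte, word, arg, char = m.groups()
        if skip:
            # Each token after \u counts as one fallback character.
            skip -= 1
            continue
        if word == "uc" and arg is not None:
            uc = int(arg)
        elif word == "u" and arg is not None:
            # Negative arguments stand for values above 32767.
            out.append(("u", int(arg) % 65536))
            skip = uc
        elif hex_byte is not None:
            out.append(("byte", int(hex_byte, 16)))
        elif char is not None:
            out.append(("char", char))
        # Any other control word is ignored in this sketch.
    return out


word_rtf = r"\u32767\'5f\loch\af556\hich\af31506\dbch\f556 \uc2\u-32768\'97\'73"
textedit_rtf = r"\uc0\u32767 \'97\'73"
print(scan_fragment(word_rtf))
print(scan_fragment(textedit_rtf))
```

For the Word fragment the scanner yields the two code points 32767 and 32768 and drops all fallback bytes. For the TextEdit fragment, \uc0 means nothing is skipped, so \'97\'73 comes back as two raw bytes that still have to be decoded with the document's character set.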


Source: https://habr.com/ru/post/1535125/

