I am trying to output Unicode text to an RTF file from a Python script. For background, Wikipedia says:
To output Unicode, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode UTF-16 code unit number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter bāʼ ب, specifying that older programs which do not support Unicode should render it as a question mark instead.
There is also this question on outputting RTF from Java, and this one on doing it in C#.
However, I cannot figure out how to output a Unicode code point as a "16-bit signed decimal integer giving the Unicode UTF-16 code unit number" from Python. I tried this:
    for char in unicode_string:
        print '\\' + 'u' + ord(char) + '?',
but the output shows up as gibberish when opened in a word processor. The problem seems to be that this is not the UTF-16 code number, but I am not sure how to get that: I can encode the string as UTF-16, but how do I get the code number out of it?
By the way, PyRTF does not support Unicode (it is listed as a "todo"), and while pyrtf-NG is supposed to, that project does not seem to be maintained and has little documentation, so I am wary of using it in a quasi-production system.
Edit: My mistake. There are two errors in the above code: as pointed out by Wobble below, the string has to be a unicode string, not an already encoded one, and the above code produces a result with spaces between the characters. The correct code is:
    convertstring = ""
    for char in unicode(<my_encoded_string>, 'utf-8'):
        convertstring = convertstring + '\\' + 'u' + str(ord(char)) + '?'
This works fine, at least with OpenOffice. I am leaving it here as a reference for others (one error was further corrected after the discussion below).
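For anyone on Python 3, here is a minimal sketch of the same idea, not a definitive implementation. It assumes the input is already a str (so no decoding step is needed), and the function name rtf_escape is hypothetical. Unlike the loop above, it takes the word "signed" in the quote literally, so code units above 32767 are written as negative numbers, and a character outside the Basic Multilingual Plane is written as two \u words, one per UTF-16 surrogate. Passing plain ASCII through unchanged is my own choice here; escaping every character, as the code above does, should work as well.

```python
import struct


def rtf_escape(text):
    # Hypothetical helper: returns text with non-ASCII characters (and the
    # RTF specials \ { }) written as \uN? control words.
    out = []
    for char in text:
        # Assumption: plain ASCII can be written as-is, apart from the
        # three characters that have a special meaning in RTF.
        if ord(char) < 128 and char not in '\\{}':
            out.append(char)
            continue
        # Encoding to UTF-16 turns a character outside the BMP into two
        # 16-bit code units (a surrogate pair); everything else is one unit.
        data = char.encode('utf-16-le')
        # '<h' reads each 2-byte unit as a little-endian signed 16-bit
        # integer, so units above 32767 come out negative.
        for (unit,) in struct.iter_unpack('<h', data):
            # '?' is the fallback shown by readers without Unicode support.
            out.append('\\u%d?' % unit)
    return ''.join(out)


print(rtf_escape('ب'))  # → \u1576?
```

The returned string still has to be written into a valid RTF document body; this sketch only covers the escaping step.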