Outputting Unicode text to an RTF file in Python

I am trying to output Unicode text to an RTF file from a Python script. For background, Wikipedia says:

To output Unicode, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode UTF-16 code unit number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter bāʼ ب, specifying that older programs which do not have Unicode support should render it as a question mark instead.

There is also this question about outputting RTF from Java, and this one about doing so in C#.

However, what I cannot figure out is how to output the Unicode code point as a "16-bit signed decimal integer giving the Unicode UTF-16 code unit number" from Python. I tried this:

    for char in unicode_string:
        print '\\' + 'u' + ord(char) + '?',

but the output only renders as gibberish when opened in a word processor. The problem appears to be that it is not the UTF-16 code number, but I am not sure how to get that; a string can be encoded as UTF-16, but how do I get the code numbers out of it?
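One way to inspect the UTF-16 code numbers is sketched below. It assumes Python 3 (the snippets in this thread are Python 2), and the helper name `utf16_signed_units` is made up for illustration. The idea is to encode the text as little-endian UTF-16 without a byte order mark and unpack the bytes as signed 16-bit integers.

```python
import struct

def utf16_signed_units(text):
    """Return the UTF-16 code units of text as signed 16-bit integers."""
    # 'utf-16-le' emits no byte order mark; every code unit is 2 bytes.
    data = text.encode('utf-16-le')
    # '<' selects little-endian, 'h' is a signed 16-bit integer.
    return list(struct.unpack('<%dh' % (len(data) // 2), data))

print(utf16_signed_units('\u0628'))      # Arabic letter beh: prints [1576]
print(utf16_signed_units('\U0001F600'))  # surrogate pair: prints [-10179, -8704]
```

For characters in the Basic Multilingual Plane below U+8000, the result is simply `ord(char)`, which is why the Wikipedia example for ب is `\u1576?`.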

By the way, PyRTF does not support Unicode (it is listed as a "todo"), and while pyrtf-NG is supposed to, that project does not seem to be maintained and has little documentation, so I am wary of using it in a quasi-production system.

Edit: My mistake. There are two errors in the code above: as Wobble points out below, the string has to be a Unicode string rather than an already-encoded one, and the code above produces output with spaces between the characters. The correct code is:

    convertstring = ""
    for char in unicode(<my_encoded_string>, 'utf-8'):
        convertstring = convertstring + '\\' + 'u' + str(ord(char)) + '?'

This works fine, at least with OpenOffice. I am leaving this here as a reference for others (one further mistake is corrected after the discussion below).

+4
2 answers

Based on the information in your latest edit, I think this function will work correctly. Also see the improved version below.

    def rtf_encode(unistr):
        return ''.join([c if ord(c) < 128 else u'\\u' + unicode(ord(c)) + u'?' for c in unistr])

    >>> test_unicode = u'\xa92012'
    >>> print test_unicode
    ©2012
    >>> test_utf8 = test_unicode.encode('utf-8')
    >>> print test_utf8
    ©2012
    >>> print rtf_encode(test_utf8.decode('utf-8'))
    \u169?2012

Here's another version that is broken up a bit to make it easier to understand. I also made it consistent about returning an ASCII string, rather than keeping Unicode and fudging it at the join. It also incorporates a fix based on the comments.

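    def rtf_encode_char(unichar):
        code = ord(unichar)
        if code < 128:
            # Plain ASCII passes through unchanged.
            return str(unichar)
        # RTF wants a signed 16-bit value, so codes above 32767 wrap to negative.
        return '\\u' + str(code if code <= 32767 else code-65536) + '?'

    def rtf_encode(unistr):
        return ''.join(rtf_encode_char(c) for c in unistr)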
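On Python 3, where `str` is already Unicode and `ord` can return values above 0xFFFF, the same idea might look like the sketch below. This is an adaptation rather than the code from this answer. Writing characters outside the Basic Multilingual Plane as two `\u` words (a UTF-16 surrogate pair) is an assumption based on the "UTF-16 code unit" wording quoted in the question, and the name `rtf_encode_py3` is invented here.

```python
def rtf_encode_py3(text):
    """Escape non-ASCII characters in text as RTF \\uN? control words."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)  # ASCII passes through unchanged
            continue
        if code > 0xFFFF:
            # Split into a UTF-16 surrogate pair (high, low).
            code -= 0x10000
            units = [0xD800 + (code >> 10), 0xDC00 + (code & 0x3FF)]
        else:
            units = [code]
        for unit in units:
            # Values above 32767 wrap around to negative numbers.
            signed = unit if unit <= 32767 else unit - 65536
            out.append('\\u%d?' % signed)
    return ''.join(out)

print(rtf_encode_py3('\xa92012'))  # prints \u169?2012
```

Like the version above, this sketch leaves backslashes, braces, and control characters untouched, so the caller still has to handle those separately.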
+2

Mark Ransom's answer is not entirely correct: it will not correctly encode code points beyond U+7FFF, nor will it escape characters below 0x20, as the RTF standard recommends.

I have created a simple module called rtfunicode that encodes Python Unicode strings into RTF control codes, and wrote about it on my blog.

In summary, my method uses a regular expression to map the relevant code points to RTF control codes, producing output suitable for inclusion in PyRTF or pyrtf-ng.
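A rough Python 3 sketch of a regular-expression approach follows. It is not the rtfunicode module's code: the pattern, the names `_ESCAPE_RE`, `_escape` and `rtf_escape`, and the choice to escape backslashes and braces along with control and non-ASCII characters are all assumptions made for illustration.

```python
import re

# Assumed escape set: C0 controls, backslash, braces, and all non-ASCII.
_ESCAPE_RE = re.compile(r'[\x00-\x1f\\{}\x80-\U0010ffff]')

def _escape(match):
    # UTF-16 gives one 2-byte unit per BMP character, two for astral ones.
    data = match.group(0).encode('utf-16-le')
    units = (int.from_bytes(data[i:i + 2], 'little', signed=True)
             for i in range(0, len(data), 2))
    # Each unit becomes a signed decimal \uN control word with a '?' fallback.
    return ''.join('\\u%d?' % unit for unit in units)

def rtf_escape(text):
    """Replace every character matched by _ESCAPE_RE with \\uN? words."""
    return _ESCAPE_RE.sub(_escape, text)

print(rtf_escape('\xa92012 {draft}'))  # prints \u169?2012 \u123?draft\u125?
```

Doing the whole replacement in a single `re.sub` call means unaffected ASCII runs are copied through untouched, and only matched characters pay for the encode step.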

+1

Source: https://habr.com/ru/post/1403908/

