Writing a Unicode file to RTF

I am trying to write lines in different languages to an RTF file. I have tried a few different things. I use Japanese here as an example, but it is the same for the other languages I have tried.

    public void writeToFile() {
        String strJapanese = "日本語";
        DataOutputStream outStream;
        File file = new File("C:\\file.rtf");
        try {
            outStream = new DataOutputStream(new FileOutputStream(file));
            outStream.writeBytes(strJapanese);
            outStream.close();
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }

I tried:

 byte[] b = strJapanese.getBytes("UTF-8"); String output = new String(b); 

Or more specifically:

 byte[] b = strJapanese.getBytes("Shift-JIS"); String output = new String(b); 

The output stream also has a writeUTF method:

 outStream.writeUTF(strJapanese); 

I also tried passing the byte[] directly to the output stream's write method. All of the above give me garbled characters for everything except Western European languages. To check whether it worked, I opened the resulting document in Notepad++ and set the appropriate encoding. I also tried OpenOffice, where you can choose the encoding and font when opening the document.

If it does work and my computer just cannot open the file correctly, is there any way to verify that?

+2
3 answers

Strings in Java are Unicode (stored internally as UTF-16), but when you write one out you need to specify the encoding:

    try {
        FileOutputStream fos = new FileOutputStream("test.txt");
        Writer out = new OutputStreamWriter(fos, "UTF8");
        out.write(str);
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }

ref: http://download.oracle.com/javase/tutorial/i18n/text/stream.html

+3

DataOutputStream outStream;

You probably don't want a DataOutputStream for writing an RTF file. DataOutputStream is for writing binary structures to a file, but RTF is textual. Typically, an OutputStreamWriter with the appropriate character set passed to its constructor is the way to write text files.

outStream.writeBytes(strJapanese);

In particular, this fails because writeBytes really does write bytes, even though you are passing it a String. A much more appropriate parameter type would have been byte[], but this is just one of the places where Java's handling of bytes versus characters is confusing. The way it converts your string to bytes is simply to take the lower eight bits of each UTF-16 code unit and discard the rest. The result is ISO-8859-1 encoding, with garbled nonsense for every character that does not exist in ISO-8859-1.
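To make the truncation visible, here is a small sketch that captures what writeBytes emits into a ByteArrayOutputStream and prints it as hex. The class and helper names (WriteBytesDemo, writeBytes, hex) are made up for illustration, and the Japanese string is written with Java escapes so the source file stays plain ASCII.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WriteBytesDemo {
    // Hypothetical helper: returns the bytes DataOutputStream.writeBytes emits.
    static byte[] writeBytes(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeBytes(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Hypothetical helper: formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+65E5 U+672C U+8A9E (the Japanese example string from the question)
        String japanese = "\u65E5\u672C\u8A9E";
        // Only the low byte of each UTF-16 code unit survives
        System.out.println(hex(writeBytes(japanese)));     // E5 2C 9E
        // Characters that exist in ISO-8859-1 come through intact
        System.out.println(hex(writeBytes("caf\u00E9")));  // 63 61 66 E9
    }
}
```

Three characters go in and three bytes come out, none of which can be mapped back to the original text.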

 byte[] b = strJapanese.getBytes("UTF-8"); String output = new String(b); 

This doesn't really do anything useful. You encode the string into UTF-8 bytes and then decode them back into a string using the default encoding. It is almost always a mistake to rely on the default encoding, since it is unpredictable from one machine to the next.
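A quick sketch of why that round trip is fragile. Since the default charset differs between machines, the example passes the decoding charset explicitly and uses ISO-8859-1 as a stand-in for "a default that is not UTF-8"; that stand-in, and the names RoundTripDemo and roundTrip, are assumptions for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Hypothetical helper: encodes as UTF-8, then decodes with the given charset.
    static String roundTrip(String s, Charset decodeWith) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E"; // three characters

        // Decoding with the charset used for encoding restores the original
        System.out.println(roundTrip(japanese, StandardCharsets.UTF_8).equals(japanese)); // true

        // A mismatched charset turns the 9 UTF-8 bytes into 9 unrelated characters
        System.out.println(roundTrip(japanese, StandardCharsets.ISO_8859_1).length()); // 9

        // new String(bytes) without a charset uses whatever this prints
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

Even when the round trip happens to succeed, all it does is hand back the string you started with, so nothing about the file output changes.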

 outStream.writeUTF(strJapanese); 

This is a better stab at writing UTF-8, but it's still not quite right, because it uses Java's "modified UTF-8" encoding (and prefixes the data with a two-byte length). More importantly, RTF files do not actually support UTF-8 and shouldn't directly include any non-ASCII characters at all.
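For the curious, this sketch dumps what writeUTF actually produces. The helper names (WriteUtfDemo, writeUtf, hex) are hypothetical. It shows the two-byte length prefix and one place where modified UTF-8 differs from standard UTF-8: the NUL character.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class WriteUtfDemo {
    // Hypothetical helper: returns the bytes DataOutputStream.writeUTF emits.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Hypothetical helper: formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Length prefix 00 03, then the three UTF-8 bytes of U+65E5
        System.out.println(hex(writeUtf("\u65E5")));                    // 00 03 E6 97 A5
        // Modified UTF-8 writes NUL as two bytes, standard UTF-8 as one
        System.out.println(hex(writeUtf("\0")));                        // 00 02 C0 80
        System.out.println(hex("\0".getBytes(StandardCharsets.UTF_8))); // 00
    }
}
```

Any reader expecting plain text would see the length prefix as stray bytes at the start of the file.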

Traditionally, non-ASCII characters (128 and up) should be written as hex byte escapes such as \'80, and the encoding for them is specified, if at all, in the font's \fcharset and \cpg escapes, which are very, very annoying to deal with and do not offer UTF-8 as an option.
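As a rough sketch of the traditional hex-escape approach, the hypothetical helper below encodes text in a given charset and writes every byte of 0x80 or above as a hex escape. It assumes a single-byte code page and uses ISO-8859-1 only because every JVM is guaranteed to have it; a real file would declare a Windows code page. It does not escape backslashes or braces, and characters missing from the charset are silently replaced by getBytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HexEscapeDemo {
    // Hypothetical helper: bytes below 0x80 pass through unchanged, all
    // others become backslash, apostrophe and two lower-case hex digits.
    // Only suitable for single-byte code pages.
    static String hexEscape(String text, Charset codePage) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(codePage)) {
            int value = b & 0xFF;
            if (value < 0x80) {
                sb.append((char) value);
            } else {
                sb.append(String.format("\\'%02x", value));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is byte E9 in ISO-8859-1
        System.out.println(hexEscape("caf\u00E9", StandardCharsets.ISO_8859_1)); // caf\'e9
    }
}
```

The escaped bytes only mean something together with the code page declared elsewhere in the file, which is exactly the annoying part.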

In more modern RTF, you get \u1234x escape sequences, as in Dubbler's answer (+1). Each escape encodes one UTF-16 code unit, which corresponds to a Java char, so it's not too difficult to do a regex replace of all non-ASCII characters with their escaped variants.
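Here is one way that replacement could look. It is a sketch that uses a plain loop over the chars rather than a regular expression; the names RtfUnicodeEscaper and escape are made up. It assumes "?" as the fallback character, writes values above 32767 as negative numbers because RTF reads the number as a signed 16-bit value, and also escapes backslashes and braces. Line breaks and other control characters are not handled.

```java
class RtfUnicodeEscaper {
    // Hypothetical helper: returns text that is safe to place in an RTF body.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                // RTF syntax characters need a leading backslash
                sb.append('\\').append(c);
            } else if (c < 0x80) {
                sb.append(c);
            } else {
                // One escape per UTF-16 code unit; the cast to short gives
                // the signed 16-bit decimal value that RTF expects
                sb.append("\\u").append((int) (short) c).append('?');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+65E5 U+672C U+8A9E; the last one is above 32767 and goes negative
        System.out.println(escape("\u65E5\u672C\u8A9E"));
    }
}
```

Because surrogate pairs are simply two chars, characters outside the Basic Multilingual Plane come out as two consecutive escapes without any special handling.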

This is supported by Word 97 and later, but some other tools may ignore the Unicode escape and fall back to the x replacement character.

RTF is not a very nice format.

+3

You can write any Unicode character, expressed as its decimal number, using the control word \u. For instance, \u1234? represents the character whose Unicode code point is 1234 (decimal); the ? is a replacement character for cases where the character cannot be adequately represented (for example, because the font does not contain it).
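Putting this together for the question's example, here is a minimal sketch that writes a complete RTF file containing only ASCII bytes. The header (font table with Arial) is an assumption about what a typical reader accepts, not something taken from the question, and the names MinimalRtfDemo and document are hypothetical. The output goes to a temp file rather than C:\file.rtf.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class MinimalRtfDemo {
    // Hypothetical helper: wraps already-escaped body text in a minimal
    // RTF document (assumed header, one font).
    static String document(String escapedBody) {
        return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0 " + escapedBody + "}";
    }

    public static void main(String[] args) throws IOException {
        // 26085, 26412 and -30050 are U+65E5, U+672C and U+8A9E written as
        // signed 16-bit decimals, each followed by "?" as the fallback
        String rtf = document("\\u26085?\\u26412?\\u-30050?");

        Path file = Files.createTempFile("unicode-demo", ".rtf");
        // Everything in the file is ASCII, so no encoding question arises
        Files.write(file, rtf.getBytes(StandardCharsets.US_ASCII));
        System.out.println("wrote " + file);
    }
}
```

Since the file is pure ASCII, you can check it in any text editor without choosing an encoding. Whether the characters then display depends on the RTF reader and its fonts.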

+2

Source: https://habr.com/ru/post/1403910/

