Reading a strange unicode character in Java?

Question

I have the following text file:

(screenshot of the file contents; image not available)

The file was saved using utf-8 encoding.

I used the following code to read the contents of a file:

    FileReader fr = new FileReader("f.txt");
    BufferedReader br = new BufferedReader(fr);
    String s1 = br.readLine();
    String s2 = br.readLine();
    System.out.println("s1 = " + s1.length());
    System.out.println("s2 = " + s2.length());

Output:

    s1 = 5
    s2 = 4

Then I used s1.charAt(0) to get the first character of s1, and it was the character '' (empty), which is why s1 has a length of 5. Even after calling s1.trim(), its length is still 5. I do not know why this happens. It works correctly if the file is saved with ASCII encoding.

java file-io unicode

ipkiss Mar 27 '12 at 11:54

5 answers

Michael Borgwardt · Answer 1 · 2012-03-27T11:58:00+0000

Notepad apparently saved the file with a byte order mark (BOM), a non-printable character at the beginning that just marks the file as UTF-8 but is not required (and not recommended). You can ignore or delete it; other text editors often give you the choice of saving UTF-8 with or without a BOM.
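Ignoring the BOM while reading can be sketched roughly as follows. This assumes the file is decoded as UTF-8, in which case Java passes a leading BOM through as the character U+FEFF rather than dropping it. The class and method names (BomSkippingReader, stripBom, readLines) are hypothetical and only for illustration; "f.txt" is the file name from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class BomSkippingReader {
    // U+FEFF is what the UTF-8 byte sequence EF BB BF decodes to.
    private static final char BOM = '\uFEFF';

    // Returns the string without a leading BOM, or unchanged if there is none.
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    // Reads all lines as UTF-8, dropping a BOM from the first line if present.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Only the very first line of the stream can carry the BOM.
                lines.add(lines.isEmpty() ? stripBom(line) : line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get("f.txt"))) {
            for (String line : readLines(in)) {
                System.out.println(line + " -> length " + line.length());
            }
        }
    }
}
```

Stripping the character after decoding keeps the sketch independent of how the bytes were obtained; deleting the BOM from the file itself, as suggested above, makes the extra step unnecessary.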

Björn · Answer 2 · 2012-03-27T11:57:40+0000

In fact, this is not an empty character; it is a BOM, a "Byte Order Mark". Windows uses the BOM to mark files as Unicode-encoded (UTF-8, UTF-16, or UTF-32).

I think you can save files without a BOM even in Notepad (the BOM is not actually required).
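To check which BOM, if any, a file starts with, one could compare its first bytes against the known marks: EF BB BF for UTF-8, FE FF and FF FE for UTF-16 (big and little endian), and 00 00 FE FF and FF FE 00 00 for UTF-32. The following is only a sketch; BomSniffer and detect are hypothetical names, and the returned strings are plain labels, not something the JDK produces.

```java
class BomSniffer {
    // Returns a label for the encoding implied by a leading BOM, or null if none is found.
    // Caveat: a UTF-16LE file whose first character is U+0000 would be reported as UTF-32LE.
    static String detect(byte[] b) {
        // Check the 4-byte UTF-32 marks first, because FF FE 00 00 also starts with FF FE.
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of b (as unsigned values) with the given prefix.
    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(detect(head)); // prints "UTF-8"
    }
}
```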

Edwin Dalorzo · Answer 3 · 2012-03-27T12:47:36+0000

Well, you can try reading your file using a different encoding.

You need to use the InputStreamReader class as the reader argument for your BufferedReader. It accepts an encoding. Check out the Javadoc for it.

Something like:

    BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream("jedis.txt"), "UTF-8"));

Or you can set the default encoding to UTF-8 with the file.encoding system property:

 java -Dfile.encoding=UTF-8 com.jediacademy.Runner arg1 arg2 ...

You can also set it as a system property at runtime with System.setProperty(...) if it is needed only for that specific file, but in that case I think I would prefer InputStreamReader. (Note that the JVM typically reads file.encoding once at startup, so changing it at runtime may have no effect.)

With the system property set, you can use FileReader and expect it to use UTF-8 as the default encoding. In this case, that applies to all files that you read and write with the default encoding.

If you intend to detect decoding errors in your file, you will have to use the InputStreamReader approach with the constructor that accepts a CharsetDecoder.

Something like:

    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream("jedis.txt"), decoder));

You can choose between the actions IGNORE, REPLACE, and REPORT.
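The difference between REPLACE and REPORT can be illustrated on an in-memory byte array instead of a file. This is a minimal sketch under that assumption; StrictDecodeDemo, decodeStrict, and decodeLenient are hypothetical names. The lenient variant relies on new String(bytes, charset), which substitutes the replacement character U+FFFD for malformed input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    // Decodes bytes as UTF-8 and throws on bad input instead of substituting U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Decodes bytes as UTF-8, silently replacing bad input with U+FFFD.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] valid = "h\u00E9llo".getBytes(StandardCharsets.UTF_8);
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] invalid = {(byte) 0xC3, (byte) 0x28};
        for (byte[] input : new byte[][] {valid, invalid}) {
            try {
                System.out.println("strict:  " + decodeStrict(input));
            } catch (CharacterCodingException e) {
                System.out.println("strict:  rejected, " + e);
            }
            System.out.println("lenient: " + decodeLenient(input));
        }
    }
}
```

Note that a BOM is valid UTF-8, so a strict decoder does not reject it; it still arrives as U+FEFF at the start of the decoded text.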

Oofpez · Answer 4 · 2012-03-27T12:10:51+0000

It may be the null character: for example, (char) 0 prints as ''.

Perhaps FileReader is reading a null character at the beginning of the file. I'm not sure why, though...
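Whether the invisible character is a null character or something else can be checked by printing its code point rather than the character itself. A small sketch, with hypothetical names (CharInspector, codePoint) and made-up sample strings, since the real file contents are only known from the question's screenshot:

```java
class CharInspector {
    // Formats a char as U+ followed by four hex digits, so invisible characters can be told apart.
    static String codePoint(char c) {
        return String.format("U+%04X", (int) c);
    }

    public static void main(String[] args) {
        String withBom = "\uFEFFabcd"; // hypothetical line starting with a byte order mark
        String withNul = "\u0000abcd"; // hypothetical line starting with a null character
        System.out.println(codePoint(withBom.charAt(0))); // prints U+FEFF
        System.out.println(codePoint(withNul.charAt(0))); // prints U+0000
    }
}
```

Calling codePoint(s1.charAt(0)) on the line read from the file would show which of the two cases applies.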

Stephen C · Answer 5 · 2012-03-27T12:18:15+0000

Even if I tried to use s1.trim(); its length is still 5.

I expect you are doing this:

  s1.trim();

It does not do what you want. Java strings are immutable, and the trim() method creates a new string ... which you then throw away. You need to do this:

  s1 = s1.trim();

... which assigns the reference to the new string created by trim() back to s1, so that you can use it.

(Note: trim() does not always create a new string. If the source string has no leading or trailing whitespace, the trim() method simply returns it as-is.)
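Both points can be tried out with a short sketch; TrimDemo and trimReturnsSameObject are hypothetical names and the sample strings are made up. It also shows a caveat relevant to the question: trim() only removes characters up to U+0020, so a leading U+FEFF survives even when the result is assigned.

```java
class TrimDemo {
    // Returns true if trim() handed back the very same String object.
    static boolean trimReturnsSameObject(String s) {
        return s.trim() == s;
    }

    public static void main(String[] args) {
        String padded = "  abcd  ";
        padded.trim();                        // result is thrown away; padded is unchanged
        System.out.println(padded.length());  // prints 8
        padded = padded.trim();               // keep the reference to the trimmed string
        System.out.println(padded.length());  // prints 4

        // Nothing to trim, so the original object comes back.
        System.out.println(trimReturnsSameObject("abcd")); // prints true

        // Hypothetical line with a leading byte order mark: trim() leaves it in place.
        String s1 = "\uFEFFabcd";
        s1 = s1.trim();
        System.out.println(s1.length());      // prints 5
    }
}
```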
