How can I read Chinese characters correctly using a scanner in Java?

Programming language: Java
Task: creating a hash function that maps Chinese strings to numbers
Problem: reading and displaying Chinese characters correctly

This is a homework question, but I am not asking how to do it; I am only having trouble with reading Chinese characters.

A brief description of my task: to create a hash function for matching the (Chinese) names of students in our class with their student identifiers and other satellite data (gender, phone, etc.).

I am still thinking it through, but, as in other languages, my plan is to take each character's encoded value and run it through a hash function to come up with a unique value, if I'm not mistaken.

Here is what I wrote to check whether that line of thinking is valid:

    // test whether console can read chinese characters
    Scanner s = new Scanner(System.in);
    System.out.print("Please enter a Chinese character: ");
    int chi = (int)s.next().toCharArray()[0];
    System.out.println("\nThe string entered is " + chi);

If I use a simple System.out.println("character") statement, the correct character is displayed.

But, as seen above, when I use a Scanner to read the input, convert the String to a char array and then take the first char's Unicode integer value, I get a ridiculous number and cannot display the character correctly.

I understand that I could just use this erroneous value to build the hash function, but to avoid possible collisions (I don't know whether the erroneous values are UNIQUE) and for the sake of learning, could you point out how I can read Chinese characters consistently on different machines?

Always grateful for your thoughts. :D

Baggio.

+4
3 answers

You are overthinking this. Every String is already (conceptually) a sequence of characters, Chinese characters included. Encoding only comes into play when you need to convert the String to bytes, which you do not need for your purpose. Just use the String's hash code. In fact, when you create a HashMap<String,YourObject>, this is exactly what happens behind the scenes.
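A minimal sketch of that suggestion follows. The Student class, its fields and the sample records are hypothetical and only there for illustration; the Chinese names are written as Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.util.HashMap;
import java.util.Map;

public class StudentLookup {
    // Hypothetical record type; the fields are illustrative only.
    static final class Student {
        final String id;
        final String phone;

        Student(String id, String phone) {
            this.id = id;
            this.phone = phone;
        }
    }

    // Builds a small name -> student map with made-up sample data.
    static Map<String, Student> buildDirectory() {
        Map<String, Student> byName = new HashMap<>();
        // Escapes spell the names "Wang Xiaoming" and "Li Hua" in Chinese.
        byName.put("\u738b\u5c0f\u660e", new Student("S001", "555-0100"));
        byName.put("\u674e\u534e", new Student("S002", "555-0101"));
        return byName;
    }

    public static void main(String[] args) {
        Map<String, Student> byName = buildDirectory();
        String name = "\u674e\u534e";
        // hashCode() is computed from the String's chars; no byte encoding is involved.
        System.out.println(name.hashCode());
        // HashMap calls hashCode() and equals() on the key internally.
        System.out.println(byName.get(name).id); // S002
    }
}
```

The map never sees bytes, so the lookup works the same on any machine once the name has been read into a String correctly.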

+1

When creating a Scanner, you can also specify which character encoding to use. See the documentation for the Scanner constructors.
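As a rough illustration, the sketch below feeds the same UTF-8 bytes to two Scanners built with the Scanner(InputStream, String charsetName) constructor. The readToken helper is a hypothetical name, and the byte array stands in for console input; with System.in, the right charset name depends on how the terminal is configured, which is an assumption you would have to check on each machine.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class EncodedScannerDemo {
    // Hypothetical helper: reads the first whitespace-delimited token,
    // decoding the stream with the given charset.
    static String readToken(InputStream in, String charsetName) {
        Scanner s = new Scanner(in, charsetName);
        return s.next();
    }

    public static void main(String[] args) {
        // Two Chinese characters, encoded as UTF-8 bytes to simulate console input.
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset recovers the original characters.
        String right = readToken(new ByteArrayInputStream(utf8), "UTF-8");
        System.out.println(right.codePointAt(0)); // 20013, i.e. U+4E2D

        // Decoding with a mismatched charset yields different, meaningless chars.
        String wrong = readToken(new ByteArrayInputStream(utf8), "ISO-8859-1");
        System.out.println(right.equals(wrong)); // false
    }
}
```

For console input the equivalent call would be new Scanner(System.in, "UTF-8"), assuming the terminal really sends UTF-8.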

+3

If you are not using basic ASCII characters, you need to consider what character set you are using. Most often it will be UTF-8, but other character sets can be used.

Keep in mind that a non-ASCII character can take up more than 1 byte once it is encoded. This applies to Chinese characters.
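A small sketch of the size difference, assuming UTF-8 as the encoding (utf8Length is a hypothetical helper name):

```java
import java.nio.charset.StandardCharsets;

public class ByteLengthDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));      // 1
        System.out.println(utf8Length("\u4e2d")); // 3 (U+4E2D, a common Chinese character)
        // The String itself still reports one char for each of them.
        System.out.println("\u4e2d".length());    // 1
    }
}
```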

When working with multi-byte characters, you will need to think in terms of code points (the integers that Unicode assigns to each character, independent of any particular encoding such as UTF-8), instead of single-byte characters.

Newer versions of Java allow you to iterate over a string using code points, for example with String.codePointAt. Take a look at the Java API for String.
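One way to walk a string by code point is sketched below, using String.codePointAt and Character.charCount. The codePointsOf helper is a hypothetical name, and the second character in the sample (U+20000) was picked only because it lies outside the Basic Multilingual Plane and so needs two Java chars.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDemo {
    // Collects the Unicode code points of a string, stepping over
    // surrogate pairs as a single unit.
    static List<Integer> codePointsOf(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            // Advance by 1 char for most characters, 2 for a surrogate pair.
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // U+4E2D followed by U+20000 (written as a surrogate pair).
        String s = "\u4e2d\ud840\udc00";
        System.out.println(s.length());      // 3 chars
        System.out.println(codePointsOf(s)); // [20013, 131072]
    }
}
```

For most everyday Chinese names each character fits in a single char, so charAt and codePointAt give the same number; the loop above only matters for the rarer characters beyond U+FFFF.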

+3

Source: https://habr.com/ru/post/1439844/

