Java bug? Scanner can't read a GB2312 file directly

I have a file encoded in GB2312 (Chinese). It was downloaded from http://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO with wget under Windows and saved as ModernChineseCharacterFrequencyList.html.

The code below shows that Java fails to read the file to the end when the Scanner is created one way, but succeeds when it is created another way.

Namely, if the Scanner is created with scanner = new Scanner(src, "GB2312"), the code does not work. If it is created with scanner = new Scanner(new FileInputStream(src), "GB2312"), it works.

The commented-out delimiter pattern lines show another variant I tried, which also fails.

    public static void main(String[] args) throws FileNotFoundException {
        File src = new File("ModernChineseCharacterFrequencyList.html");
        //Pattern frequencyDelimitingPattern = Pattern.compile("<br>|<pre>|</pre>");
        Scanner scanner;
        String line;
        //scanner = new Scanner(src, "GB2312"); // does NOT work
        scanner = new Scanner(new FileInputStream(src), "GB2312"); // does work
        //scanner.useDelimiter(frequencyDelimitingPattern);
        while (scanner.hasNext()) {
            line = scanner.next();
            System.out.println(line);
        }
    }

Is this a bug or expected behavior?

UPDATE

When the code works, it simply reads all the tokens to the end. When it does NOT work, it stops reading roughly in the middle of the file, with no exception or error message.
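A silent stop like this can be investigated through Scanner.ioException(). The Scanner documentation states that an IOException thrown by the underlying source is not propagated: the scanner treats it as end of input and keeps the exception for later inspection. The sketch below runs both construction paths over the same file and prints whatever was swallowed. The class and method names (ScannerDiagnostics, drain) are made up for illustration, and whether the File-based path actually records an exception for this particular file is an assumption to verify, not a confirmed diagnosis.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;

// Hypothetical helper, not part of the original question.
class ScannerDiagnostics {

    /** Reads every token, then reports any IOException the Scanner swallowed. */
    static int drain(Scanner scanner, String label) {
        int tokens = 0;
        try {
            while (scanner.hasNext()) {
                scanner.next();
                tokens++;
            }
            // Must be queried before close(); null means no I/O error was recorded.
            IOException swallowed = scanner.ioException();
            System.out.println(label + ": " + tokens + " tokens, ioException = " + swallowed);
        } finally {
            scanner.close();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // File name taken from the question; pass another path as the first argument.
        File src = new File(args.length > 0 ? args[0] : "ModernChineseCharacterFrequencyList.html");
        drain(new Scanner(src, "GB2312"), "Scanner(File, charset)");
        drain(new Scanner(new FileInputStream(src), "GB2312"), "Scanner(InputStream, charset)");
    }
}
```

If the first line reports fewer tokens together with a non-null exception, the exception type should indicate why reading stopped early.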

I found nothing special in the file at the point where reading stops. There are also no "magic" numbers involved, such as 2^32.
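To check the spot where reading stops more systematically, one option is to decode the raw bytes with a strict CharsetDecoder and note the offset of the first sequence it rejects. This is only a sketch under the assumption that a decoding problem is involved, which the question does not confirm. MalformedFinder and firstBadOffset are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the original question.
class MalformedFinder {

    /**
     * Returns the byte offset of the first sequence the charset cannot decode,
     * or -1 if the whole array decodes cleanly.
     */
    static int firstBadOffset(byte[] data, String charsetName) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // input position is left at the offending bytes
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: discard decoded characters and continue
        }
    }
}
```

For the file in the question, the byte array could come from Files.readAllBytes(Paths.get("ModernChineseCharacterFrequencyList.html")) on Java 7 or later. An offset other than -1 can then be compared with the place where the Scanner stops.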

UPDATE 2

This behavior was originally discovered on Windows with Sun JavaSE 1.6.

The same behavior is now also observed on Ubuntu with OpenJDK 1.6.0_23.

1 answer

I can't test my answer right now, but the JDK 6 documentation lists different canonical names for encodings depending on the API you use: java.io or java.nio.

JDK 6 Supported Encodings

Perhaps instead of "GB2312" you should use "EUC_CN", which is the canonical name the documentation lists for the java.io API.
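A quick way to see how a given JDK treats the two names is to ask java.nio.charset.Charset directly. The sketch below prints, for each name, whether it is supported, which canonical name it resolves to, and which aliases that charset carries. CharsetNames and describe are hypothetical names, and the output depends on the JDK in use, so none is shown here.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original answer.
class CharsetNames {

    /** Describes how the running JDK resolves a charset name. */
    static String describe(String name) {
        if (!Charset.isSupported(name)) {
            return name + " -> not supported";
        }
        Charset cs = Charset.forName(name);
        return name + " -> canonical " + cs.name() + ", aliases " + cs.aliases();
    }

    public static void main(String[] args) {
        System.out.println(describe("GB2312"));
        System.out.println(describe("EUC_CN"));
    }
}
```

If both names resolve to the same canonical charset, switching the name alone is unlikely to change how the bytes are decoded, and the difference between the two Scanner constructors would have to lie elsewhere.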


Source: https://habr.com/ru/post/1389462/

