I have a file encoded in GB2312 (Chinese). It was downloaded from http://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO with wget under Windows and saved as ModernChineseCharacterFrequencyList.html.
The code below shows that Java fails to read the file to the end when the Scanner is created one way, and succeeds when it is created another way.
Namely, if the Scanner is created with scanner = new Scanner(src, "GB2312"), the code does not work. If it is created with scanner = new Scanner(new FileInputStream(src), "GB2312"), it works.
The commented-out delimiter pattern lines show one more variant I tried; it fails in the same way.
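    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.util.Scanner;
    import java.util.regex.Pattern;

    // The class name is arbitrary; the original post showed only main().
    public class ReadGb2312File {
        public static void main(String[] args) throws FileNotFoundException {
            File src = new File("ModernChineseCharacterFrequencyList.html");
            //Pattern frequencyDelimitingPattern = Pattern.compile("<br>|<pre>|</pre>");
            Scanner scanner;
            String line;

            //scanner = new Scanner(src, "GB2312"); // does NOT work
            scanner = new Scanner(new FileInputStream(src), "GB2312"); // does work
            //scanner.useDelimiter(frequencyDelimitingPattern);

            // Print every token until the scanner reports no more input.
            while (scanner.hasNext()) {
                line = scanner.next();
                System.out.println(line);
            }
        }
    }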
Is this a bug or expected behavior?
UPDATE
When the code works, it simply reads all the tokens to the end. When it does NOT work, it stops reading roughly in the middle of the file, with no exception or error message.
I found nothing unusual in the file at the point where reading stops. The position also does not correspond to any "magic" number, such as 2^32.
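One way to look for a hidden error: the Scanner documentation states that if the underlying source throws an IOException, the scanner treats it as end of input, and that the most recent such exception can be retrieved with ioException(). The sketch below drains both scanner variants and prints whatever ioException() returns. The class name ScannerDiagnostics and the helpers drain and report are hypothetical names introduced for illustration, and what the program prints for this particular file is not something I can state in advance.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;

public class ScannerDiagnostics {

    // Hypothetical helper: consumes every token and returns how many were read.
    public static int drain(Scanner scanner) {
        int tokens = 0;
        while (scanner.hasNext()) {
            scanner.next();
            tokens++;
        }
        return tokens;
    }

    // Hypothetical helper: drains the scanner, then prints the token count
    // and the last IOException the scanner swallowed (null if there was none).
    public static void report(String label, Scanner scanner) {
        int tokens = drain(scanner);
        IOException swallowed = scanner.ioException();
        System.out.println(label + ": " + tokens + " tokens, ioException = " + swallowed);
        scanner.close();
    }

    public static void main(String[] args) throws FileNotFoundException {
        // Assumes the downloaded file is in the working directory.
        File src = new File("ModernChineseCharacterFrequencyList.html");
        report("Scanner(File, charset)", new Scanner(src, "GB2312"));
        report("Scanner(InputStream, charset)", new Scanner(new FileInputStream(src), "GB2312"));
    }
}
```

If the File variant prints a non-null exception while the InputStream variant prints null, that would suggest the early stop is a swallowed read or decoding error rather than a genuine end of input.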
UPDATE 2
This behavior was originally discovered on Windows with Sun JavaSE 1.6.
The same behavior is now also observed on Ubuntu with OpenJDK 1.6.0_23.