My system: Windows 7 + R 3.0.2.
> Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
There are two files with the same contents, both saved from Microsoft Notepad: one in ANSI format, the other in UTF-8 format. The data are the names of passengers who died on Malaysia Airlines flight MH370. You can also create the files yourself as follows.
1) Copy the data into Microsoft Notepad.
乘客姓名,性别,出生日期
HuangTianhui,男,1948/05/28
姜翠云,女,1952/03/27
李红晶,女,1994/12/09
2) Save it as test.ansi with ANSI encoding in Notepad.
3) Save it as test.utf8 with UTF-8 encoding in Notepad.
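For anyone without Notepad at hand, the three steps above can be approximated in Python. This is only a sketch under two assumptions: that Notepad's "ANSI" on this locale means code page 936 (as the `.936` suffix in the locale string suggests), and that Notepad's "UTF-8" writes a leading byte order mark (the `\ufeff` visible in the Python output later in this post points that way). The helper name `write_test_files` is made up for illustration.

```python
import os
import tempfile

# The four lines from step 1, as typed into Notepad.
ROWS = [
    "乘客姓名,性别,出生日期",
    "HuangTianhui,男,1948/05/28",
    "姜翠云,女,1952/03/27",
    "李红晶,女,1994/12/09",
]


def write_test_files(directory):
    """Write test.ansi (assumed code page 936) and test.utf8 (UTF-8 with BOM)."""
    text = "\r\n".join(ROWS)  # Notepad separates lines with CRLF
    ansi_path = os.path.join(directory, "test.ansi")
    utf8_path = os.path.join(directory, "test.utf8")
    # newline="" keeps Python from translating the line endings written by hand.
    with open(ansi_path, "w", encoding="cp936", newline="") as f:
        f.write(text)
    # "utf-8-sig" prepends the BOM, mimicking what Notepad is assumed to do.
    with open(utf8_path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)
    return ansi_path, utf8_path


if __name__ == "__main__":
    ansi_path, utf8_path = write_test_files(tempfile.mkdtemp())
    with open(utf8_path, "rb") as f:
        print(f.read(3))  # first three bytes of the file, i.e. the BOM
```

Comparing the first bytes of the two files is a quick way to confirm that only test.utf8 starts with a BOM.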
read.table("test.ansi",sep=",",header=TRUE) # works fine
read.table("test.utf8",sep=",",header=TRUE) # does not work
Then I set the encoding to utf-8.
options(encoding="utf-8")
read.table("test.utf8",sep=",",header=TRUE,encoding="utf-8")
In read.table("test.utf8", sep = ",",header=TRUE,encoding = "utf-8") :
  invalid input found on input connection 'test.utf8'
How can I read the data file (test.utf8)?
In Python, it is this simple:
rfile=open("g:\\test.utf8","r",encoding="utf-8").read()
rfile
'\ufeff乘客姓名,性别,出生日期\n\nHuangTianhui,男,1948/05/28\n\n姜翠云,女,1952/03/27\n\n李红晶,女,1994/12/09'
rfile.replace("\n\n","\n").replace("\ufeff","").splitlines()
['乘客姓名,性别,出生日期', 'HuangTianhui,男,1948/05/28', '姜翠云,女,1952/03/27', '李红晶,女,1994/12/09']
Python can do this job better than R.
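The two `replace` calls in the Python session can probably be avoided. As a rough sketch (the function name `read_rows` is hypothetical, not something from the session above), the `utf-8-sig` codec drops a leading BOM on its own, and the `csv` module splits the fields while blank lines are filtered out:

```python
import csv


def read_rows(path):
    """Return the non-empty CSV rows of a UTF-8 file, ignoring a leading BOM."""
    # "utf-8-sig" strips the BOM if one is present; newline="" is the
    # setting the csv module documentation recommends for file objects.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        # csv.reader yields an empty list for a blank line, so skip those.
        return [row for row in csv.reader(f) if row]
```

Calling `read_rows("g:\\test.utf8")` should then give one list of three strings per line of data, with the header as the first row.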
I did as Safish suggested. The problem is partly solved, but part of it remains. I found that when the data is printed as a whole data.frame, it is not displayed correctly;
when a single column of the data.frame is printed, it may display correctly;
and, rather strangely, when the data is a data.frame string, it is not displayed correctly.

