R: Unable to read text files in Unicode even when specifying an encoding

I am using R 3.1.1 on 32-bit Windows 7. I am having trouble reading some text files on which I want to perform text analysis. According to Notepad++, the files are encoded as "UCS-2 Little Endian". (grepWin, a tool whose name says it all, reports the file as "Unicode".)

The problem is that I cannot read the file, even when I specify its encoding. (The characters are from the standard Spanish Latin set and should be easily handled by CP1252 or something similar.)

> Sys.getlocale()
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
> readLines("filename.txt")
[1] "ÿþE" "" "" "" "" ...
> readLines("filename.txt",encoding="UTF-8")
[1] "\xff\xfeE" "" "" "" "" ...
> readLines("filename.txt",encoding="UCS2LE")
[1] "ÿþE" "" "" "" "" "" "" ...
> readLines("filename.txt",encoding="UCS2")
[1] "ÿþE" "" "" "" "" ...

Any ideas?

Thanks!!


Edit: passing "UTF-16", "UTF-16LE" or "UTF-16BE" as the encoding fails in the same way.

1 answer

After a more detailed study of the documentation, I found the answer to my question.

The encoding parameter of readLines applies only to the input strings; it does not re-encode them. The documentation states:

encoding: encoding to be assumed for input strings. It is used to mark character strings as known to be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the latter, specify the encoding as part of the connection con or via options(encoding=): see the examples. See also "Details".
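To illustrate the difference between marking a string and re-encoding it, here is a minimal sketch. It does not call readLines at all; it uses Encoding() and iconv() on a hand-built string, on the assumption that the encoding argument of readLines marks strings the same way Encoding<- does. The variable names are only illustrative.

```r
# The two bytes c3 a9 are "é" in UTF-8.
x <- rawToChar(as.raw(c(0xc3, 0xa9)))

# Marking: declares what the bytes mean, but leaves them untouched.
y <- x
Encoding(y) <- "UTF-8"
same_bytes <- identical(charToRaw(x), charToRaw(y))   # bytes are unchanged

# Re-encoding: actually converts the bytes to another encoding.
z <- iconv(y, from = "UTF-8", to = "latin1")
latin1_bytes <- charToRaw(z)                          # single byte e9
```

Marking alone can never turn UCS-2 bytes into readable text, which is why the calls in the question keep returning "ÿþE".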

The correct way to read a file with an unusual encoding is to declare the encoding on the connection:

filetext <- readLines(con <- file("UnicodeFile.txt", encoding = "UCS-2LE"))
close(con)
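As a rough way to check this without the original file, the sketch below writes a small UCS-2LE file with a byte order mark to a temporary location and reads it back both ways. The file contents ("Hola") and the variable names are made up for illustration, and the text is kept to ASCII so the result does not depend on the session locale.

```r
# Build a tiny "Windows Unicode" file: BOM (ff fe) + "Hola\n" in UCS-2LE.
tmp <- tempfile(fileext = ".txt")
bytes <- as.raw(c(0xff, 0xfe,              # byte order mark
                  0x48, 0x00, 0x6f, 0x00,  # H o
                  0x6c, 0x00, 0x61, 0x00,  # l a
                  0x0a, 0x00))             # newline
writeBin(bytes, tmp)

# Naive read: the encoding argument only marks strings, so the raw
# bytes come through undecoded (R also warns about embedded nuls).
naive <- suppressWarnings(readLines(tmp, encoding = "UTF-8"))

# Declaring the encoding on the connection re-encodes the input on read.
con <- file(tmp, encoding = "UCS-2LE")
txt <- readLines(con)
close(con)

unlink(tmp)
```

Here txt should hold the decoded line, while naive still holds the BOM bytes followed by a truncated first character, much like the output in the question.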

Source: https://habr.com/ru/post/1204449/

