Reading a UTF-8 text file (in Hebrew) displays gibberish in the RStudio console but displays fine in RGui

Question


I am trying to figure out if this is a bug in RStudio or if I am missing something.

I read a CSV file into R. When I print it to the console in RStudio, I get gibberish (unless I look at a specific vector), whereas in Rgui it displays fine.

The code I ran is as follows:

    Sys.setlocale("LC_ALL", "Hebrew")
    x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
    x # shows gibberish
    x[,2]
    colnames(x)

Here is the output in RStudio (gibberish):

    > x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
    > x
       âéì..áùðéí. îéâãø
    1         23.0   æëø
    2         24.0  ð÷áä
    3         23.0  ð÷áä
    4         24.0  ð÷áä
    5         25.0   æëø
    6         18.0   æëø
    7         26.0   æëø
    8         21.5  ð÷áä
    9         24.0   æëø
    10        26.0   æëø
    11        24.0   æëø
    12        19.0  ð÷áä
    13        19.0  ð÷áä
    14        24.5   æëø
    15        21.0  ð÷áä
    > x[,2]
     [1] זכר נקבה נקבה נקבה זכר זכר זכר נקבה זכר זכר זכר נקבה נקבה זכר נקבה
    Levels: זכר נקבה
    > colnames(x)
    [1] "âéì..áùðéí." "îéâãø"

And here it is in Rgui (where it displays correctly):

    > x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
    > x # shows gibberish
       גיל..בשנים. מיגדר
    1         23.0   זכר
    2         24.0  נקבה
    3         23.0  נקבה
    4         24.0  נקבה
    5         25.0   זכר
    6         18.0   זכר
    7         26.0   זכר
    8         21.5  נקבה
    9         24.0   זכר
    10        26.0   זכר
    11        24.0   זכר
    12        19.0  נקבה
    13        19.0  נקבה
    14        24.5   זכר
    15        21.0  נקבה
    > x[,2]
     [1] זכר נקבה נקבה נקבה זכר זכר זכר נקבה זכר זכר זכר נקבה נקבה זכר נקבה
    Levels: זכר נקבה
    > colnames(x)
    [1] "גיל..בשנים." "מיגדר"

My sessionInfo() is the same in both sessions:

    > sessionInfo()
    R version 3.2.3 (2015-12-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1

    locale:
    [1] LC_COLLATE=Hebrew_Israel.1255  LC_CTYPE=Hebrew_Israel.1255
    [3] LC_MONETARY=Hebrew_Israel.1255 LC_NUMERIC=C
    [5] LC_TIME=Hebrew_Israel.1255

    attached base packages:
    [1] stats     graphics  grDevices datasets  utils     methods   base

    other attached packages:
    [1] installr_0.17.0

I am using the latest version of RStudio, 0.99.892.

Thanks.


r csv utf-8 rstudio hebrew

Tal galili Mar 13 '16 at 17:40


1 answer

dof1985 · Answer 1 · 2016-08-02T21:42:57+0000

This is a bug in RStudio, and not the only one. I saw you got a general answer about RStudio's problems: it currently lacks proper support for non-English locales on Windows. As far as I know, this is not the first version with similar problems. You may also run into some new problems that I think are related to Windows 10. Note that, because of other problems I have, I use the English locale to print Hebrew.

So I tried to debug your problem and ran into some further issues, along with some new ideas (I think...) about where the problem lies. I think it could be debugged far enough to write a complete function that fixes it, but because of time constraints (and the late hour) I decided to stop here.

I created this data:

 x <- data.frame("x"= c("דור","dor"))

As mentioned, using the Hebrew locale I also get gibberish:

    > Sys.setlocale("LC_ALL", "Hebrew")
    [1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"
    > "דור"
    [1] "ãåø"
    > x
        x
    1 ãåø
    2 dor

Using the English locale, I get this output:

    > Sys.setlocale("LC_ALL", "English")
    [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
    > "דור"
    [1] "דור"
    > x
                             x
    1 <U+05D3><U+05D5><U+05E8>
    2                      dor

Note that output that is not a data.frame prints fine. The problem also occurs with the data.table class, while list and matrix objects print fine.

Inspecting print.data.frame and print.table reveals the main suspect: format.

Further testing confirms this suspicion:

    > as.matrix(x)
         x
    [1,] "דור"
    [2,] "dor"
    > format(as.matrix(x))
         x
    [1,] "<U+05D3><U+05D5><U+05E8>"
    [2,] "dor "
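
For anyone who wants to repeat this check on their own machine, here is a minimal sketch of the same comparison. It builds the test data with `\u` escapes so the script itself stays plain ASCII; the variable names (`heb`, `raw_cells`, `fmt_cells`, `cmp`) are only illustrative. On an unaffected setup both rows should compare as equal; on an affected one the Hebrew row is expected to differ, as in the output above.

```r
# Build the test data with \u escapes so the script does not depend on
# the encoding of the source file itself.
heb <- "\u05d3\u05d5\u05e8"   # the Hebrew string used in the answer
x <- data.frame(x = c(heb, "dor"), stringsAsFactors = FALSE)

# How R has marked each string internally.
print(Encoding(x$x))

# Cell values before and after format(); trimws() removes the padding
# that format() adds so the two can be compared directly.
raw_cells <- as.matrix(x)[, "x"]
fmt_cells <- trimws(format(as.matrix(x))[, "x"])

# One row per cell: FALSE means format() changed the text.
cmp <- data.frame(same = raw_cells == fmt_cells)
print(cmp)
```

If the Hebrew row comes back as `FALSE`, the text is being altered inside `format()` rather than by the console alone.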

As such, in your case, I suggest the following workaround:

    > Sys.setlocale("LC_ALL", "Hebrew")
    > x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
    > as.matrix(x)
          âéì..áùðéí. îéâãø
     [1,] "23.0"      "זכר"
     [2,] "24.0"      "נקבה"
     [3,] "23.0"      "נקבה"
     [4,] "24.0"      "נקבה"
     [5,] "25.0"      "זכר"
     [6,] "18.0"      "זכר"
     [7,] "26.0"      "זכר"
     [8,] "21.5"      "נקבה"
     [9,] "24.0"      "זכר"
    [10,] "26.0"      "זכר"
    [11,] "24.0"      "זכר"
    [12,] "19.0"      "נקבה"
    [13,] "19.0"      "נקבה"
    [14,] "24.5"      "זכר"
    [15,] "21.0"      "נקבה"
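
If you use this often, the workaround can be wrapped in a small helper. This is only a sketch: `print_via_matrix` is a hypothetical name of mine, not something from base R or RStudio, and it does nothing more than route printing through `as.matrix()` as above.

```r
# Hypothetical helper (not part of base R or RStudio): print a data frame
# through as.matrix(), which displayed the Hebrew cells correctly above.
print_via_matrix <- function(df) {
  m <- as.matrix(df)
  print(m)
  invisible(m)   # return the matrix without printing it a second time
}

# Small stand-in for the data read from the CSV file.
x <- data.frame(x = c("\u05d3\u05d5\u05e8", "dor"), stringsAsFactors = FALSE)
m <- print_via_matrix(x)
```

Keep in mind that `as.matrix()` turns every column into character when the data frame has mixed types, so this is meant for display only.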

Both locales, Hebrew and English, worked on my machine, but the column names did not display correctly in either.
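
One thing that may be worth checking for the column names is how R has marked their encoding after the read. The sketch below is an assumption on my part, not something I verified on the setup above: it writes a tiny UTF-8 file locally (so it does not need the network), reads it with `check.names = FALSE` so that `make.names()` does not rewrite the header, and then inspects the encoding marks. The file contents and variable names are made up for illustration.

```r
# Write a tiny UTF-8 CSV to a temporary file. useBytes = TRUE writes the
# UTF-8 bytes as they are, without re-encoding to the native locale.
lines <- c("\u05d2\u05d9\u05dc,\u05de\u05d9\u05d2\u05d3\u05e8",
           "23.0,\u05d6\u05db\u05e8",
           "24.0,\u05e0\u05e7\u05d1\u05d4")
tmp <- tempfile(fileext = ".txt")
writeLines(enc2utf8(lines), tmp, useBytes = TRUE)

# Read it back the same way as in the question, but leave the header alone.
a <- read.csv(tmp, encoding = "UTF-8", check.names = FALSE,
              stringsAsFactors = FALSE)

# Inspect the encoding marks of the header and of the text column.
print(Encoding(colnames(a)))
print(Encoding(a[[2]]))

unlink(tmp)
```

If the column names come back marked as "unknown" while the cells are marked "UTF-8", that would explain why the cells can be displayed and the header cannot.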

In conclusion, this is far from a complete solution, just a small and partial workaround for a printing (or rather, formatting) problem. It also sheds a bit more light on this Hebrew / non-English issue in RStudio, on which better solutions could be built. One example of a solution to a similar problem of writing Hebrew on Windows can be seen in this SO thread.
