Hebrew Encoding Hell in R and Writing UTF-8 Table in Windows

Question

I am trying to save data extracted with RSelenium from https://www.magna.isa.gov.il/Details.aspx?l=he . R prints the Hebrew characters correctly on the console, but they are corrupted when I export to TXT or CSV, or when I pass the data through other simple R functions such as data.frame(), readHTMLTable(), etc.

Here is an example.

    > head(lines)
    [1] "גלובל פיננס ג'י.אר. 2 בע\"מ נתונים כספיים באלפי דולר ארה\"ב"
    [2] "513435404"
    [3] ""
    [4] ""
    [5] ""
    [6] "4,481"

The first line changes to strange characters (below) when using data.frame():

    > head(as.data.frame(lines))
    [1] <U+05D2><U+05DC><U+05D5><U+05D1><U+05DC> <U+05E4><U+05D9><U+05E0><U+05E0><U+05E1> <U+05D2>'<U+05D9>.<U+05D0><U+05E8>. 2 <U+05D1><U+05E2>"<U+05DE> <U+05E0><U+05EA><U+05D5><U+05E0><U+05D9><U+05DD> <U+05DB><U+05E1><U+05E4><U+05D9><U+05D9><U+05DD> <U+05D1><U+05D0><U+05DC><U+05E4><U+05D9> <U+05D3><U+05D5><U+05DC><U+05E8> <U+05D0><U+05E8><U+05D4>"<U+05D1>

The same thing happens when exporting to .TXT or .CSV with write.table() or write.csv():

 write.csv(lines,"lines.csv",row.names=FALSE)

I tried changing the encoding to "UTF-8", as suggested in several similar questions, but the problem persists in a different form:

    iconv(lines, to = "UTF-8")
    1 ׳'׳׳•׳'׳ ׳₪׳™׳ ׳ ׳¡ ׳''׳™.׳׳¨. 2 ׳'׳¢"׳ ׳ ׳×׳•׳ ׳™׳ ׳›׳¡׳₪׳™׳™׳ ׳'׳׳׳₪׳™ ׳"׳•׳׳¨ ׳׳¨׳""׳'

The same for Hebrew ISO-8859-8:

    iconv(lines, to = "ISO-8859-8")
    1 ×'×o×.×'×o ×₪×T× × ×! ×''×T.××¨. 2 ×'×¢"×z × ×a×.× ×T× ×>×!×₪×T×T× ×'××o×₪×T ×"×.×o×¨ ××¨×""×'

I don't understand why the console prints the Hebrew characters correctly, while write.table(), write.csv() and data.frame() show encoding problems.

Can anyone help me export it?

Update: following Ken's answer below, exporting the text with writeLines() worked well:

    f = file("lines.txt", open = "wt", encoding = "UTF-8")
    writeLines(lines, "lines.txt", useBytes = TRUE)
    close(f)

However, the main problem remains: Hebrew in R tables, as produced by as.data.frame(), write.table() and write.csv(). Any thoughts?

Some information about the machine:

    > Sys.info()
    sysname      release    version
    "Windows"    "7 x64"    "build 7601, Service Pack 1"
    nodename     machine    login
    "TALIS-TP"   "x86"
    > Sys.getlocale()
    [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

text encoding r hebrew

Daniel Rabetti Apr 26 '16 at 12:38

1 answer

Ken Benoit · Accepted Answer · 2016-04-26T15:15:25+0000

Many people have similar problems with UTF-8 text on platforms with 8-bit system encodings (Windows). Encoding in R can be complicated, because different methods handle encoding and conversions differently, and what seems to work fine on one platform (OS X or Linux) does not work well on another.

The problem is with your output connection and the way Windows handles encodings and text connections. I tried to reproduce the problem using some Hebrew texts in both UTF-8 and an 8-bit encoding. We will also look at file reading issues, since they may be contributing to the confusion.

For tests

A short Hebrew text file encoded as UTF-8: hebrew-utf8.txt
A short Hebrew text file encoded as ISO-8859-8: hebrew-iso-8859-8.txt . (Note: you may need to tell your browser the encoding to view this file properly; this is the case for Safari, for example.)

Ways to read files

Now for the experiments. I am using Windows 7 for these tests (everything works fine on OS X, my usual OS).

    lines <- readLines("http://kenbenoit.net/files/hebrew-utf8.txt")
    lines
    ## [1] "×"×¢×'×¨×™ ×"×•× ×—×'×¨ ×'×§×'×•×¦×" ×"×›× ×¢× ×™×ª ×©×œ ×©×¤×•×ª ×©×ž×™×•×ª."
    ## [2] "×–×• ×"×™×ª×" ×©×¤×ª× ×©×œ ×"×™×"×•×"×™× ×ž×•×§×"×, ××'×œ ×ž×Ÿ 586 ×œ×¤× ×"\"×¡ ×–×" ×"×ª×—×™×œ ×œ×"×™×•×ª ×ž×•×—×œ×£ ×¢×œ ×™×"×™ ×'××¨×ž×™×ª."

This failed because R assumed that the text was in your system encoding, Windows-1252. But because no conversion takes place when the file is read, you can fix this simply by setting the encoding bit to UTF-8:

    # this sets the bit for UTF-8
    Encoding(lines) <- "UTF-8"
    lines
    ## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
    ## [2] "זו היתה שפתם של היהודים מוקדם, אבל מן 586 לפנה\"ס זה התחיל להיות מוחלף על ידי בארמית."
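
As a minimal sketch of what "setting the bit" means (base R only, with illustrative variable names that are not from the original answer): Encoding() changes the declared encoding of a string but leaves its bytes untouched. The string is built from raw bytes so that the example does not depend on the encoding of the script file.

```r
# UTF-8 bytes of the Hebrew letter he (U+05D4), built from raw bytes
x <- rawToChar(as.raw(c(0xd7, 0x94)))
Encoding(x)                      # "unknown": R will assume the native encoding

bytes_before <- charToRaw(x)
Encoding(x) <- "UTF-8"           # declare the bytes to be UTF-8
bytes_after <- charToRaw(x)

# Only the mark changed; the underlying bytes are the same
identical(bytes_before, bytes_after)

# Decoded as UTF-8, the two bytes are the single code point U+05D4
utf8ToInt(x) == 0x05D4
```

This is why the fix works only when the bytes already are UTF-8, as they were for lines.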

But it is better to do this when you read the file:

    # this does it in one pass
    lines2 <- readLines("http://kenbenoit.net/files/hebrew-utf8.txt", encoding = "UTF-8")
    lines2[1]
    ## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
    Encoding(lines2)
    ## [1] "UTF-8" "UTF-8"

Now let's see what happens if we try to read the same text when it is encoded in the 8-bit ISO Hebrew code page.

    lines3 <- readLines("http://kenbenoit.net/files/hebrew-iso-8859-8.txt")
    lines3[1]
    ## [1] "äòáøé äåà çáø á÷áåöä äëðòðéú ùì ùôåú ùîéåú."

Setting the encoding bit will not help here, because the bytes that were read are not the UTF-8 representation of the Hebrew Unicode code points, and Encoding() does not actually convert anything: it only sets a flag that tells R which of a few possible encodings to assume for the string. Adding an encoding argument to the readLines() call would not help either, since that argument only marks strings (as Latin-1 or UTF-8) and never re-encodes them. Instead, we can convert the text after loading using iconv():

    # this will not fix things
    Encoding(lines3) <- "UTF-8"
    lines3[1]
    ## [1] "\xe4\xf2\xe1\xf8\xe9 \xe4\xe5\xe0 \xe7\xe1\xf8 \xe1\xf7\xe1\xe5\xf6\xe4 \xe4\xeb\xf0\xf2\xf0\xe9\xfa \xf9\xec \xf9\xf4\xe5\xfa \xf9\xee\xe9\xe5\xfa."

    # but this will
    iconv(lines3, "ISO-8859-8", "UTF-8")[1]
    ## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
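
The iconv() step can be wrapped in a small helper that fails loudly when a conversion does not succeed. This is only a sketch: to_utf8 is a hypothetical name, not a base R function, and it assumes that the platform's iconv supports ISO-8859-8 (iconvlist() shows what is available).

```r
# Convert text from a known 8-bit encoding to UTF-8.
# `to_utf8` is a hypothetical helper name, not part of base R.
to_utf8 <- function(x, from) {
  out <- iconv(x, from = from, to = "UTF-8")
  # iconv() returns NA for strings it cannot convert
  if (anyNA(out[!is.na(x)])) {
    stop("some strings could not be converted from ", from)
  }
  out
}

# The first word of lines3[1] above, as ISO-8859-8 bytes
iso <- rawToChar(as.raw(c(0xe4, 0xf2, 0xe1, 0xf8, 0xe9)))
heb <- to_utf8(iso, "ISO-8859-8")

Encoding(heb)    # the converted string is marked as UTF-8
utf8ToInt(heb)   # code points U+05D4 U+05E2 U+05D1 U+05E8 U+05D9
```

Unlike Encoding(), this changes the bytes themselves, so the result is safe to pass on to the output steps below.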

In general, I think the method used above for lines2 is the best approach.

How to output files while maintaining the encoding

Now to your question about how to write this out: the safest way is to manage your connection at a low level, where you can specify the encoding. Otherwise R on Windows defaults to the system encoding, and the UTF-8 is lost. I thought the following would work, and it does work absolutely fine on OS X. On OS X the simpler writeLines() call, which just names the text file without an explicit connection, also works.

    ## to write lines, use the encoding option of a connection object
    f <- file("hebrew-output-UTF-8.txt", open = "wt", encoding = "UTF-8")
    writeLines(lines2, f)
    close(f)

But it does not work on Windows. You can see the Windows 7 results here: hebrew-output-UTF-8-file_encoding.txt .

So, here is how to do it on Windows : once you are sure that your text is encoded as UTF-8, just write it as raw bytes without using any encoding, for example:

 writeLines(lines2, "hebrew-output-UTF-8-useBytesTRUE.txt", useBytes = TRUE)

You can see the results in hebrew-output-UTF-8-useBytesTRUE.txt , which is now UTF-8 and looks right.
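
One way to check that the bytes on disk really are the UTF-8 bytes you intended is to read the file back as raw bytes. The sketch below uses a temporary file and a short made-up string (the word "שלום", built from raw bytes); neither comes from the original answer. The enc2utf8() call is an extra precaution that converts strings in other known encodings to UTF-8 before writing.

```r
# UTF-8 bytes of a short Hebrew word, built from raw bytes so the example
# does not depend on the encoding of this script
txt <- rawToChar(as.raw(c(0xd7, 0xa9, 0xd7, 0x9c, 0xd7, 0x95, 0xd7, 0x9d)))
Encoding(txt) <- "UTF-8"

path <- tempfile(fileext = ".txt")   # illustrative output location
writeLines(enc2utf8(txt), path, useBytes = TRUE)

# Read the file back as raw bytes and drop the line terminator (CR and/or LF)
written <- readBin(path, what = "raw", n = file.size(path))
written <- written[!written %in% as.raw(c(0x0d, 0x0a))]

# The file should contain exactly the UTF-8 bytes of the string
identical(written, charToRaw(txt))
```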

Added for write.csv

Please note that the only reason you would want to do this is to make the CSV file available for import into other software such as Excel. (And good luck working with UTF-8 in Excel on Windows ...) Otherwise, you should just save the data frame as binary using save(myDataFrame, file = "myDataFrame.RData") . But if you really need to output .csv, then:

How to write UTF-8 .csv files from a `data.table` in Windows

The problem with writing UTF-8 files using write.table() and write.csv() is that these functions open text connections, and Windows has limitations on encodings in text connections when it comes to UTF-8. ( This post offers a useful explanation.) Following the SO answer posted here , we can get around this by writing our own function to output UTF-8 .csv files.

This assumes that you have already set Encoding() to "UTF-8" for any character elements (which is what happened on import above for lines2 ).

    df <- data.frame(int = 1:2, text = lines2, stringsAsFactors = FALSE)

    write_utf8_csv <- function(df, file) {
      # header row: quoted column names
      firstline <- paste('"', names(df), '"', sep = "", collapse = " , ")
      # one line per row, each field wrapped in double quotes
      data <- apply(df, 1, function(x) {
        paste('"', x, '"', sep = "", collapse = " , ")
      })
      # write the bytes as they are, so Windows does not re-encode the UTF-8
      writeLines(c(firstline, data), file, useBytes = TRUE)
    }

    write_utf8_csv(df, "df_csv.txt")

When we now look at this file on an OS that handles Unicode well (OS X here), it looks fine:

    KBsMBP15-2:Desktop kbenoit$ cat df_csv.txt
    "int" , "text"
    "1" , "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
    "2" , "זו היתה שפתם של היהודים מוקדם, אבל מן 586 לפנה"ס זה התחיל להיות מוחלף על ידי בארמית."
    KBsMBP15-2:Desktop kbenoit$ file df_csv.txt
    df_csv.txt: UTF-8 Unicode text, with CRLF line terminators
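
Note that the second data row above contains an unescaped double quote inside a quoted field, which stricter CSV readers may reject. A possible variant, sketched here under the assumption that the consumer expects the usual CSV convention of doubling embedded quotes, is shown below. The name write_utf8_csv_quoted and the demo data are hypothetical and not part of the original answer.

```r
# Variant of the writer above that doubles embedded quotes.
# `write_utf8_csv_quoted` is a hypothetical name for this sketch.
write_utf8_csv_quoted <- function(df, file) {
  quote_field <- function(x) {
    x <- enc2utf8(as.character(x))                    # make sure text is UTF-8
    paste0('"', gsub('"', '""', x, fixed = TRUE), '"')  # double embedded quotes
  }
  header <- paste(quote_field(names(df)), collapse = ",")
  cols <- lapply(df, quote_field)                     # quote column by column
  rows <- do.call(paste, c(unname(cols), sep = ","))  # join fields row-wise
  # write the bytes unchanged so Windows does not re-encode them
  writeLines(c(header, rows), file, useBytes = TRUE)
}

# Demo data with an embedded quote and comma (ASCII only, for portability)
df_demo <- data.frame(int = 1:2,
                      text = c("plain", 'with "quotes", and a comma'),
                      stringsAsFactors = FALSE)
out <- tempfile(fileext = ".csv")
write_utf8_csv_quoted(df_demo, out)

# Read it back, marking strings as UTF-8, and compare with the original
back <- read.csv(out, encoding = "UTF-8", stringsAsFactors = FALSE)
identical(back$text, df_demo$text)
```

Missing values are written as the quoted string "NA" in this sketch, so a real export may need to handle them separately.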
