Encoding and raw in R

Question

I am not sure whether this is a bug. If I encode a character string as UTF-8, convert it to raw and then back to character, the result no longer matches the original string. I have set the default encoding to UTF-8 in RStudio.

    rawToChar(charToRaw(enc2utf8("vægt")))
    [1] "vÃ¦gt"
    rawToChar(charToRaw("vægt"))
    [1] "vægt"

Here is my sessionInfo():

    R version 3.2.2 (2015-08-14)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1

    locale:
    [1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252
    [4] LC_NUMERIC=C                    LC_TIME=Danish_Denmark.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] ggthemes_2.2.1 TTR_0.23-0 lubridate_1.3.3 tidyr_0.2.0 skm_1.0.2 ggplot2_1.0.1 dplyr_0.4.3
    [8] stringr_1.0.0 dkstat_0.08

    loaded via a namespace (and not attached):
     [1] Rcpp_0.12.1 rstudioapi_0.3.1 magrittr_1.5 MASS_7.3-43 munsell_0.4.2 lattice_0.20-33
     [7] colorspace_1.2-6 R6_2.1.1 httr_1.0.0 plyr_1.8.3 xts_0.9-7 tools_3.2.2
    [13] parallel_3.2.2 grid_3.2.2 gtable_0.1.2 DBI_0.3.1 lazyeval_0.1.10 assertthat_0.1
    [19] digest_0.6.8 reshape2_1.4.1 curl_0.9.3 memoise_0.2.1 labeling_0.3 stringi_0.5-5
    [25] scales_0.3.0 jsonlite_0.9.17 zoo_1.7-12 proto_0.3-10

r character-encoding

Kero Oct 11 '15 at 17:37

1 answer

Whiteviking · Accepted Answer · 2015-10-11T18:59:44+0000

Here is my basic understanding of what is happening.

First, some encoding facts (bytes in hexadecimal):

    Character   UTF-8    CP1252
    v           76       76
    æ           c3 a6    e6
    g           67       67
    t           74       74
    Ã           c3 83    c3
    ¦           c2 a6    a6
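The byte values above can be checked from R itself. The following is only a sketch: hex_bytes is a hypothetical helper name, the characters are written as \u escapes so that the encoding of the script file does not matter, and it assumes the platform's iconv accepts the name "CP1252".

```r
# Characters from the table, written as Unicode escapes so the result
# does not depend on how the script file itself is encoded.
chars <- c("v", "\u00e6", "g", "t", "\u00c3", "\u00a6")

# hex_bytes() is a hypothetical helper: it converts one UTF-8 string to
# the encoding `to` and returns its bytes as space-separated hex.
hex_bytes <- function(s, to) {
  bytes <- charToRaw(iconv(s, from = "UTF-8", to = to))
  paste(as.character(bytes), collapse = " ")
}

byte_table <- data.frame(
  char   = chars,
  utf8   = vapply(chars, hex_bytes, character(1), to = "UTF-8",
                  USE.NAMES = FALSE),
  cp1252 = vapply(chars, hex_bytes, character(1), to = "CP1252",
                  USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

print(byte_table)
```

The ASCII letters occupy the same single byte in both encodings, while æ takes two bytes in UTF-8 and one in CP1252.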

Now the mechanics:

The Windows machine uses CP1252, as the locale in the sessionInfo() output shows. The string vægt in the R script is therefore stored as the bytes 76 e6 67 74, which charToRaw("vægt") confirms. Converting the string to UTF-8 with enc2utf8() gives the bytes 76 c3 a6 67 74. charToRaw() returns only the bytes, so the information that they are UTF-8 is lost. rawToChar() then turns these bytes back into a string without any encoding mark, and R again assumes the native encoding, CP1252. Since c3 is Ã and a6 is ¦ in CP1252, we get vÃ¦gt.
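The round trip can be reproduced on any platform by making the CP1252 interpretation explicit. This is a minimal sketch, not part of the original answer: it assumes the platform's iconv accepts the name "CP1252", and the variable names are purely illustrative. It also shows one way to keep the string intact, namely declaring the encoding with Encoding() after rawToChar().

```r
x <- "v\u00e6gt"   # "vægt", marked as UTF-8 because of the \u escape

utf8_bytes <- charToRaw(enc2utf8(x))   # bytes 76 c3 a6 67 74

# rawToChar() returns a string with no encoding mark, so R treats the
# bytes as native-encoded. To imitate a CP1252 session on any platform,
# read the bytes explicitly as CP1252:
misread <- iconv(rawToChar(utf8_bytes), from = "CP1252", to = "UTF-8")

# Declaring the encoding after the conversion keeps the original string.
restored <- rawToChar(utf8_bytes)
Encoding(restored) <- "UTF-8"

print(misread)                  # the garbled string from the question
print(identical(restored, x))   # whether the round trip was preserved
```

On a machine whose native encoding is already UTF-8 the Encoding() step changes nothing visible, which matches the observation that the problem only shows up under CP1252.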

On Mac and Linux, on the other hand, the default encoding is UTF-8, so the bytes are written and read with the same encoding and no mismatch appears. I suspect, however, that the same phenomenon could be triggered there by explicitly changing the encoding that R uses.

I do not think this is a bug.
