How does R handle Unicode / UTF-8?

Question


If I write

    `Δ` <- function(a,b) (a-b)/a

then I can include U+394 as long as it is enclosed in backticks. (By contrast, Δ <- function(a,b) (a-b)/a fails with unexpected input in "\".) So, apparently, R parses UTF-8 or Unicode or something like that. The assignment goes well, and so does the evaluation of, for example,

    `Δ`(1:5, 9:13)

And I can also evaluate Δ(1:5, 9:13).

Finally, if I define something like winsorise <- function(x, λ=.05) { ... }, then λ (U+3bb) does not need to be "introduced" to R with a backtick. I can then call winsorise(data, .1) without any problems.

The only mention of Unicode that I can find in the R documentation is over my head. Could someone who understands this better explain to me what happens "under the hood" when R needs the ` to understand an assignment to ♔, but can parse ♔(a, b, c) once it has been assigned?


r unicode utf-8

isomorphismes Feb 12 '15 at 17:03


2 answers

drammock · Answer 1 · 2015-02-13T07:08:04+0000

I can't speak to what is happening under the hood regarding function names versus function arguments, but this 2008 email from Professor Ripley may shed some light (excerpt below):

R passes around, prints and plots UTF-8 character data pretty well, but it translates to the native encoding for almost all character-level manipulations (and not just on Windows). ?Encoding spells out the exceptions [...]

The reason that R does this translation (at least on Windows) is mentioned in the documentation that the OP linked to:

Windows has no UTF-8 locales, but rather expects to work with UCS-2 strings. R (being written in standard C) would not work internally with UCS-2 without extensive changes.
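To get a feel for what such a translation does to a character the target encoding cannot represent, here is a small sketch using iconv(). Using ASCII as a stand-in for a restrictive native encoding is my assumption for illustration only; it is not a claim about what R does internally on Windows.

```r
# "\u0394" is GREEK CAPITAL LETTER DELTA, held as a UTF-8 string.
delta <- "\u0394"

# ASCII has no delta, so the conversion cannot keep the character.
# With sub = "byte", iconv() writes each unconvertible byte as <xx>.
ascii_delta <- iconv(delta, from = "UTF-8", to = "ASCII", sub = "byte")
print(ascii_delta)  # "<ce><94>"

# Characters that exist in the target encoding pass through unchanged.
plain <- iconv("abc", from = "UTF-8", to = "ASCII", sub = "byte")
print(plain)        # "abc"
```

The two bytes shown, ce and 94, are the UTF-8 encoding of U+0394.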

The R documentation for ?Quotes explains how you can sometimes use characters outside the locale (emphasis added):

Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit or an underscore, or with a period followed by a digit. Reserved words are not valid identifiers.
The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.
Such identifiers are also known as syntactic names and may be used directly in R code. Almost always, other names can be used provided they are quoted. The preferred quote is the backtick (`), and deparse will normally use it, but under many circumstances single or double quotes can be used (as a character constant will often be converted to a name). One place where backticks may be essential is to delimit variable names in formulae: see formula.
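The rules quoted above can be poked at from the console. This is only a sketch: is_syntactic is a hypothetical helper I made up (it is not part of base R), and it leans on make.names(), which rewrites any string that is not already a syntactic name.

```r
# Hypothetical helper: a name is syntactic if make.names() leaves it alone.
is_syntactic <- function(x) make.names(x) == x

# ASCII-only cases, so the answers do not depend on the locale.
results <- is_syntactic(c("x1", ".x", "1x", "_x", ".1x", "if"))
print(results)  # TRUE TRUE FALSE FALSE FALSE FALSE

# A non-syntactic name is still usable when quoted with backticks.
`1x` <- 5
print(`1x` + 1)  # 6

# Whether a non-ASCII character counts as a letter depends on the
# current locale, so this result can differ between machines.
print(is_syntactic("\u0394"))
```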

There is another way to get at such characters, which is to use a Unicode escape sequence (for example, \u0394 for Δ). This is usually a bad idea if you use the character for anything other than text on a plot (i.e. do not do this for variable or function names; see this quote from the R 2.7 release notes, when much of the current UTF-8 support was added):

If a string presented to the parser contains a \uxxxx escape that is invalid in the current locale, the string is recorded in UTF-8 with the encoding declared. This is likely to throw an error if it is used later in the session, but it can be printed, and used for, e.g., plotting on the windows() device. So "\u03b2" gives a Greek small beta and "\u2642" a "male sign". Such strings will be printed as, e.g., <U+2642>, except in the Rgui console (see below).
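As a rough illustration of "recorded in UTF-8 with the encoding declared", here is what base R reports for the beta from the release note. This is a sketch using only base functions; how the string is displayed when printed still depends on your console and locale.

```r
# Greek small beta, written with a Unicode escape.
beta <- "\u03b2"

code_point <- utf8ToInt(beta)             # 946, i.e. 0x03b2
raw_bytes  <- charToRaw(beta)             # ce b2 (two UTF-8 bytes)
n_chars    <- nchar(beta, type = "chars") # 1
n_bytes    <- nchar(beta, type = "bytes") # 2
declared   <- Encoding(beta)              # "UTF-8"

print(code_point)
print(raw_bytes)
print(c(chars = n_chars, bytes = n_bytes))
print(declared)
```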

I think this addresses most of your questions, although I don't know why there is a difference between the function name and function argument examples that you gave; hopefully someone more knowledgeable can chime in on that. FYI, on Linux, all of these different ways of assigning and calling a function work without error (because the system locale is UTF-8, so no translation is needed):

    Δ <- function(a,b) (a-b)/a         # no error
    `Δ` <- function(a,b) (a-b)/a       # no error
    "Δ" <- function(a,b) (a-b)/a       # no error
    "\u0394" <- function(a,b) (a-b)/a  # no error

    Δ(1:5, 9:13)         # -8.00 -4.00 -2.67 -2.00 -1.60
    `Δ`(1:5, 9:13)       # same
    "Δ"(1:5, 9:13)       # same
    "\u0394"(1:5, 9:13)  # same

    sessionInfo()
    # R version 3.1.2 (2014-10-31)
    # Platform: x86_64-pc-linux-gnu (64-bit)
    # locale:
    # LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
    # LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
    # LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
    # LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    # attached base packages:
    # stats graphics grDevices utils datasets methods base
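If you need the same script to behave on machines with different locales, one cautious approach (my suggestion, not something the R documentation prescribes) is to check the locale first, keep identifiers ASCII, and reserve the Unicode escape for display text. The names delta and label below are placeholders of my own.

```r
# Inspect the character-type locale of the current session.
ctype   <- Sys.getlocale("LC_CTYPE")
is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])
cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 locale:", is_utf8, "\n")

# ASCII identifier: parses the same way whatever the locale is.
delta <- function(a, b) (a - b) / a

# Keep the Greek letter in a string, for example for a plot label.
label <- "\u0394"

result <- delta(1:5, 9:13)
print(round(result, 2))  # -8.00 -4.00 -2.67 -2.00 -1.60
```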

Richie Cotton · Answer 2 · 2015-02-12T17:58:22+0000

For the record, under R-devel (2015-02-11 r67792), Win 7, UK English locale, I see:

    options(encoding = "UTF-8")

    `Δ` <- function(a,b) (a-b)/a
    ## Error: \uxxxx sequences not supported inside backticks (line 1)
    Δ <- function(a,b) (a-b)/a
    ## Error: unexpected input in "\"
    "Δ" <- function(a,b) (a-b)/a # OK

    `Δ`(1:5, 9:13)
    ## Error: \uxxxx sequences not supported inside backticks (line 1)
    Δ(1:5, 9:13)
    ## Error: unexpected input in "\"
    "Δ"(1:5, 9:13)
    ## Error: could not find function "Î""
