How can I match emoji with R regex?

Question


I want to determine which elements of my vector contain emoji:

 x = c('😂', 'no', '🍹', '😀', 'no', '😛', '䨺', '감사')
 x
 # [1] "\U0001f602" "no" "\U0001f379" "\U0001f600" "no" "\U0001f61b" "䨺" "감사"

Related posts cover only other languages, and since they mostly rely on specialized libraries, I couldn't work out how to translate them to R.

The second of them looked very promising, but alas (passing perl = TRUE does not fix it):

 x[grepl('[\u{1F600}-\u{1F6FF}]', x)]

Error: invalid \u{xxxx} sequence (line 1)

The approaches from the other questions fail in similar ways. How can we match emoji in R?


regex r utf-16 emoji

MichaelChirico Apr 12 '17 at 2:12


1 answer

PKumar · Accepted Answer · 2017-04-12T13:14:51+0000

I convert the encoding to UTF-8 so that the UTF-8 value of each element can be compared with the emoji values in the remoji library, which are also in UTF-8. I use the stringr library to find the positions of the emoji in the vector, but grep or any similar function would work as well.

1st method:

 library(stringr)
 xvect = c('😂', 'no', '🍹', '😀', 'no', '😛')
 Encoding(xvect) <- "UTF-8"
 # positions of the elements containing at least one non-ASCII character
 which(str_detect(xvect, "[^[:ascii:]]"))
 # [1] 1 3 4 6

Here elements 1, 3, 4 and 6 are the emoji.

Edit:

Second method: install the remoji package from GitHub using devtools with the commands below. Since the elements have already been converted to UTF-8, we can compare them against the UTF-8 values of all the emoji in the remoji library. Use trimws to remove the surrounding spaces.

 install.packages("devtools")
 devtools::install_github("richfitz/remoji")
 library(remoji)
 emj <- emoji(list_emoji(), TRUE)
 xvect %in% trimws(emj)

Output:

 which(xvect %in% trimws(emj))
 # [1] 1 3 4 6

Neither method is foolproof. The first assumes that the vector contains no non-ASCII characters other than emoji, so it would also flag elements such as '䨺' and '감사' from the question. The second depends entirely on what the remoji library knows about: if an emoji is missing from the library, the last command returns FALSE for it instead of TRUE .
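To make the first caveat concrete, here is a minimal sketch run on the question's full vector. It is a base-R stand-in for the non-ASCII test, not the stringr call itself: it uses utf8ToInt to check whether any code point is above 127. The vector is written with escapes so the snippet does not depend on file encoding; the code point used for the CJK character is an assumption, and any ideograph would behave the same way.

```r
# The question's vector, written with escapes instead of literal characters.
x <- c("\U0001F602", "no", "\U0001F379", "\U0001F600", "no", "\U0001F61B",
       "\u4A3A",          # a CJK ideograph (assumed code point), not an emoji
       "\uAC10\uC0AC")    # Hangul text, not an emoji

# Base-R stand-in for the non-ASCII test:
# TRUE when any code point in the element is above 127.
has_non_ascii <- vapply(
  x,
  function(s) any(utf8ToInt(s) > 127L),
  logical(1),
  USE.NAMES = FALSE
)

non_ascii_idx <- which(has_non_ascii)
print(non_ascii_idx)
```

Elements 7 and 8 are flagged alongside the emoji, which is why a plain non-ASCII test is only safe when the other elements are known to be ASCII.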

Final Edit:

Following the discussion between the OP ( @MichaelChirico ) and @SymbolixAU (thanks to both), the problem turns out to be a small typo: the escape needs a capital U. In R, \u takes at most four hex digits, while \U takes up to eight, and these code points need five. The working regular expression is xvect[grepl('[\U{1F300}-\U{1F6FF}]', xvect)] . The range in the character class runs from U+1F300 to U+1F6FF. You can of course change or extend this range if your emoji fall outside it. It is not a complete list, and the ranges that contain emoji may keep growing or changing over time.
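As a rough sketch of how the range can be extended, the snippet below applies a two-range character class to the question's full vector. Several things here are assumptions rather than part of the answer: the second range, U+1F900 to U+1F9FF (Supplemental Symbols and Pictographs), is only an illustrative addition; perl = TRUE is passed as a choice for this example; and the escapes use the eight-digit \U0001F300 form, which R treats the same as \U{1F300}.

```r
# The question's vector, written with escapes instead of literal characters.
x <- c("\U0001F602", "no", "\U0001F379", "\U0001F600", "no", "\U0001F61B",
       "\u4A3A",          # a CJK ideograph (assumed code point), not an emoji
       "\uAC10\uC0AC")    # Hangul text, not an emoji

# Character class covering two code point ranges.
# The first range is the one from the answer; the second is an
# illustrative extension and still not a complete emoji list.
emoji_class <- "[\U0001F300-\U0001F6FF\U0001F900-\U0001F9FF]"

emoji_idx <- which(grepl(emoji_class, x, perl = TRUE))
print(emoji_idx)
```

Unlike the non-ASCII test, this leaves the CJK and Hangul elements alone, at the cost of having to keep the ranges up to date.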