Can R read html-encoded emoji characters?

Question

My question, explained below, is how R can be used to read a string containing emoji HTML codes such as &#55358;&#56599; and either (1) represent the emoji character (for example, as the Unicode character 🤗) in the string to be analyzed, or (2) convert it to its text equivalent (":hugging face:").

Background

I have an XML dataset of text messages (from the Android / iOS application Signal, https://signal.org/) that I am reading into R for a text analysis project. The data looks like this, with each text message represented in an sms node:

    <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
    <!-- File Created By Signal -->
    <smses count="1">
      <sms protocol="0" address="+15555555555" contact_name="Jane Doe"
           date="1483256850399" readable_date="Sat, 31 Dec 2016 23:47:30 PST"
           type="1" subject="null" body="Hug emoji: &#55358;&#56599;"
           toa="null" sc_toa="null" service_center="null" read="1"
           status="-1" locked="0" />
    </smses>

Problem

I am currently reading this data using the xml2 package for R. When I use xml2::read_xml, however, I get the following error message:

 Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : xmlParseCharRef: invalid xmlChar value 55358 

As I understand it, this indicates that the emoji character is not recognized as valid XML.

Using xml2::read_html does work, but the emoji character is dropped. Here is a small example:

    example_text <- "Hugging emoji: &#55358;&#56599;"
    xml2::xml_text(xml2::read_html(paste0("<x>", example_text, "</x>")))

(Output: [1] "Hugging emoji: ")

This character is valid HTML: Googling &#55358;&#56599; actually converts it in the search box into the "hugging face" emoji and returns results related to that emoji.

Other information I found that seems relevant to this issue

I searched Qaru and did not find any questions related to this particular problem. I also failed to find a table that directly lists these HTML codes next to the emoji they represent, so I cannot convert the HTML codes to their text equivalents in a large (albeit inefficient) loop before parsing the dataset; for example, neither this list nor its underlying dataset seems to contain the string 55358.

4 answers

tl;dr: the emoji are not valid HTML entities; UTF-16 code units were used to build them instead of Unicode code points. I describe an algorithm at the bottom of this answer for converting them so that they are valid XML.


Problem identification

R definitely handles emoji.

In fact, there are several packages for handling emoji in R. For example, emojifont and emo both let you retrieve emoji based on Slack-style keywords. It's just a matter of getting your source characters out of the HTML-escaped format so that you can convert them.

xml2::read_xml seems to do just fine with other HTML entities, such as ampersands or double quotes. I looked at this SO answer to find out whether XML places any restrictions on HTML entities, and it seemed that emoji could be stored fine. So I tried changing the emoji codes in your reprex to the ones used in that answer:

 body="Hug emoji: &#128512;&#128515;" 

And, sure enough, they were preserved (although they are obviously not the hug emoji):

    > test8 = read_html('Desktop/test.xml')
    > test8 %>% xml_child() %>% xml_child() %>% xml_child() %>% xml_attr('body')
    [1] "Hug emoji: \U0001f600\U0001f603"

I looked up the hug emoji on this page, and the decimal HTML entity listed there is not &#55358;&#56599;. It looks like the UTF-16 decimal codes for the emoji have been wrapped in &# and ;.

In conclusion, I think the answer is that your emoji are, in fact, not valid HTML entities. If you cannot control the source, you may need some pre-processing to account for these errors.

So why does the browser convert them correctly? I wonder whether the browser is a little more flexible with these things and makes some guesses about what these codes might be. I'm just speculating, though.


Convert UTF-16 to Unicode Code Points

After some more in-depth research, it looks like actual emoji HTML entities use the Unicode code point (in decimal if it's &#...; or in hex if it's &#x...;). A Unicode code point is different from a UTF-8 or UTF-16 code. (This link explains a lot about how emoji and other characters are encoded differently, BTW! A good read.)

Therefore, we need to convert the UTF-16 codes used in your source data to Unicode code points. Referring to this Wikipedia article on UTF-16, I checked how this is done. Each Unicode code point above 0xFFFF (our goal) becomes, once 0x10000 is subtracted from it, a 20-bit number, or five hexadecimal digits. To go from Unicode to UTF-16, you split that number into two 10-bit numbers (the middle hexadecimal digit is cut in half, with two of its bits going to each block), then add 0xd800 to the first block and 0xdc00 to the second.

Going backwards, as you want to, is done as follows:

  • Your decimal UTF-16 number (which is currently in two separate blocks) is 55358 56599
  • Converting these blocks to hexadecimal (separately) gives 0x0d83e 0x0dd17
  • Subtracting 0xd800 from the first block and 0xdc00 from the second gives 0x3e 0x117
  • Converting them to binary, padding each to 10 bits and concatenating them gives 0b0000 1111 1001 0001 0111
  • Converting this back to hex gives 0x0f917
  • Finally, adding 0x10000 gives 0x1f917
  • Therefore, our (hexadecimal) HTML entity is &#x1f917;. Or, in decimal, &#129303; (🤗)
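The steps above reduce to a single line of arithmetic, because padding to 10 bits and concatenating is the same as multiplying the first block by 0x400 (2^10) and adding the second. Here is a minimal sketch in base R; the function name surrogate_pair_to_code_point is made up for illustration and is not part of any package:

```r
# Combine a UTF-16 surrogate pair (two decimal code units) into one
# Unicode code point. Hypothetical helper name; base R only.
surrogate_pair_to_code_point <- function(high, low) {
  # Refuse anything that is not a high surrogate followed by a low surrogate.
  stopifnot(high >= 0xD800, high <= 0xDBFF, low >= 0xDC00, low <= 0xDFFF)
  # (high - 0xD800) carries the top 10 bits, (low - 0xDC00) the bottom 10 bits.
  as.integer((high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000)
}

code_point <- surrogate_pair_to_code_point(55358, 56599)
sprintf("&#x%x;", code_point)  # "&#x1f917;"
sprintf("&#%d;", code_point)   # "&#129303;"
```

The range check is a design choice: it makes the helper fail loudly on input that is not a surrogate pair rather than return a meaningless number.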

So, in order to pre-process this dataset, you need to extract the existing numbers, apply the algorithm above, and then put the result back (with one &#...; rather than two).
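One possible way to do that pre-processing on the raw text, before handing it to a parser, is sketched below using only base R. The function name fix_surrogate_entities is hypothetical, and the sketch assumes the whole document is held in a single string and that surrogate pairs always appear as two directly adjacent decimal entities, as in the sample data:

```r
# Replace each "&#high;&#low;" surrogate pair in one string with a single
# "&#xHEX;" entity. Hypothetical helper name; base R only.
fix_surrogate_entities <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  m <- gregexpr("&#[0-9]+;", x)
  entities <- regmatches(x, m)[[1]]
  if (length(entities) < 2) return(x)

  starts <- as.integer(m[[1]])
  ends   <- starts + attr(m[[1]], "match.length")
  codes  <- as.numeric(gsub("[^0-9]", "", entities))

  is_high <- codes >= 0xD800 & codes <= 0xDBFF
  is_low  <- codes >= 0xDC00 & codes <= 0xDFFF

  # A pair is a high surrogate immediately followed by a low surrogate.
  i <- seq_len(length(entities) - 1)
  pair <- i[is_high[i] & is_low[i + 1] & ends[i] == starts[i + 1]]
  if (length(pair) == 0) return(x)

  code_points <- (codes[pair] - 0xD800) * 0x400 +
    (codes[pair + 1] - 0xDC00) + 0x10000
  entities[pair] <- sprintf("&#x%x;", as.integer(code_points))
  entities[pair + 1] <- ""  # the low half is absorbed into the new entity

  regmatches(x, m) <- list(entities)
  x
}

fix_surrogate_entities("Hug emoji: &#55358;&#56599;")
# [1] "Hug emoji: &#x1f917;"
```

Entities that are already valid (such as &#128512;) are left untouched. For a file, the lines could first be joined with something like paste(readLines(path), collapse = "\n"), and the cleaned string then passed to the XML parser.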


Display emoji in R

As far as I know, there is no solution for printing emoji in the R console: they always come out as "\U0001f600" (or whatever you have). However, the packages described above can help you plot your emoji in some circumstances (I hope to extend ggflags to colour emoji at some point). They can also help you search for emoji to get their codes, but AFAIK they cannot get names given the codes. But maybe you could try importing the emoji list from emojilib into R and joining it to your data frame, if you have pulled the emoji codes into a column, to get the English names.
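Even when the console shows only the escape, the string itself does hold the real character. As a small sketch in base R (the variable name hug is only illustrative), a code point can be turned into a one-character string and back:

```r
# Build the character from its code point and recover the code point again.
hug <- intToUtf8(0x1F917L)
utf8ToInt(hug) == 0x1F917  # TRUE: the string contains U+1F917

# Write it out as UTF-8; whether a glyph or an escape appears depends on
# the console, locale and font in use.
cat(hug, "\n")
```

This only demonstrates the round trip between code points and strings; it does not change how a given console renders the character.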


I have implemented the algorithm described by rensa above in R and am sharing it here. I am happy to release the code below under a CC0 dedication (i.e., placing this implementation in the public domain for free reuse).

This is a quick and unpolished implementation of rensa's algorithm, but it works!

    library(stringr)   # str_match_all() is called without a namespace below
    library(magrittr)  # provides the %>% pipe

    utf16_double_dec_code_to_utf8 <- function(utf16_decimal_code){
      # Pull the two decimal numbers out of "&#...;&#...;"
      string_elements <- str_match_all(utf16_decimal_code, "&#(.*?);")[[1]][,2]

      string3a <- string_elements[1]
      string3b <- string_elements[2]

      # Convert each block to hexadecimal
      string4a <- sprintf("0x0%x", as.numeric(string3a))
      string4b <- sprintf("0x0%x", as.numeric(string3b))

      # Subtract the surrogate offsets
      string5a <- paste0(
        # "0x",
        as.hexmode(string4a) - 0xd800
      )
      string5b <- paste0(
        # "0x",
        as.hexmode(string4b) - 0xdc00
      )

      # Convert to binary, pad each block to 10 bits and concatenate
      string6 <- paste0(
        stringi::stri_pad(
          paste0(BMS::hex2bin(string5a), collapse = ""),
          10,
          pad = "0"
        ) %>%
          stringr::str_trunc(10, side = "left", ellipsis = ""),
        stringi::stri_pad(
          paste0(BMS::hex2bin(string5b), collapse = ""),
          10,
          pad = "0"
        ) %>%
          stringr::str_trunc(10, side = "left", ellipsis = "")
      )

      # Back to hexadecimal, then add 0x10000
      string7 <- BMS::bin2hex(as.numeric(strsplit(string6, split = "")[[1]]))

      string8 <- as.hexmode(string7) + 0x10000

      unicode_pattern <- string8
      unicode_pattern
    }

    make_unicode_entity <- function(x) {
      paste0("\\U000", utf16_double_dec_code_to_utf8(x))
    }
    make_html_entity <- function(x) {
      paste0("&#x", utf16_double_dec_code_to_utf8(x), ";")
    }

    # An example string, using the "hug" emoji:
    example_string <- "test &#55358;&#56599; test"

    output_string <- stringr::str_replace_all(
      example_string,
      "(&#[0-9]*?;){2}",  # Find all two-character "&#...;&#...;" codes.
      make_unicode_entity
      # make_html_entity
    )

    cat(output_string)

    # To print Unicode string (doesn't display in R console, but can be copied and
    # pasted elsewhere):
    # (This assumes you've used 'make_unicode_entity' above in the str_replace_all
    # call):
    stringi::stri_unescape_unicode(output_string)

JavaScript solution

I had exactly the same problem, but I needed a solution in JavaScript, not R. Using @rensa's answer above (very useful!), I wrote the following code to solve the problem, and I wanted to share it in case anyone else comes across this thread as I did but needs the solution in JavaScript.

    str.replace(/(&#\d+;){2}/g, function(match) {
        match = match.replace(/&#/g,'').split(';');
        var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
        var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
        binFirst = '0000000000'.substr(binFirst.length) + binFirst;
        binSecond = '0000000000'.substr(binSecond.length) + binSecond;
        return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
    });

And here is a complete working snippet if you want to run it:

    var str = '&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;';
    str = str.replace(/(&#\d+;){2}/g, function(match) {
        match = match.replace(/&#/g,'').split(';');
        var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
        var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
        binFirst = '0000000000'.substr(binFirst.length) + binFirst;
        binSecond = '0000000000'.substr(binSecond.length) + binSecond;
        return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
    });
    document.getElementById('result').innerHTML = str;

    // &#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;
    // is turned into
    // &#x1f60a;&#x1f618;&#x1f600;&#x1f606;&#x1f602;&#x1f601;
    // which is rendered by the browser as the emojis

    <div>Original:<br> &#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;</div><br>
    Result:<br>
    <div id='result'></div>

My SMS XML Parser application now works fine, but it stalls on large XML files, so I am thinking of rewriting it in PHP. If / when I do, I will post that code as well.


I translated Chad's JavaScript answer to Go, as I had the same problem too but needed a solution in Go.

https://play.golang.org/p/h9JBFzqcd90

    package main

    import (
        "fmt"
        "html"
        "regexp"
        "strconv"
        "strings"
    )

    func main() {
        emoji := "&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;"

        // Match two consecutive decimal entities (a surrogate pair).
        re := regexp.MustCompile(`(&#\d+;){2}`)
        matches := re.FindAllString(emoji, -1)

        var builder strings.Builder

        for _, match := range matches {
            s := strings.Replace(match, "&#", "", -1)
            parts := strings.Split(s, ";")
            a := parts[0]
            b := parts[1]

            c, err := strconv.Atoi(a)
            if err != nil {
                panic(err)
            }
            d, err := strconv.Atoi(b)
            if err != nil {
                panic(err)
            }

            // Remove the surrogate offsets.
            c = c - 0xd800
            d = d - 0xdc00

            e := strconv.FormatInt(int64(c), 2)
            f := strconv.FormatInt(int64(d), 2)

            // Left-pad each half to 10 bits before concatenating.
            g := "0000000000"[len(e):] + e
            h := "0000000000"[len(f):] + f

            j, err := strconv.ParseInt(g+h, 2, 64)
            if err != nil {
                panic(err)
            }

            k := j + 0x10000

            _, err = builder.WriteString("&#x" + strconv.FormatInt(k, 16) + ";")
            if err != nil {
                panic(err)
            }
        }

        // The original entities, unescaped as-is, for comparison.
        fmt.Println(html.UnescapeString(emoji))

        // The converted entities, unescaped into the actual emoji.
        emoji = html.UnescapeString(builder.String())
        fmt.Println(emoji)
    }

Source: https://habr.com/ru/post/1274563/

