tl;dr: Your emoji are not valid HTML entities; they were built from UTF-16 values rather than Unicode code points. I describe an algorithm at the bottom of the answer for converting them so that they are valid XML.
Problem identification
R definitely handles emoji:

There are actually several packages for handling emoji in R. For example, emojifont and emo both let you look up emoji by their Slack-style keywords. The problem is just getting your source characters out of their HTML-escaped format so you can convert them.
xml2::read_xml seems to cope just fine with other HTML entities, like ampersands or double quotes. I looked at this SO answer to see whether there are any restrictions on HTML entities in XML, and emoji seemed to be stored fine there. So I tried swapping the emoji codes in your reprex for the ones in that answer:
body="Hug emoji: 😀😃"
And, sure enough, they parsed fine (although they are obviously not hugging emoji):
    > test8 = read_html('Desktop/test.xml')
    > test8 %>% xml_child() %>% xml_child() %>% xml_child() %>% xml_attr('body')
    [1] "Hug emoji: \U0001f600\U0001f603"
I looked up the hugging emoji on this page, and the decimal HTML entity listed there doesn't match the ones in your data. It looks like the UTF-16 decimal codes for the emoji have been wrapped in &# and ; instead.
In conclusion, I think the answer is that your emoji are, in fact, not valid HTML entities. If you cannot control the source, you may need some pre-processing to account for these errors.
So why does the browser convert them correctly? I wonder if browsers are a bit more lenient about these things and make sensible guesses about what the codes might be. That's just speculation, though.
Convert UTF-16 to Unicode Code Points
After some more digging, it looks like genuine emoji HTML entities use the Unicode code point (in decimal if it's &#...; or hex if it's &#x...;). The Unicode code point is different from the UTF-8 or UTF-16 encoding. (This link explains a lot about how emoji and other characters are encoded differently, by the way! Worth a read.)
Therefore, we need to convert the UTF-16 codes used in your source data into Unicode code points. Referring to this Wikipedia article on UTF-16, I worked out how this is done. Each Unicode code point (our target) is a 20-bit number, or five hexadecimal digits. To go from a code point to UTF-16, you split it into two 10-bit numbers (the middle hexadecimal digit gets split in half, with two bits going to each block), do some arithmetic on each, and there's your result.
Going backwards, as you want to do here, works as follows:

- Your decimal numbers are UTF-16 codes, currently in two separate blocks: 55358 56599
- Converting these blocks to hexadecimal (separately) gives 0xd83e 0xdd17
- Subtract 0xd800 from the first block and 0xdc00 from the second, giving 0x3e 0x117
- Convert these to binary, pad each to 10 bits, and concatenate them: 0b0000 1111 1001 0001 0111
- Convert this back to hex: 0xf917
- Finally, add 0x10000, giving 0x1f917
- Therefore, our (hexadecimal) HTML entity is &#x1f917;. Or, in decimal, &#129303;
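The steps above can be sketched in base R (the helper function name here is my own, written for illustration; bitwShiftL and bitwOr are base R):

```r
# Convert a UTF-16 surrogate pair (two decimal numbers) to a Unicode code point.
surrogate_pair_to_codepoint <- function(high, low) {
  hi <- high - 0xD800                       # recover the top 10 bits
  lo <- low  - 0xDC00                       # recover the bottom 10 bits
  bitwOr(bitwShiftL(hi, 10), lo) + 0x10000  # recombine and add the offset
}

cp <- surrogate_pair_to_codepoint(55358, 56599)
sprintf("&#x%x;", cp)  # the hex entity:     "&#x1f917;"
sprintf("&#%d;", cp)   # the decimal entity: "&#129303;"
```

The binary padding and concatenation from the list collapses into a single 10-bit left shift plus a bitwise OR.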
So, to pre-process this data set, you need to extract the existing numbers, run the algorithm above on each pair, and substitute the result back in (as a single &#...; entity rather than two).
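As a rough illustration (the regex and function name are my own, and this assumes the surrogates always appear as two adjacent decimal entities), the pre-processing could look something like:

```r
# Replace each surrogate-pair entity like "&#55358;&#56599;" with the
# single code-point entity "&#129303;". Sketch only.
fix_surrogate_entities <- function(x) {
  pat <- "&#(\\d+);&#(\\d+);"
  m <- gregexpr(pat, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(pairs) {
    vapply(pairs, function(p) {
      n <- as.integer(regmatches(p, gregexpr("\\d+", p))[[1]])
      # Only rewrite genuine high/low surrogate pairs
      if (n[1] >= 0xD800 && n[1] <= 0xDBFF && n[2] >= 0xDC00 && n[2] <= 0xDFFF) {
        cp <- bitwOr(bitwShiftL(n[1] - 0xD800, 10), n[2] - 0xDC00) + 0x10000
        sprintf("&#%d;", cp)
      } else {
        p  # not a surrogate pair: leave untouched
      }
    }, character(1))
  })
  x
}

fix_surrogate_entities('body="Hug emoji: &#55358;&#56599;"')
# 'body="Hug emoji: &#129303;"'
```

The range check matters: two adjacent but valid entities (like &#128512;&#128515;) also match the regex, and should pass through unchanged.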
Display emoji in R
As far as I know, there is no solution for printing emoji in the R console: they always come out as "\U0001f600" (or whatever you have). However, the packages described above can help you plot emoji in some circumstances (an emoji equivalent of ggflags would be nice at some point). They can also help you search emoji to get their codes, but AFAIK they cannot give you names from codes. You could, however, try importing the emoji list from emojilib into R and joining it to your data frame, if you have extracted your emoji into a column, to get the English names.
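One base-R detail worth knowing here: intToUtf8() turns a code point into the actual character, and what you see then depends on how it is displayed. print() shows an escape like "\U0001f917" on many systems, while cat() writes the raw character, which your terminal may or may not render as the emoji glyph:

```r
# Build the hugging-face character from the code point derived above
hug <- intToUtf8(0x1f917)
print(hug)      # often shown escaped, e.g. "\U0001f917"
cat(hug, "\n")  # raw character; rendering depends on your terminal and font
```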