How to identify / delete non-UTF-8 characters in R

When I import a Stata dataset into R (using the foreign package), the import sometimes contains invalid UTF-8 characters. This is unpleasant in itself, but everything breaks as soon as I try to convert the object to JSON (using the rjson package).

How can I identify invalid UTF-8 characters in a string and then delete them?
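For reference, a minimal example of the kind of failure I mean (a sketch; the string is made up, but the byte \xE7 is not valid UTF-8 on its own):

    library(rjson)
    x <- "fa\xE7ile"          # \xE7 is latin1 for ç, not a valid UTF-8 sequence
    Encoding(x) <- "UTF-8"    # mark the bytes as UTF-8, as the import effectively does
    toJSON(list(label = x))   # this is where things break for me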

4 answers

One solution is to use iconv and its sub argument: if sub is a character string (and not NA), it is used to replace any non-convertible bytes in the input. Here I set it to ''.

 x <- "fa\xE7ile" Encoding(x) <- "UTF-8" iconv(x, "UTF-8", "UTF-8",sub='') ## replace any non UTF-8 by '' "faile" 

Note that if we instead declare the correct source encoding, the character is converted rather than dropped:

 x <- "fa\xE7ile" Encoding(x) <- "latin1" xx <- iconv(x, "latin1", "UTF-8",sub='') facile 

Instead of deleting them, you can try converting them to valid UTF-8 strings using iconv.

    require(foreign)
    dat <- read.dta("data.dta")
    for (j in seq_len(ncol(dat))) {
      if (class(dat[, j]) == "factor")
        levels(dat[, j]) <- iconv(levels(dat[, j]), from = "latin1", to = "UTF-8")
    }

You can replace latin1 with whatever encoding is more suitable in your case. Since we don't have access to your data, it's hard to know which one that is.
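If you're not sure which encoding to pass to iconv, the stringi package can make an educated guess (a sketch; detection is heuristic, and stringi is not part of the answer above):

    library(stringi)
    # assumes the first column of dat is one of the problematic factors
    stri_enc_detect(paste(levels(dat[, 1]), collapse = " "))
    # returns candidate encodings (e.g. ISO-8859-1) with confidence scores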


Another approach removes the bad characters with dplyr across the entire data set:

 library(dplyr) MyDate %>% mutate_at(vars(MyTextVar1, MyTextVar2), function(x){gsub('[^ -~]', '', x)}) 

Here MyData is the data set and MyTextVar1/MyTextVar2 are the text variables to clean. Note that the pattern [^ -~] strips everything outside the printable ASCII range, not just invalid UTF-8 bytes. This may be less precise than fixing the encoding, but it is often simpler to just remove the offending characters.
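mutate_at is superseded in dplyr 1.0+; the same idea with across would look like this (a sketch using the same made-up names):

    library(dplyr)
    MyData %>%
      mutate(across(c(MyTextVar1, MyTextVar2), ~ gsub('[^ -~]', '', .x)))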


Yihui's xfun package has a read_utf8 function that tries to read a file, assuming it is encoded as UTF-8. If the file contains non-UTF-8 lines, a warning is issued telling you which lines contain non-UTF-8 characters. Under the hood it uses the non-exported function xfun:::invalid_utf8(), which is simply the following: which(!is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))).
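Usage looks roughly like this (a sketch; "data.txt" is a placeholder path):

    library(xfun)
    lines <- read_utf8("data.txt")
    # if any line is not valid UTF-8, read_utf8() warns and names those lines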

To detect specific non-UTF-8 words in a string, you can slightly modify the above and do something like:

    invalid_utf8_ <- function(x) {
      !is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))
    }

    detect_invalid_utf8 <- function(string, separator) {
      stringSplit <- unlist(strsplit(string, separator))
      invalidIndex <- unlist(lapply(stringSplit, invalid_utf8_))
      data.frame(
        word = stringSplit[invalidIndex],
        stringIndex = which(invalidIndex)
      )
    }

    x <- "This is a string fa\xE7ile blah blah blah fa\xE7ade"
    detect_invalid_utf8(x, " ")
    #     word stringIndex
    # 1 façile           5
    # 2 façade           9
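And if you want to drop the offending words rather than just locate them, a small variant of the same idea (a sketch building on invalid_utf8_ above, which is already vectorized):

    remove_invalid_utf8 <- function(string, separator = " ") {
      stringSplit <- unlist(strsplit(string, separator))
      paste(stringSplit[!invalid_utf8_(stringSplit)], collapse = separator)
    }
    remove_invalid_utf8(x)
    # "This is a string blah blah blah"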

Source: https://habr.com/ru/post/1488013/

