How to identify / delete non-UTF-8 characters in R

When I import a Stata dataset into R (using the foreign package), the import sometimes contains invalid UTF-8 characters. This is unpleasant in itself, but everything breaks as soon as I try to convert the object to JSON (using the rjson package).

How can I identify invalid UTF-8 characters in a string and then delete them?
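For reference, a minimal example of the kind of failure I mean (a sketch; the string is made up, but the byte \xE7 is not valid UTF-8 on its own):

    library(rjson)
    x <- "fa\xE7ile"          # \xE7 is latin1 for ç, not a valid UTF-8 sequence
    Encoding(x) <- "UTF-8"    # mark the bytes as UTF-8, as the import effectively does
    toJSON(list(label = x))   # this is where things break for me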

4 answers

One solution is to use iconv and its sub argument: if sub is a character string (and not NA), it is used to replace any non-convertible bytes in the input. Here I set it to ''.

 x <- "fa\xE7ile" Encoding(x) <- "UTF-8" iconv(x, "UTF-8", "UTF-8",sub='') ## replace any non UTF-8 by '' "faile" 

Note that if we instead declare the correct source encoding, the character is converted rather than dropped:

 x <- "fa\xE7ile" Encoding(x) <- "latin1" xx <- iconv(x, "latin1", "UTF-8",sub='') facile 

Instead of deleting them, you can try converting them to valid UTF-8 strings using iconv.

    require(foreign)
    dat <- read.dta("data.dta")
    for (j in seq_len(ncol(dat))) {
      if (class(dat[, j]) == "factor")
        levels(dat[, j]) <- iconv(levels(dat[, j]), from = "latin1", to = "UTF-8")
    }

You can replace latin1 with whatever encoding is more suitable in your case. Since we don't have access to your data, it's hard to know which one that is.
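If you're not sure which encoding to pass to iconv, the stringi package can make an educated guess (a sketch; detection is heuristic, and stringi is not part of the answer above):

    library(stringi)
    # assumes the first column of dat is one of the problematic factors
    stri_enc_detect(paste(levels(dat[, 1]), collapse = " "))
    # returns candidate encodings (e.g. ISO-8859-1) with confidence scores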


Another approach removes the bad characters with dplyr across the entire data set:

 library(dplyr) MyDate %>% mutate_at(vars(MyTextVar1, MyTextVar2), function(x){gsub('[^ -~]', '', x)}) 

Here MyData is the data set and MyTextVar1/MyTextVar2 are the text variables to clean. Note that the pattern [^ -~] strips everything outside the printable ASCII range, not just invalid UTF-8 bytes. This may be less precise than fixing the encoding, but it is often simpler to just remove the offending characters.
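mutate_at is superseded in dplyr 1.0+; the same idea with across would look like this (a sketch using the same made-up names):

    library(dplyr)
    MyData %>%
      mutate(across(c(MyTextVar1, MyTextVar2), ~ gsub('[^ -~]', '', .x)))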


Yihui's xfun package has a read_utf8 function that tries to read a file, assuming it is encoded as UTF-8. If the file contains non-UTF-8 lines, a warning is issued telling you which lines contain non-UTF-8 characters. Under the hood it uses the non-exported function xfun:::invalid_utf8(), which is simply the following: which(!is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))).
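Usage looks roughly like this (a sketch; "data.txt" is a placeholder path):

    library(xfun)
    lines <- read_utf8("data.txt")
    # if any line is not valid UTF-8, read_utf8() warns and names those lines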

To detect specific non-UTF-8 words in a string, you can slightly modify the above and do something like:

    invalid_utf8_ <- function(x) {
      !is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))
    }

    detect_invalid_utf8 <- function(string, separator) {
      stringSplit <- unlist(strsplit(string, separator))
      invalidIndex <- unlist(lapply(stringSplit, invalid_utf8_))
      data.frame(
        word = stringSplit[invalidIndex],
        stringIndex = which(invalidIndex)
      )
    }

    x <- "This is a string fa\xE7ile blah blah blah fa\xE7ade"
    detect_invalid_utf8(x, " ")
    #     word stringIndex
    # 1 façile           5
    # 2 façade           9
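And if you want to drop the offending words rather than just locate them, a small variant of the same idea (a sketch building on invalid_utf8_ above, which is already vectorized):

    remove_invalid_utf8 <- function(string, separator = " ") {
      stringSplit <- unlist(strsplit(string, separator))
      paste(stringSplit[!invalid_utf8_(stringSplit)], collapse = separator)
    }
    remove_invalid_utf8(x)
    # "This is a string blah blah blah"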

Source: https://habr.com/ru/post/1488013/

