Yihui Xie's xfun package has a read_utf8() function that reads a file on the assumption that it is encoded as UTF-8. If the file contains non-UTF-8 lines, it issues a warning reporting which lines contain invalid characters. Under the hood it calls the non-exported function xfun:::invalid_utf8(), which is simply the following: which(!is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))).
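To see what that one-liner does, here is a minimal sketch applied to a small character vector. The sample vector `lines` is made up for illustration, and the result assumes that `iconv()` returns `NA` for input it cannot convert, which is its documented default but can vary with the platform's iconv implementation.

```r
# Hypothetical sample input: element 2 contains a lone Latin-1 byte (\xE7),
# element 3 is NA, element 4 is valid UTF-8.
lines <- c("plain ascii", "fa\xE7ade", NA, "caf\u00e9")

# Same expression as in xfun:::invalid_utf8(): a non-missing element that
# fails a UTF-8 to UTF-8 round trip is treated as invalid.
invalid <- which(!is.na(lines) & is.na(iconv(lines, "UTF-8", "UTF-8")))
invalid
```

The `!is.na(lines)` part is what keeps genuinely missing values from being reported as encoding errors.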
To detect specific non-UTF-8 words in a string, you can slightly modify the above and do something like:
# TRUE for each non-missing element that is not valid UTF-8
invalid_utf8_ <- function(x) {
  !is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))
}

# Split `string` on `separator` and report the invalid words with their positions
detect_invalid_utf8 <- function(string, separator) {
  stringSplit <- unlist(strsplit(string, separator))
  invalidIndex <- invalid_utf8_(stringSplit)  # already vectorised, no lapply needed
  data.frame(
    word = stringSplit[invalidIndex],
    stringIndex = which(invalidIndex)
  )
}

x <- "This is a string fa\xE7ile blah blah blah fa\xE7ade"
detect_invalid_utf8(x, " ")
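As an alternative to the iconv() round trip, base R (3.3.0 and later) also provides validUTF8(), which is not used in the code above. The sketch below assumes the words have already been split, and the vector `words` is a hypothetical example.

```r
# Hypothetical, already-split words; two of them contain a lone Latin-1 byte.
words <- c("This", "fa\xE7ile", "blah", "fa\xE7ade")

# validUTF8() returns TRUE for elements whose bytes form valid UTF-8.
bad <- !validUTF8(words)

# Positions of the invalid words within `words`.
badIndex <- which(bad)
badIndex
```

Note that validUTF8() does not treat NA specially the way the `!is.na(x)` guard above does, so check for missing values separately if your input may contain them.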