Removing text containing a non-English character

This is my sample data:

Name <- c("apple firm","苹果 firm","Ãpple firm")
Rank <- c(1,2,3)
data <- data.frame(Name,Rank)

I would like to remove the name containing the non-English character. For this sample, only the "apple firm" should remain.

I tried using the tm package, but it only helps me remove the non-English characters themselves rather than the whole entries.

+4
3 answers

I would look at this related post, which does the same in JavaScript: Regular expression to match non-English characters?

To translate this to R, you can do (for non-ASCII matching):

res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)),]

res
# A tibble: 1 × 2
#        Name  Rank
#       <chr> <dbl>
#1 apple firm     1

And the same match written with Unicode escapes, from the same SO post:

res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)),]

res
# A tibble: 1 × 2
#        Name  Rank
#       <chr> <dbl>
#1 apple firm     1

Note that both ranges intentionally start at \u0001 / \x01 rather than \u0000 / \x00, so the NUL character is not matched.
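For reference, here is what the intermediate grepl() call returns on the sample data (a minimal sketch; it just makes explicit which rows the subsetting above drops):

grepl("[^\x01-\x7F]", data$Name)   # TRUE for names with at least one non-ASCII character
# [1] FALSE  TRUE  TRUE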

+6

You can use stri_enc_isascii from the stringi package:

library(stringi)
stri_enc_isascii(data$Name)
# [1]  TRUE FALSE FALSE

As the function name suggests, it checks whether a string is ASCII, i.e. contains only characters with code points 1, 2, ..., 127 (see ?stri_enc_isascii).
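To actually filter the data frame, the logical vector can be used directly for subsetting (a minimal sketch building on the same sample data):

library(stringi)
res <- data[stri_enc_isascii(data$Name), ]   # keep only rows whose Name is pure ASCII
res
#         Name Rank
# 1 apple firm    1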

+7

You can also use iconv, converting to ASCII so that names containing non-convertible characters become NA:

library(dplyr)
data <- data %>% 
         mutate(Name = iconv(Name, from = "latin1", to = "ASCII")) %>%
         filter(!is.na(Name))

The mutate step converts Name from latin1 to ASCII with iconv. latin1, aka ISO 8859-1, is the assumed source encoding; any name containing characters that exist in latin1 but not in ASCII becomes NA, and the filter then drops those rows.
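If you prefer not to pull in dplyr, the same idea works in base R (a sketch of the equivalent steps, assuming the same sample data):

converted <- iconv(data$Name, from = "latin1", to = "ASCII")   # non-convertible strings become NA
res <- data[!is.na(converted), ]
res
#         Name Rank
# 1 apple firm    1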

+4
