Removing spaces inside English words with regular expressions in R

The keyword list contains Chinese characters and English words, as shown below:

 [1] " 服务 接口 知识 组织 开放 查询 语义 推理 Web 服务 "
 [2] " Solr 分面 搜索 标准 信息管理 "
 [3] " 语义 W i k i 标注 导航 检索 S e m a n t i c M e d i a W i k i P A U X I k e W i k i "
 [4] " Liferay 主从 模式 集成 知识 平台 "
 [5] " 数据 摄取 SKE 本体 属性 映射 三元组 存储 "

Some English words have a space between each character (for example, in line [3]): " W i k i ", " S e m a n t i c M e d i a W i k i ", " P A U X ", " I k e W i k i ". Between some of these words there are more than two spaces. I want to remove the gaps inside these English words to get " Wiki ", " SemanticMediaWiki ", " PAUX ", " IkeWiki ", while keeping all the other words unchanged. I tried gsub: " kwdict <- gsub("^[[:alpha:][:blank:]]+", "\\w", kwdict) ". But whether I use "\\w" or "[[:alpha:]]", the result is wrong: all the words get changed. How can I precisely select these English words and remove the spaces inside them? The expected result is:

 [1] " 服务 接口 知识 组织 开放 查询 语义 推理 Web 服务 " [2] " Solr 分面 搜索 标准 信息管理 " [3] " 语义 Wiki 标注 导航 检索 SemanticMediaWiki PAUX IkeWiki " [4] " Liferay 主从 模式 集成 知识 平台 " [5] " 数据 摄取 SKE 本体 属性 映射 三元组 存储 " 

I have tried the following statements in R, each one separately:

 kwdict <- gsub("[[:alpha:]/[:space:]{1}]", "", kwdict)
 kwdict <- gsub("[^[:alpha:]_[:space:]]{1}", "", kwdict)
 kwdict <- gsub("[^[:alpha:][:space:]]{1}", "", kwdict)
 kwdict <- gsub("[^[:alpha:][:space:]{1}^[:alpha:]]", "", kwdict)
 kwdict <- gsub("[//>[:space:]{1}]", "", kwdict)
 kwdict <- gsub("[[:alpha:][:space:]{1}]", "", kwdict)

But none of them worked: they either deleted all the spaces or even wiped out all the words! I suspect the problem is how I combined the "[:alpha:]" class with the space character in the pattern. Is there a way to define this pattern correctly in R?

2 answers

Thanks to some comments by @赵鸿丰 and @waterling.

I think I found the source of your problem. The characters that look like English letters are not ASCII at all: they are the fullwidth (wide Latin) forms of the upper- and lowercase English alphabet. Some of the words, however, are written in ordinary English ("Solr" and "Liferay").

Run the command below to mark the strings as UTF-8 (you may not need to do this; it is simply more convenient for me to look at things as UTF-8, and googling gives me better results in terms of UTF-8 code points):

 string <- c(" 服务 接口 知识 组织 开放 查询 语义 推理 Web 服务 ",
             " Solr 分面 搜索 标准 信息管理 ",
             " 语义 W i k i 标注 导航 检索 S e m a n t i c M e d i a W i k i P A U X I k e W i k i ",
             " Liferay 主从 模式 集成 知识 平台 ",
             " 数据 摄取 SKE 本体 属性 映射 三元组 存储 ")
 Encoding(string) <- "UTF-8"

Once you run the command above, you will see the UTF-8 code points behind these characters. I searched the Internet to find out what these values mean and came across this site, which helped me understand the relevant UTF-8 ranges.
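As a quick check (my own addition, using only base R), you can inspect the code points with utf8ToInt; the fullwidth Latin letters fall in the ranges U+FF21 to U+FF3A (uppercase) and U+FF41 to U+FF5A (lowercase):

```r
# Inspect the code points of one of the spaced-out words.
# "\uFF37\uFF49\uFF4B\uFF49" is "Wiki" written in fullwidth Latin letters.
sprintf("U+%04X", utf8ToInt("\uFF37\uFF49\uFF4B\uFF49"))
# [1] "U+FF37" "U+FF49" "U+FF4B" "U+FF49"
```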

So I wrote a small regular expression to solve your problem. I used the stringr library, but you can use any library, or base R's gsub, to do the same.

 value <- str_replace_all(string,'(?<=[\U{FF41}-\U{FF5A}]|[\U{FF21}-\U{FF3A}])\\s*',"") 

To understand the regex:

The character classes (in square brackets) contain the Unicode ranges of the fullwidth LATIN uppercase and lowercase letters (which I found on the site above). I put them inside a lookbehind assertion, followed by \\s*, which matches spaces; the matched spaces are then replaced with nothing. That gives the result shown below, which I hope is what you expect. Also, since you may not see these characters properly on your console, you can use the str_view_all function to render them as HTML. I copied and pasted only the results.
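For reference, base R's gsub supports the same lookbehind when perl = TRUE, so an equivalent call would be (a sketch, using the same fullwidth ranges on a small sample input of my own):

```r
# Sample input: fullwidth "Wiki" with a space after each letter
x <- " 语义 \uFF37 \uFF49 \uFF4B \uFF49 标注 "
# Remove any whitespace that follows a fullwidth Latin letter
gsub("(?<=[\uFF21-\uFF3A\uFF41-\uFF5A])\\s*", "", x, perl = TRUE)
```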

 服务 接口 知识 组织 开放 查询 语义 推理 Web 服务Solr 分面 搜索 标准 信息管理语义 Wiki标注 导航 检索 SemanticMediaWikiPAUXIkeWiki Liferay 主从 模式 集成 知识 平台数据 摄取 SKE 本体 属性 映射 三元组 存储 

I hope this explains the solution to your problem in detail. Thanks!

After the OP's comment, it seems they also want to convert the fullwidth Latin letters into ordinary ASCII letters. An external file that lists Unicode code points against character names, NamesList.txt, is used for the replacement; it can be found at this link.

 library(stringr)
 library(Unicode)  ## Unicode is a great library with useful functions such as u_char_from_name, used here
 rd_dt <- readLines("NamesList.txt", encoding = "UTF-8")
 ## Clean NamesList.txt, which lists Unicode values against the fullwidth Latin alphabet
 rd_dt1 <- rd_dt[grep("[[:alnum:]]{4}\t.*", rd_dt)]
 rd_dt1 <- read.delim(textConnection(rd_dt1), sep = "\t", stringsAsFactors = F)
 rd_dt1 <- rd_dt1[, 1:2]
 names(rd_dt1) <- c("UTF_8_values", "Symbol")
 rd_dt1 <- rd_dt1[grep("LATIN", rd_dt1$Symbol), ]
 rd_dt1 <- rd_dt1[grep("WIDTH", rd_dt1$Symbol), ]
 value <- substr(rd_dt1$Symbol, nchar(trimws(rd_dt1$Symbol)), nchar(trimws(rd_dt1$Symbol)))
 rd_dt1$value <- value
 ### Assign capital and small English letters to the corresponding fullwidth Latin capital and small letters
 letters <- grepl("CAPITAL", rd_dt1$Symbol) + 0
 capital_small <- ifelse(letters == 1, toupper(rd_dt1$value), tolower(rd_dt1$value))
 rd_dt1$capital_small <- capital_small
 rd_dt1 <- rd_dt1[, c(1, 2, 4)]
 ### From the OP's source, take the text that is not plain English but fullwidth Latin text
 dt <- c('SemanticMediaWikiPAUXIkeWiki')
 ### Check the OP's text content against the UTF-8 values from the text file
 as.u_char(utf8ToInt(dt)) %in% u_char_from_name(rd_dt1$Symbol)

Final answer for the conversion:

paste0(rd_dt1[match(utf8ToInt(dt),u_char_from_name(rd_dt1$Symbol)),"capital_small"],collapse="")

Result:

 > paste0(rd_dt1[match(utf8ToInt(dt),u_char_from_name(rd_dt1$Symbol)),"capital_small"],collapse="") [1] "SemanticMediaWikiPAUXIkeWiki" 

CAVEAT: the code above works well on macOS Sierra with R 3.3. On Windows, however, the RStudio console automatically converts everything to the corresponding English text, and I cannot see the UTF-8 code points behind these strings. I cannot determine the cause.

EDIT

I recently discovered that the stringi library has a function called stri_trans_general that can perform this task very efficiently. Once the spaces have been removed with the regular expression above, we can transliterate the fullwidth Latin alphabet directly with the code below:

 dt <- c('SemanticMediaWikiPAUXIkeWiki')
 stringi::stri_trans_general(dt, "latin-ascii")
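Putting the two steps together, here is one possible end-to-end sketch of my own (it assumes, as the answer above states, that the "latin-ascii" transliteration rule also maps fullwidth Latin letters to ASCII):

```r
library(stringi)

# Sample input of my own: fullwidth "Wiki" with a space after each letter
x <- " 语义 \uFF37 \uFF49 \uFF4B \uFF49 标注 "

# Step 1: drop the whitespace that follows each fullwidth Latin letter
x <- gsub("(?<=[\uFF21-\uFF3A\uFF41-\uFF5A])\\s*", "", x, perl = TRUE)

# Step 2: transliterate the fullwidth letters to plain ASCII
stri_trans_general(x, "latin-ascii")
```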

The answer remains the same as above.


You can solve this problem with two regular expressions. First, eliminate the single space between letters:

s/(\w)\s{1}/\1/g

Then replace two or more spaces between words with a single space:

s/\s{2,}/ /g

Applying these two regular expressions to the following text:

  T h i s  i s  a  t e s t  c a s e  f o r  m y  r e g e x
  W o r d s  c a n  b e  a r b i t r a r i l y  s p a c e d

gives:

  This is a test case for my regex
 Words can be arbitrarily spaced
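In R, the same two substitutions can be sketched with gsub (a sketch of my own, assuming \\w matches the letters in question):

```r
x <- "T h i s  i s  a  t e s t  c a s e  f o r  m y  r e g e x"
x <- gsub("(\\w)\\s{1}", "\\1", x)  # drop the single space that follows each letter
x <- gsub("\\s{2,}", " ", x)        # collapse any remaining runs of spaces
x
# [1] "This is a test case for my regex"
```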

Source: https://habr.com/ru/post/1265216/

