Unescape unicode in character string

There is a longstanding bug in RJSONIO when parsing JSON strings that contain Unicode escape sequences. It seems the bug will eventually be fixed in libjson, but that may not happen any time soon, so I'm looking for a workaround in R that unescapes the \uxxxx sequences before feeding the string to the JSON parser.

Some context: JSON data is always Unicode, with UTF-8 as the default encoding, so there is usually no need for escaping. For historical reasons, however, JSON also supports Unicode escape sequences. Hence the JSON data

 {"x" : "Zürich"} 

and

 {"x" : "Z\u00FCrich"} 

are equivalent and should lead to exactly the same result when parsed. But for some reason the latter does not work in RJSONIO. Additional confusion is caused by the fact that R itself also supports Unicode escapes. So when we type "Z\u00FCrich" into the R console, it is automatically converted to "Zürich". To get the actual JSON string at hand, we need to escape the backslash that starts the Unicode escape sequence in the JSON:

 test <- '{"x" : "Z\\u00FCrich"}' cat(test) 

So my question is: given a large JSON string in R, how can I unescape all Unicode escape sequences? That is, how do I replace every occurrence of \uxxxx with the corresponding Unicode character? Note that \uxxxx here is a literal 6-character string starting with a backslash. The unescape function must therefore satisfy:

 # Escaped string
 escaped <- "Z\\u00FCrich"

 # Unescape unicode
 unescape(escaped) == "Zürich"

 # This is the same thing
 unescape(escaped) == "Z\u00FCrich"

One thing that complicates matters is that a backslash which is itself escaped in the JSON by another backslash is not part of a Unicode escape sequence. For instance, unescape should also satisfy:

 # Watch out for escaped backslashes
 unescape("Z\\\\u00FCrich") == "Z\\\\u00FCrich"
 unescape("Z\\\\\\u00FCrich") == "Z\\\\ürich"
3 answers

After playing around with this, I think the best I can do is find the \uxxxx patterns with a regular expression and then parse them using the R parser:

 unescape_unicode <- function(x){
   # single string only
   stopifnot(is.character(x) && length(x) == 1)

   # find one or more backslashes followed by u and 4 hex digits
   m <- gregexpr("(\\\\)+u[0-9a-f]{4}", x, ignore.case = TRUE)

   if(m[[1]][1] > -1){
     # parse each match with the R parser, then re-escape any remaining backslashes
     p <- vapply(regmatches(x, m)[[1]], function(txt){
       gsub("\\", "\\\\", parse(text = paste0('"', txt, '"'))[[1]], fixed = TRUE, useBytes = TRUE)
     }, character(1), USE.NAMES = FALSE)

     # substitute parsed values back into the original string
     regmatches(x, m) <- list(p)
   }

   x
 }

This seems to work for all cases, and I haven't found any odd side effects yet.
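As a quick, unverified sanity check, here is how I would expect it to behave on the test cases from the question (expected console output shown as comments):

 unescape_unicode("Z\\u00FCrich")
 ## [1] "Zürich"
 unescape_unicode("Z\\\\u00FCrich")
 ## [1] "Z\\\\u00FCrich"
 unescape_unicode("Z\\\\\\u00FCrich")
 ## [1] "Z\\\\ürich"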


The stringi package has a function:

 require(stringi)
 escaped <- "Z\\u00FCrich"
 escaped
 ## [1] "Z\\u00FCrich"
 stri_unescape_unicode(escaped)
 ## [1] "Zürich"
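
A minimal sketch of how this might slot into the original workflow, assuming RJSONIO::fromJSON is the parser being worked around:

 library(stringi)
 library(RJSONIO)

 test <- '{"x" : "Z\\u00FCrich"}'
 # unescape the \uxxxx sequences first, then hand the result to the parser;
 # this should yield x = "Zürich"
 fromJSON(stri_unescape_unicode(test))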

Maybe something like this?

 \"x\"\s:\s\"([^"]*?)\" 

It doesn't look at the individual characters; it just captures everything up to the closing quote.
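
A rough sketch of trying that pattern from R, taking the regex as given; note that this only extracts the still-escaped value of "x", it does not unescape anything:

 test <- '{"x" : "Z\\u00FCrich"}'
 m <- regexec("\"x\"\\s:\\s\"([^\"]*?)\"", test, perl = TRUE)
 # second element is the captured group
 regmatches(test, m)[[1]][2]
 ## [1] "Z\\u00FCrich"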


Source: https://habr.com/ru/post/972821/