
How to remove unicode <U+00A6> from a string?

I have a string like:

q <- "<U+00A6>  1000-66329"

I want to delete <U+00A6> and get only 1000 66329.

I tried using:

gsub("\u00a6", " ", q, perl = TRUE)

But it doesn't remove anything. How do I use gsub to get only 1000 66329?

+4
4 answers

I just want to remove the unicode <U+00A6> that is at the beginning of the string.

Then you do not need gsub; you can use sub with the pattern "^\\s*<U\\+\\w+>\\s*":

q <- "<U+00A6>  1000-66329"
sub("^\\s*<U\\+\\w+>\\s*", "", q)

Pattern details:

  • ^ - start of the string
  • \\s* - zero or more whitespace characters
  • <U\\+ - the literal character sequence <U+
  • \\w+ - one or more letters, digits or underscores
  • > - a literal >
  • \\s* - zero or more whitespace characters
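To see what the anchoring does in practice, here is a minimal sketch that applies the same pattern to three inputs. The variable names and the two extra test strings are made up for illustration; only q comes from the question.

```r
q <- "<U+00A6>  1000-66329"

# Anchored pattern: strips only a leading <U+XXXX> token and the whitespace around it.
pattern <- "^\\s*<U\\+\\w+>\\s*"

res_leading <- sub(pattern, "", q)                        # token at the start: removed
res_padded  <- sub(pattern, "", "   <U+00A6>1000-66329")  # leading spaces are tolerated
res_middle  <- sub(pattern, "", "1000 <U+00A6> 66329")    # token not at the start: unchanged

print(c(res_leading, res_padded, res_middle))
```

Because of the ^ anchor, a <U+00A6> token that appears later in the string is left alone; dropping the anchor and switching to gsub would remove every occurrence.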

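If you also need to replace the hyphen with a space, add the |- alternative and use gsub with a space as the replacement (the same approach as in akrun's answer):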

trimws(gsub("^\\s*<U\\+\\w+>|-", " ", q))


+2

If it is a single character at the start of the string, you can simply drop the first character:

substring("\U00A6 1000-66B29", 2)

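R may print this as <U+00A6> 1000-66329 rather than ¦ 1000-66B29. If, however, <U+00A6> is the literal text "<U+00A6>" and not the character, skip its eight characters instead: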

substring("<U+00A6>  1000-66329",9)

Output:

[1] "  1000-66329"
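The difference between the real character and its printed escape can be checked directly. The following is a small sketch, with hypothetical variable names, that compares the two cases and shows why a pattern of "\u00a6" leaves the literal text untouched:

```r
real    <- "\u00a6  1000-66329"    # starts with the actual U+00A6 character
literal <- "<U+00A6>  1000-66329"  # starts with eight ASCII characters spelling the escape

# The escape text is seven characters longer than the single character it stands for.
n_real    <- nchar(real)
n_literal <- nchar(literal)

# fixed = TRUE treats the pattern as plain text, so "+" needs no escaping.
has_real_char    <- grepl("\u00a6", real, fixed = TRUE)
has_literal_text <- grepl("<U+00A6>", literal, fixed = TRUE)

# Searching the literal text for the real character finds nothing to replace.
unchanged <- gsub("\u00a6", " ", literal)

print(c(n_real, n_literal))
print(identical(unchanged, literal))
```

If nchar reports the longer length, the string contains the escape as text, and a regex on "<U\\+00A6>" or substring from position 9 is the appropriate tool.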
+2

trimws(gsub("\\S+\\s+|-", " ", q))
#[1] "1000 66329"
+2

Instead of deleting it, you should convert it to the appropriate format. Set your locale to UTF-8 as follows:

Sys.setlocale("LC_CTYPE", "en_US.UTF-8")

You may see the following message:

Warning message:
In Sys.setlocale("LC_CTYPE", "en_US.UTF-8") :
  OS reports request to set locale to "en_US.UTF-8" cannot be honored

In this case, you should use stringi::stri_trans_general(x, "zh").

Here, "zh" means Chinese. You need to know which language you are converting to.

0

Source: https://habr.com/ru/post/1657472/

