Where is this space hiding?

I have a character vector holding the text of a PDF that was scraped with pdftotext (a command-line tool).

Everything is (blissfully) nicely lined up. However, the vector is riddled with a kind of whitespace that eludes my regular expressions:

    > test
    [1] "Address:" "Clinic Information:" "Store " "351 South Washburn" "Aurora Quick Care"
    [6] "Info" "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718" "Pewaukee"
    > grepl("[0-9]+ [A-Za-z ]+",test)
    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    > dput(test)
    c("Address:", "Clinic Information:", "Store ", "351 South Washburn", "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", "Pewaukee")
    > test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
    + "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
    + "Pewaukee")
    > grepl("[0-9]+ [A-Za-z ]+",test.pasted)
    [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
    > Encoding(test)
    [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
    > Encoding(test.pasted)
    [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown"

Clearly there is some character in the vector that dput does not reveal, as in the following question:

How to use the international text?

I can't copy/paste the whole vector... How can I search for and destroy these spaces that are not really spaces?

Edit

Clearly I was nowhere near clear, because the answers are all over the place. Here's an even simpler test case:

    > grepl("Clinic Information:", test[2])
    [1] FALSE
    > grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
    [1] TRUE

A single space appears between the words “Clinic” and “Information” both on the screen and in the output of dput, but whatever is actually stored in the string is not a standard space. My goal is to eliminate it so that I can match that element correctly.

4 answers

Upgrading my comment to an answer:

Your string contains a non-breaking space (U+00A0), which was converted to a normal space when you pasted it. Matching all the odd space-like characters in Unicode is easy with a Perl-style regular expression:

 grepl("[0-9]+\\p{Zs}[A-Za-z ]+", test, perl=TRUE) 

The Perl regex syntax is \p{categoryName}; the additional backslash is R string-literal syntax for a string containing a backslash, and "Zs" is the Unicode "Separator" category, "space" subcategory. A simpler approach that handles only U+00A0 would be

 grepl("[0-9]+[ \\xa0][A-Za-z ]+", test) 
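To actually find and remove the hidden characters rather than only match around them, something along these lines may help. It is a minimal sketch on a made-up vector `x` (hypothetical data, not the asker's `test`), and it assumes R's PCRE build has Unicode property support, which is the usual case.

```r
# Hypothetical sample: "\u00a0" is a non-breaking space that prints like " ".
x <- c("351\u00a0South Washburn", "Clinic\u00a0Information:", "Pewaukee")

# Inspect the code points of one element; 160 (0xA0) exposes the hidden NBSP.
code_points <- utf8ToInt(x[1])

# Flag the elements that contain a non-breaking space.
has_nbsp <- grepl("\u00a0", x, fixed = TRUE)

# Replace every Unicode space separator (category Zs) with an ASCII space.
cleaned <- gsub("\\p{Zs}", " ", x, perl = TRUE)

# The original pattern with a plain space can now be applied to the cleaned vector.
matches <- grepl("[0-9]+ [A-Za-z ]+", cleaned)
```

Replacing the whole Zs category instead of only U+00A0 also covers other look-alike spaces that PDF extraction can produce.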

I think you are dealing with trailing and leading whitespace. If so, perhaps this function will work:

 Trim <- function (x) gsub("^\\s+|\\s+$", "", x) 

Also watch out for tabs and the like; this may be useful:

 clean <- function(text) { gsub("\\s+", " ", gsub("\r|\n|\t", " ", text)) } 

Use clean and then Trim, as in:

 Trim(clean(test)) 

Also pay attention to the en dash (–) and the em dash (—).


I don't see anything unusual about the spaces, but the dashes in the phone number are U+2010 (HYPHEN), not the ASCII hyphen (U+002D).
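If those hyphens get in the way of later matching, they can be normalized as well. This is a minimal sketch on a hypothetical string `phone`; it maps every character in the Unicode "dash punctuation" category (Pd), which includes U+2010 as well as the en and em dashes, to the ASCII hyphen.

```r
# Hypothetical sample: the separators are U+2010 (HYPHEN), not U+002D.
phone <- "Phone: 920\u2010232\u20100718"

# Map any dash-punctuation character (category Pd) to an ASCII hyphen.
fixed <- gsub("\\p{Pd}", "-", phone, perl = TRUE)

# An ASCII-only pattern now matches the phone number.
is_phone <- grepl("[0-9]{3}-[0-9]{3}-[0-9]{4}", fixed)
```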

    > test <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn", "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", "Pewaukee")
    > grepl("[0-9]+ [A-Za-z ]+",test)
    [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
    > library(stringr)
    > test2 <- str_trim(test, side = "both")
    > grepl("[0-9]+ [A-Za-z ]+",test2)
    [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
    # So there were no spaces in the vector, just the screen output in this case.

Source: https://habr.com/ru/post/921571/

