Where is this space hiding?

Question

I have a character vector that holds the text of a PDF, scraped with pdftotext (a command-line tool).

Everything is (blissfully) nicely lined up. However, the vector is riddled with a type of whitespace that eludes my regular expressions:

    > test
    [1] "Address:" "Clinic Information:" "Store " "351 South Washburn" "Aurora Quick Care"
    [6] "Info" "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718" "Pewaukee"
    > grepl("[0-9]+ [A-Za-z ]+",test)
    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    > dput(test)
    c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
    "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
    "Pewaukee")
    > test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
    + "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
    + "Pewaukee")
    > grepl("[0-9]+ [A-Za-z ]+",test.pasted)
    [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
    > Encoding(test)
    [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
    > Encoding(test.pasted)
    [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown"

Clearly there is some character in the vector that does not come through in the dput output, as in the following question:

How to use the international text?

I can't copy/paste the whole vector. How do I search for and destroy this space that is not a space?

Edit

Clearly I was not even close to clear, because the answers are all over the place. Here's an even simpler test case:

    > grepl("Clinic Information:", test[2])
    [1] FALSE
    > grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
    [1] TRUE

There is a single space between the words “Clinic” and “Information” both as printed on the screen and in the output of dput, but whatever is in the string itself is not a standard space. My goal is to eliminate it so that I can correctly match that element.
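One way to see what is actually in such a string is to look at its code points rather than its printed form. The following is a minimal sketch, not taken from the original session: the string `x` is a hypothetical stand-in for `test[2]`, built on the assumption that the hidden character is a no-break space.

```r
# Hypothetical stand-in for test[2]: assumes the odd character is U+00A0.
x <- "Clinic\u00a0Information:"

# Convert the string to a vector of Unicode code points.
cp <- utf8ToInt(x)

# Anything above 127 is non-ASCII and worth a closer look.
suspects <- cp[cp > 127]

# Format the suspicious code points in the usual U+XXXX notation.
labels <- sprintf("U+%04X", suspects)
print(labels)
```

Running the same steps on the real element (for example `utf8ToInt(test[2])`) would show whether the character between the two words is the ordinary space, code point 32, or something else.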

+6

regex r

Ari B. Friedman Jul 28 '12 at 16:40

4 answers

I think you have trailing and leading spaces. If so, perhaps this function will work:

 Trim <- function (x) gsub("^\\s+|\\s+$", "", x)

Also watch out for tabs and such, and this can be useful:

 clean <- function(text) { gsub("\\s+", " ", gsub("\r|\n|\t", " ", text)) }

Use clean and then Trim, as in:

 Trim(clean(test))

Also pay attention to the en dash (–) and the em dash (—).

+1

Tyler Rinker Jul 28 '12 at 16:49

I don't see anything unusual about the spaces, but the dashes in the phone number are U+2010 (HYPHEN), not the ASCII hyphen (U+002D).
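If the Unicode hyphens get in the way of later matching, one option is to fold every dash-like character into the ASCII hyphen. This is only a sketch: the `phone` string below is rebuilt by hand with explicit U+2010 escapes, and the use of the `\p{Pd}` (dash punctuation) category is a choice made for this illustration, not something from the thread.

```r
# Hand-built copy of the phone entry, with U+2010 written as explicit escapes.
phone <- "Phone: 920\u2010232\u20100718"

# The ASCII-hyphen pattern does not match the Unicode hyphens.
before <- grepl("[0-9]{3}-[0-9]{3}-[0-9]{4}", phone)

# Replace any character in the Unicode "dash punctuation" category with "-".
normalized <- gsub("\\p{Pd}", "-", phone, perl = TRUE)

# After normalization the plain pattern matches.
after <- grepl("[0-9]{3}-[0-9]{3}-[0-9]{4}", normalized)
```

Note that `\p{Pd}` also covers the en dash and em dash, so this replaces them as well, which may or may not be wanted for other kinds of text.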

+1

Alan Moore Jul 28 '12 at 17:41

    test <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
    "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
    "Pewaukee")
    > grepl("[0-9]+ [A-Za-z ]+",test)
    [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
    library(stringr)
    test2 <- str_trim(test, side = "both")
    > grepl("[0-9]+ [A-Za-z ]+",test2)
    [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
    # So there were no spaces in the vector, just the screen output in this case.

0

Maiasaura Jul 28 '12 at 17:09

Alan Curry · Accepted Answer · 2012-07-28T20:51:11+0000

Upgrading my comment to an answer:

Your string contains a no-break space (U+00A0), which was converted to a normal space when you pasted it. Matching all the weird space-like characters in Unicode is easy with a Perl-style regular expression:

 grepl("[0-9]+\\p{Zs}[A-Za-z ]+", test, perl=TRUE)

The Perl regexp syntax is \p{categoryName}; the extra backslash is part of R's syntax for a string containing a backslash; and "Zs" is the Unicode "Separator" category, "space" subcategory. A simpler method that handles only the U+00A0 character would be

 grepl("[0-9]+[ \\xa0][A-Za-z ]+", test)
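To eliminate the odd spaces rather than just match around them, the same category can be used with gsub. The sketch below is an illustration under assumptions: the vector `scraped` is a shortened, hypothetical version of `test`, and the positions of the U+00A0 characters are guesses, since the real ones cannot be seen in the printed output.

```r
# Hypothetical, shortened stand-in for `test`; where the no-break spaces
# (U+00A0) actually sit in the real data is an assumption.
scraped <- c("Address:", "Clinic\u00a0Information:", "351\u00a0South Washburn")

# Replace every Unicode space separator (category Zs) with a plain space.
cleaned <- gsub("\\p{Zs}", " ", scraped, perl = TRUE)

# The original pattern, written with an ordinary space, now finds the address.
hits <- grepl("[0-9]+ [A-Za-z ]+", cleaned)

# Literal matching on the cleaned element works as well.
clinic <- grepl("Clinic Information:", cleaned[2], fixed = TRUE)
```

Cleaning once up front means later patterns can be written with ordinary spaces, at the cost of losing the distinction between the original kinds of space.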
