I have a character vector holding the text of a PDF that I scraped with pdftotext (the command line tool).
Everything is (blissfully) well-behaved. However, the vector is riddled with a type of whitespace that eludes my regular expressions:
> test
[1] "Address:" "Clinic Information:" "Store " "351 South Washburn" "Aurora Quick Care"
[6] "Info" "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718" "Pewaukee"
> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> dput(test)
c("Address:", "Clinic Information:", "Store ", "351 South Washburn", "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", "Pewaukee")
> test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
+ "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
+ "Pewaukee")
> grepl("[0-9]+ [A-Za-z ]+",test.pasted)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
> Encoding(test)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(test.pasted)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown"
Clearly there is some character in the vector that does not survive the round trip through dput, as in the following question:
How to use the international text?
I can't copy/paste the whole vector... How can I search for and destroy these spaces that are not ordinary spaces?
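To make the problem concrete, here is a minimal sketch of how such a character could be exposed. It assumes the culprit is a no-break space (U+00A0), which is only a guess and is not confirmed by anything above; `x` is a hypothetical stand-in for one scraped element.

```r
# Hypothetical stand-in for one scraped element. The "\u00a0" (no-break
# space) is an ASSUMPTION about what pdftotext emitted, not a confirmed fact.
x <- "351\u00a0South\u00a0Washburn"

# Raw bytes are encoding-agnostic: an ordinary space is 20,
# while a UTF-8 no-break space shows up as the pair c2 a0.
charToRaw(x)

# Code points of each character (valid here because x is UTF-8).
cp <- utf8ToInt(x)

# Keep only the non-ASCII code points and print them as U+XXXX.
suspects <- unique(cp[cp > 127])
sprintf("U+%04X", suspects)
```

On the real vector, `charToRaw(test[4])` would be the first thing to look at, since it does not depend on how the string's encoding is marked.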
Edit
Clearly I wasn't even close to being clear, because the answers are all over the place. Here's an even simpler test case:
> grepl("Clinic Information:", test[2])
[1] FALSE
> grepl("Clinic Information:", "Clinic Information:")
A single space appears between “Clinic” and “Information” both on screen and in the output of dput, but whatever is actually in the string is not a standard space. My goal is to eliminate it so that I can grep that element correctly.
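For illustration, the kind of cleanup I am after might look like the sketch below. It again assumes the odd character is a no-break space (U+00A0), which I have not verified; `dirty` is a hypothetical stand-in for `test[2]` and `normalize_spaces` is a made-up helper name.

```r
# Hypothetical stand-in for test[2]; the "\u00a0" is an assumption.
dirty <- "Clinic\u00a0Information:"

# Made-up helper: swap every no-break space for an ordinary space.
# fixed = TRUE treats the pattern as a literal string, not a regex.
normalize_spaces <- function(x) {
  gsub("\u00a0", " ", x, fixed = TRUE)
}

clean <- normalize_spaces(dirty)

# The literal match fails before cleaning and succeeds after it.
before <- grepl("Clinic Information:", dirty, fixed = TRUE)
after  <- grepl("Clinic Information:", clean, fixed = TRUE)
```

If the offending character turns out to be something other than U+00A0, the literal in `normalize_spaces` would have to change accordingly.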