I want to keep intra-word hyphens in the text until it is tokenized. The strategy is to replace each hyphen with a unique symbol and then swap that symbol back to a hyphen after tokenization. Note: ultimately I use the Unicode class Pd to catch all forms of the dash character, but here I keep it simple, because I don't think that part is related to the problem.
Problem: it fails when a word contains multiple internal hyphens that are separated by a single character (for example, the "-o-" in "jack-o-lantern").
Examples and desired results:
replaceDash <- function(x) gsub("(\\w)-(\\w)", "\\1§\\2", x)
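replaceDash("Hawaii-Five-O")       # desired: "Hawaii§Five§O"
replaceDash("jack-of-all-trades")  # desired: "jack§of§all§trades"
replaceDash("A-bomb")              # desired: "A§bomb"
replaceDash("freakin-A")           # desired: "freakin§A"
replaceDash("jack-o-lantern")      # desired: "jack§o§lantern", but returns "jack§o-lantern"
replaceDash("Whack-a-Mole")        # desired: "Whack§a§Mole", but returns "Whack§a-Mole"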
What regular expression patterns do I need for the first and second arguments of gsub()?
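A likely explanation for the failure: each match of "(\\w)-(\\w)" consumes the word character on both sides of the hyphen, so in "jack-o-lantern" the "o" is used up by the first match and is no longer available as the left-hand character of the second hyphen. One possible fix, sketched here on the assumption that PCRE syntax (perl = TRUE) is acceptable, is to use lookbehind and lookahead assertions, which test the neighbouring characters without consuming them. The function name replaceDashLookaround is only illustrative.

```r
# Original pattern from the question: each match consumes the word
# characters on both sides of the hyphen.
replaceDash <- function(x) gsub("(\\w)-(\\w)", "\\1§\\2", x)

# Hypothetical alternative (the name is illustrative): the lookbehind
# (?<=\\w) and lookahead (?=\\w) only assert that a word character sits on
# each side, so the same character can serve two adjacent hyphens.
replaceDashLookaround <- function(x) {
  gsub("(?<=\\w)-(?=\\w)", "§", x, perl = TRUE)
}

words <- c("Hawaii-Five-O", "jack-of-all-trades", "A-bomb",
           "freakin-A", "jack-o-lantern", "Whack-a-Mole")

# Compare both versions side by side.
print(data.frame(
  input      = words,
  original   = replaceDash(words),
  lookaround = replaceDashLookaround(words)
))

# Hyphens that are not between two word characters are left alone.
replaceDashLookaround("-leading and trailing- stay")
```

Because the assertions capture nothing, the replacement is just "§" with no backreferences. After tokenization, gsub("§", "-", tokens, fixed = TRUE) should restore the hyphens.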