How to replace hyphens inside a word when hyphens surround one internal character

I want to keep intra-word hyphens in the text until it is tokenized. The strategy is to replace those hyphens with a unique symbol, and then replace that symbol with a hyphen after tokenization. Note: ultimately I will use the Unicode class Pd to catch all forms of the dash character, but here I keep it simple, because I don't think that part is related to the problem.

Problem: it fails when a word contains two internal hyphens separated by a single character.

Examples and desired results:

replaceDash <- function(x) gsub("(\\w)-(\\w)", "\\1§\\2", x)

# these are all OK
replaceDash("Hawaii-Five-O")  
## [1] "Hawaii§Five§O"
replaceDash("jack-of-all-trades")  
## [1] "jack§of§all§trades"
replaceDash("A-bomb")         
## [1] "A§bomb"
replaceDash("freakin-A")      
## [1] "freakin§A"

# not the desired outcome
replaceDash("jack-o-lantern")  # FAILS - should be "jack§o§lantern"
## [1] "jack§o-lantern"
replaceDash("Whack-a-Mole")    # FAILS - should be "Whack§a§Mole"
## [1] "Whack§a-Mole"

What pattern and replacement do I need as the first and second arguments to gsub()?


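You need a PCRE regex with a lookahead, so that the word character after the hyphen is checked but not consumed and stays available for the next match: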

replaceDash <- function(x) gsub("(\\w)-(?=\\w)", "\\1§", x, perl = TRUE)


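Here, (\\w) is capturing group 1, which the \\1 backreference puts back in the replacement, and (?=\\w) is a lookahead that only checks for a word character after the hyphen without consuming it, so that character can begin the next match.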
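
To see why consuming the second character matters, the matches themselves can be listed. This is only an illustrative sketch: the variable names `consumed` and `peeked` are made up for the example, and the patterns are the ones from the question and from this answer with the capture groups dropped.

```r
x <- "jack-o-lantern"

# Original pattern: each match consumes the character after the hyphen,
# so "k-o" is matched and the "o" cannot start a second match "o-l".
consumed <- regmatches(x, gregexpr("\\w-\\w", x))[[1]]
consumed
## [1] "k-o"

# Lookahead pattern: the character after the hyphen is only checked,
# not consumed, so both hyphens are found.
peeked <- regmatches(x, gregexpr("\\w-(?=\\w)", x, perl = TRUE))[[1]]
peeked
## [1] "k-" "o-"
```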


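Alternatively, use a lookbehind together with a lookahead, so that only the hyphen itself is matched and no backreference is needed. Lookarounds require perl = TRUE:

gsub("(?<=\\w)-(?=\\w)", "§", "jack-o-lantern", perl = TRUE)
## [1] "jack§o§lantern"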
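
Putting the pieces together, the round trip described in the question (protect, tokenize, restore) might look like the sketch below. It makes several assumptions: the helper names `protectDash`, `restoreDash` and `tokenize` are hypothetical, the `strsplit()` call is only a stand-in for whatever tokenizer is really used, the placeholder is assumed never to occur in the input, and `\\p{Pd}` relies on the PCRE build supporting Unicode properties.

```r
# Placeholder assumed not to occur anywhere in the input text.
placeholder <- "§"

# Protect any dash of Unicode class Pd that sits between two word characters.
protectDash <- function(x) {
  gsub("(?<=\\w)\\p{Pd}(?=\\w)", placeholder, x, perl = TRUE)
}

# Put a plain hyphen back once tokenization is done.
restoreDash <- function(x) gsub(placeholder, "-", x, fixed = TRUE)

# Stand-in tokenizer (an assumption): split on runs of whitespace,
# common punctuation and free-standing hyphens.
tokenize <- function(x) strsplit(x, "[\\s,.;:!?-]+", perl = TRUE)[[1]]

txt <- "Try Whack-a-Mole - then carve a jack-o-lantern."
tokens <- tokenize(protectDash(txt))
restoreDash(tokens)
## [1] "Try"            "Whack-a-Mole"   "then"           "carve"
## [5] "a"              "jack-o-lantern"
```

The free-standing hyphen in " - " is not protected, because it has no word character on either side, so the stand-in tokenizer discards it while the intra-word hyphens survive.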

Source: https://habr.com/ru/post/1626443/

