How to replace words between two punctuation

Question

I have a dataset that looks like this

sentence <- "active ingredients: avobenzone, octocrylene, octyl salicylate. other stuff inactive ingredients: water, glycerin, edta."

And I'm trying to get

  "avobenzone, octocrylene, octyl salicylate, water, glycerin, edta."

The logic I have in mind, in plain English, is: match everything between a punctuation mark and the colon and remove it, OR match everything between the beginning of the line and the colon and delete it. I used gsub in R and got to this:

  gsub("([:punct:][^:]*:)|^([^:]*:)", "", sentence)

but my result is ...

  [1] " avobe water, glycerin, edta."

Why did it match everything from the first word to the last colon instead of the first? Can someone point me in the right direction to understand this logic?

Thanks!

+5

regex r

sir_chocolate_soup Mar 21 '18 at 10:35

1 answer

G5w · Accepted Answer · 2018-03-21T22:41:41+0000

At least one way:

 gsub(".*?:\\s*(.*?)\\.", "\\1, ", sentence)
 [1] "avobenzone, octocrylene, octyl salicylate, water, glycerin, edta, "

Note the ? after .* in the pattern. It makes the match non-greedy. Without the ?, .* matches as much as possible.
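To see the difference the ? makes, here is a small sketch on a made-up string (x, greedy and lazy are illustrative names, not from the question). It uses sub rather than gsub so that only the first match is replaced:

```r
# Illustrative string, not taken from the question
x <- "a: b. c: d."

# Greedy: .* runs as far as it can, so the match ends at the LAST colon
greedy <- sub(".*:", "", x)
print(greedy)
# [1] " d."

# Non-greedy: .*? stops as early as it can, at the FIRST colon
lazy <- sub(".*?:", "", x)
print(lazy)
# [1] " b. c: d."
```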

Addition

The idea is to replace everything except the part you want with nothing. You said you wanted to stop at punctuation, but you obviously did not want to stop at the commas, so I took the liberty of interpreting the problem as finding the parts of the string between a colon and a period.

In my expression, .*?: matches everything up to the first colon. I threw in \\s* to strip any spaces that follow the colon. We want everything after that up to the next period, which is represented by .*?\\. BUT we want to keep that part (without the period), so I put the .*? in parentheses to make it a "capture group". Because it is in parens, whatever lies between the colon and the period can be referred to in the replacement as \1 (but you must type \\1 in R to get the string \1). I also added ", " after \\1 in the replacement to separate it from what comes next.

So this will take

  active ingredients: avobenzone, octocrylene, octyl salicylate.

and replace it with

  avobenzone, octocrylene, octyl salicylate,

Since I used gsub (global substitution), it then carries on and does the same with the rest of the string, replacing

  other stuff inactive ingredients: water, glycerin, edta.

with

  water, glycerin, edta,

Sorry about the ugly trailing ", ".
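One possible way to tidy up the trailing separator is a second, anchored substitution. This is only a sketch: the names items, cleaned and original are made up for illustration, and it assumes base R's default regex engine (no perl = TRUE). The last lines revisit the pattern from the question, since the output reported there is consistent with [:punct:] being read as a plain set of letters when it is not wrapped in a second pair of brackets.

```r
sentence <- "active ingredients: avobenzone, octocrylene, octyl salicylate. other stuff inactive ingredients: water, glycerin, edta."

# The substitution from the answer: keep only the text between each
# colon and the following period, and append ", " after every piece
items <- gsub(".*?:\\s*(.*?)\\.", "\\1, ", sentence)

# Swap the trailing ", " for the final period the question asked for;
# the $ anchor means only the comma at the very end can match
cleaned <- sub(",\\s*$", ".", items)
print(cleaned)
# [1] "avobenzone, octocrylene, octyl salicylate, water, glycerin, edta."

# Side note on the original attempt: outside a bracket expression,
# [:punct:] is treated as the set of characters : p u n c t, so a match
# can begin at the "n" in "avobenzone". The POSIX class is written
# [[:punct:]]. The question reports " avobe water, glycerin, edta." here.
original <- gsub("([:punct:][^:]*:)|^([^:]*:)", "", sentence)
print(original)
```

Writing [[:punct:]] fixes the mid-word match, but on its own it does not produce the desired result, because a comma is punctuation too. That is why the answer anchors on the colon and the period instead.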
