There are a number of unanswered questions regarding the problem domain. For the rest, let me use the following data, which contains the sample data from the question as positive matches plus some additional samples as negative matches (I use R version 2.14.1 (2011-12-22)):
x <- c("140,000 mostly freeway miles", "173k commuter miles. ", "154K(all highway) miles", "1,24 almost but not mostly freeway miles", "1,2,3,4K MILES")
"1,2,3,4K MILES" is added as a negative match because the question asks for words about 1-3 words apart, and this entry has zero words between the number and "MILES".
If we use the following ...
sub('[\\d,]+k?\\s+(([^\\s]+\\s+){1,3})miles', '\\1', x, ignore.case = TRUE, perl = TRUE)
... we get:
[1] "mostly freeway "
[2] "commuter . "
[3] "154K(all highway) miles"
[4] "1,24 almost but not mostly freeway miles"
[5] "1,2,3,4K MILES"
This is probably not the result you want. Because the data is not normalized, a single regex pattern that handles every case would be very complex. As Justin suggests in his answer, it is easier to clean up the data first and then do some simpler matching.
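One reason the output above is hard to read is that sub() returns its input unchanged when the pattern does not match, so a failed match and a deliberate negative look the same. A minimal sketch, assuming the same x and pattern as above (the variable names pattern and matched are mine), that makes the match status explicit with grepl():

```r
x <- c("140,000 mostly freeway miles", "173k commuter miles. ",
       "154K(all highway) miles", "1,24 almost but not mostly freeway miles",
       "1,2,3,4K MILES")

# Same single-pass pattern as above, stored in a hypothetical variable.
pattern <- '[\\d,]+k?\\s+(([^\\s]+\\s+){1,3})miles'

# TRUE where the pattern matched, FALSE where sub() left the input untouched.
matched <- grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
matched
# [1]  TRUE  TRUE FALSE FALSE FALSE
```

Element 3 is a positive sample that the pattern misses because "(" follows the "K" instead of whitespace, and element 2 matches but keeps the stray ". " after "miles".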
You can normalize the data as follows:
y <- gsub('\\pP+', ' ', x, perl = TRUE)              # punctuation -> space
y <- gsub('\\s+', ' ', y, perl = TRUE)               # collapse whitespace
y <- gsub('^\\s+|\\s+$', '', y, perl = TRUE)         # trim both ends
y <- gsub('(\\d)\\s(?=\\d)', '\\1', y, perl = TRUE)  # rejoin split digit groups
This basically replaces punctuation with spaces, ensures that words are separated by exactly one space, trims both ends, and rejoins digit groups that were split where the commas used to be. See the links below for more information. This will leave you with y of:
[1] "140000 mostly freeway miles"
[2] "173k commuter miles"
[3] "154K all highway miles"
[4] "124 almost but not mostly freeway miles"
[5] "1234K MILES"
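To see what each normalization step contributes, here is a small trace over the three positive samples. It is only an illustration: the names s, s1 ... s4 are mine, and the steps are the same gsub() calls as above, kept in separate variables so the intermediate results can be inspected.

```r
s <- c("140,000 mostly freeway miles", "173k commuter miles. ",
       "154K(all highway) miles")

# Step 1: every run of punctuation becomes a single space.
s1 <- gsub('\\pP+', ' ', s, perl = TRUE)
# "140 000 mostly freeway miles" "173k commuter miles  " "154K all highway  miles"

# Step 2: collapse runs of whitespace into one space.
s2 <- gsub('\\s+', ' ', s1, perl = TRUE)
# "140 000 mostly freeway miles" "173k commuter miles " "154K all highway miles"

# Step 3: trim leading and trailing whitespace.
s3 <- gsub('^\\s+|\\s+$', '', s2, perl = TRUE)
# "140 000 mostly freeway miles" "173k commuter miles" "154K all highway miles"

# Step 4: drop a space that sits between two digits (the former comma).
s4 <- gsub('(\\d)\\s(?=\\d)', '\\1', s3, perl = TRUE)
# "140000 mostly freeway miles" "173k commuter miles" "154K all highway miles"
```

The lookahead in step 4 matters: it leaves the second digit unconsumed, so a chain such as "1 2 3 4K" collapses fully to "1234K" in one gsub() pass.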
Now blank out the entries that do not match what you are looking for:
y <- sub('^(?!\\d+k?\\s((?!miles)[^\\s]+\\s){1,3}miles).*$', '', y, ignore.case = TRUE, perl = TRUE)
y
[1] "140000 mostly freeway miles" "173k commuter miles"
[3] "154K all highway miles"      ""
[5] ""
Finally, get the "close words":
y <- sub('^\\d+k?\\s((?!miles)[^\\s]+(\\s(?!miles)[^\\s]+){0,2})\\smiles', '\\1', y, ignore.case = TRUE, perl = TRUE)
y
[1] "mostly freeway" "commuter"       "all highway"    ""
[5] ""
There are probably simpler ways to normalize the data, but this gives you some examples of regular expressions that you can play with.
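The steps above can also be folded into one function. This is a sketch rather than part of the original approach: the name close_words is hypothetical, it returns NA instead of "" for entries that do not match, and I appended '.*$' to the extraction pattern so that any text after "miles" is dropped (the pattern above would leave it in place).

```r
# Hypothetical helper: normalize the input, then extract the 1-3 words
# between the mileage figure and "miles". Non-matching entries become NA.
close_words <- function(x) {
  y <- gsub('\\pP+', ' ', x, perl = TRUE)              # punctuation -> space
  y <- gsub('\\s+', ' ', y, perl = TRUE)               # collapse whitespace
  y <- gsub('^\\s+|\\s+$', '', y, perl = TRUE)         # trim both ends
  y <- gsub('(\\d)\\s(?=\\d)', '\\1', y, perl = TRUE)  # rejoin digit groups

  # Extraction pattern from above, plus '.*$' (my addition) to consume the rest.
  pattern <- '^\\d+k?\\s((?!miles)[^\\s]+(\\s(?!miles)[^\\s]+){0,2})\\smiles.*$'
  hit <- grepl(pattern, y, ignore.case = TRUE, perl = TRUE)

  out <- rep(NA_character_, length(x))
  out[hit] <- sub(pattern, '\\1', y[hit], ignore.case = TRUE, perl = TRUE)
  out
}

x <- c("140,000 mostly freeway miles", "173k commuter miles. ",
       "154K(all highway) miles", "1,24 almost but not mostly freeway miles",
       "1,2,3,4K MILES")
result <- close_words(x)
result
# [1] "mostly freeway" "commuter"       "all highway"    NA
# [5] NA
```

Using grepl() for the match test makes the separate blanking step unnecessary, since the extraction pattern already rejects the negative samples.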
For more information see