R regular expression (related words)

I was wondering how to select words that are near each other using regular expressions. For example, I would like to select the numbers and the word miles from the following phrases:

"140,000 mostly freeway miles" "173k commuter miles. " "154K(all highway) miles"

I don't know how to fill in the optional words in the middle:

 [0-9]+ ???? miles 

* "near" can be defined as 1-3 words apart. Thanks for looking into this.

4 answers

Here is an answer in R. The other answers may work with some changes. Basically, they need "double escapes" (for example \\d instead of \d inside an R string), and you will have to use the function pair regexpr and regmatches.

 x = c("140,000 mostly freeway miles",
       "173k commuter miles. ",
       "154K(all highway) miles")
 gsub('([[:digit:][:punct:]k]+).*(miles).*', '\\1 \\2', x, ignore.case=TRUE)
 # [1] "140,000 miles" "173k miles" "154 miles"

This captures a run of digits, punctuation, or k in group 1, followed by anything. Then comes group 2, which is the word miles, followed by anything else.

You could also use more "regular" regular expression syntax:

 gsub('([0-9,k]+).*(miles).*', '\\1 \\2', x, ignore.case=TRUE) 

However, I would clean the data first and then do some simpler matching (e.g. tolower and removing punctuation).
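A minimal sketch of that clean-first idea, under my own assumptions about what "clean" means here: lower-case everything, turn punctuation into spaces, and re-join digit groups that a removed comma split apart. The names clean and result are only illustrative.

```r
x <- c("140,000 mostly freeway miles",
       "173k commuter miles. ",
       "154K(all highway) miles")

# Lower-case, then turn every run of punctuation into a single space.
clean <- gsub("[[:punct:]]+", " ", tolower(x))

# Re-join digits that the removed comma split apart ("140 000" -> "140000").
# The lookahead needs perl = TRUE.
clean <- gsub("([0-9]) +(?=[0-9])", "\\1", clean, perl = TRUE)

# On cleaned input a plainer pattern is enough: leading number, optional k,
# anything in between, then the whole word "miles".
result <- sub("^([0-9]+k?).*\\b(miles)\\b.*$", "\\1 \\2", clean, perl = TRUE)
print(result)
```

Once the input is lower-cased, the pattern no longer needs ignore.case, and group 1 cannot pick up stray punctuation.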


There are a number of unanswered questions about the problem domain. For the rest of this answer I use the following data, which contains the sample data from the question as positive matches and some additional sample data as negative matches (I am using R version 2.14.1 (2011-12-22)):

 x <- c("140,000 mostly freeway miles",
        "173k commuter miles. ",
        "154K(all highway) miles",
        "1,24 almost but not mostly freeway miles",
        "1,2,3,4K MILES")

1,2,3,4K MILES was added as a negative match because the question defines "near" as 1-3 words apart, and this entry has zero "near words".

If we use the following ...

 sub('[\\d,]+k?\\s+(([^\\s]+\\s+){1,3})miles', '\\1', x, ignore.case = TRUE, perl = TRUE) 

... we get:

 [1] "mostly freeway "
 [2] "commuter . "
 [3] "154K(all highway) miles"
 [4] "1,24 almost but not mostly freeway miles"
 [5] "1,2,3,4K MILES"

This is probably not the result you want. Because the data is not normalized, a regex pattern that handles it directly would have to be very complex. As Justin suggests in his answer, clean up the data first, then do some simpler matching.

You can normalize the data as follows:

 y <- gsub('\\pP+', ' ', x, perl = TRUE)              # punctuation runs -> one space
 y <- gsub('\\s+', ' ', y, perl = TRUE)               # collapse repeated whitespace
 y <- gsub('^\\s+|\\s+$', '', y, perl = TRUE)         # trim both ends
 y <- gsub('(\\d)\\s(?=\\d)', '\\1', y, perl = TRUE)  # re-join digits split by a removed comma

This basically removes punctuation and ensures that words are separated by a single space. It leaves you with y as:

 [1] "140000 mostly freeway miles"
 [2] "173k commuter miles"
 [3] "154K all highway miles"
 [4] "124 almost but not mostly freeway miles"
 [5] "1234K MILES"

Now blank out the entries that do not match what you are looking for:

 y <- sub('^(?!\\d+k?\\s((?!miles)[^\\s]+\\s){1,3}miles).*$', '', y, ignore.case = TRUE, perl = TRUE)
 y
 [1] "140000 mostly freeway miles" "173k commuter miles"
 [3] "154K all highway miles"      ""
 [5] ""

Finally, get the "near words":

 y <- sub('^\\d+k?\\s((?!miles)[^\\s]+(\\s(?!miles)[^\\s]+){0,2})\\smiles', '\\1', y, ignore.case = TRUE, perl = TRUE)
 y
 [1] "mostly freeway" "commuter"       "all highway"    ""
 [5] ""

There are probably simpler ways to normalize the data, but this gives you some examples of regular expressions that you can play with.



Use this regex: \d+([.,]\d+)?(?=.*?miles)
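To try this pattern in R, something like the following should work. It is only a sketch: the backslashes are doubled because the pattern sits inside an R string, perl = TRUE is needed for the lookahead, and the third sample string is my own addition to show a non-match.

```r
x <- c("140,000 mostly freeway miles",
       "173k commuter miles. ",
       "no distance given")   # hypothetical negative example

# Same regex as above, with every backslash doubled for the R string literal.
pattern <- "\\d+([.,]\\d+)?(?=.*?miles)"

# regexpr finds the first match per element (-1 where there is none);
# regmatches then extracts the text, skipping elements without a match.
m <- regexpr(pattern, x, perl = TRUE)
numbers <- regmatches(x, m)
print(numbers)
```

Note that this extracts only the number itself; a trailing k, as in 173k, is not part of the match.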


This is still a bit vague, but let's say we define a "word" as anything delimited by spaces. Then, if there can be 1-3 words, there should be 2-4 runs of whitespace between the number and miles (in fact, I will make the first one optional after seeing your last example):

 \d[\d,.]*k?\s*(\S+\s+){1,3}miles 

Note that you must make this regular expression case-insensitive in order to match both k and K.

Also note that the numerical part can certainly be improved. This one will simply take the first digit and then include as many digits, commas and periods as possible, regardless of whether it makes a valid number format or not.
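As a rough check in R (a sketch only, reusing the five sample strings from the earlier answer; hits is just an illustrative name), the pattern can be applied with grepl, doubling the backslashes and setting ignore.case and perl:

```r
x <- c("140,000 mostly freeway miles",
       "173k commuter miles. ",
       "154K(all highway) miles",
       "1,24 almost but not mostly freeway miles",
       "1,2,3,4K MILES")

# The regex from above with backslashes doubled for the R string literal.
pattern <- "\\d[\\d,.]*k?\\s*(\\S+\\s+){1,3}miles"

# ignore.case = TRUE lets k match K and miles match MILES.
hits <- grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
print(hits)
```

The first three strings match and the fourth does not, since it has too many words in between. The last one matches as well: because the first whitespace is optional, the engine can backtrack and treat 4K as the "word" before MILES. That is one more way in which the loose numeric part shows up.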


Source: https://habr.com/ru/post/1447579/

