Ruby Regular Expression to match words, including accents and other UTF8 characters

Question

We are trying to find a regular expression that lets us split sentences into words. The obvious answer is \w, except that it does not break on _, which we need. We then tried [a-zA-Z0-9] (we would like to allow digits inside words), but that breaks on accented characters, which are quite common in many languages.
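To make the two failures concrete, here is a small sketch. The sample strings and variable names are made up for illustration and are not part of the original question.

```ruby
# encoding: utf-8
# Underscore counts as a word character, so \w+ never splits on it.
underscore_tokens = "foo_bar baz".scan(/\w+/)

# An ASCII-only class stops at the first accented letter.
ascii_tokens = "déguste".scan(/[a-zA-Z0-9]+/)

p underscore_tokens
p ascii_tokens
```

The first call keeps "foo_bar" as a single token, and the second cuts "déguste" in two around the "é".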

So, ideally, which regular expression should be used to split the following sentence into the following words:

"Je ne déguste pas d'asperges, car je n'aime pas ça"

into

["Je", "ne", "déguste", "pas", "d", "asperges", "car", "je", "n", "aime", "pas", "ça"]

ruby regex

Julien Genestoux Dec 10 '10 at 1:14

1 answer

Brent newey · Accepted Answer · 2010-12-10T01:59:34+0000

# Split on runs of whitespace, commas, and apostrophes.
sentence = "Je ne déguste pas d'asperges, car je n'aime pas ça"
words = sentence.split(/[\s,']+/)
words.each { |w| puts w }

Output:

Je
ne
déguste
pas
d
asperges
car
je
n
aime
pas
ça
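
The split above depends on listing every separator by hand. As an alternative sketch (not part of the accepted answer), one could instead match the words themselves with Unicode property classes, which Ruby's regex engine supports on UTF-8 strings. The constant name WORD_PATTERN is hypothetical, and treating combining marks (\p{M}) as part of a word is an assumption about the input.

```ruby
# encoding: utf-8
# \p{L} = any letter, \p{M} = combining marks (for decomposed accents),
# \p{N} = any digit. Underscore is in none of these, so it acts as a separator.
WORD_PATTERN = /[\p{L}\p{M}\p{N}]+/

sentence = "Je ne déguste pas d'asperges, car je n'aime pas ça"
words = sentence.scan(WORD_PATTERN)

puts words
```

Because this matches words rather than separators, other punctuation such as "!" or ";" is skipped without having to extend the pattern.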