Ruby Regular Expression to match words, including accents and other UTF8 characters

We are trying to find a regular expression that splits sentences into words. The obvious answer is \w, except that it does not break on _, which we need. We then tried [a-zA-Z0-9] (we would like to allow numbers inside words), but that breaks on accented characters, which are quite common in many languages...
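
To make the two failed attempts concrete, here is a minimal sketch. The sample strings are made up for illustration and are not taken from our real data.

```ruby
# Illustrative inputs (assumed, not from the real data).
underscored = "foo_bar baz"
accented    = "déguste"

# \w treats "_" as a word character, so "foo_bar" stays in one piece.
w_words = underscored.scan(/\w+/)

# An ASCII-only class stops at the accented "é" and splits the word in two.
ascii_words = accented.scan(/[a-zA-Z0-9]+/)

p w_words      # => ["foo_bar", "baz"]
p ascii_words  # => ["d", "guste"]
```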

So, ideally, which regular expression should be used to split the following sentence into the following words:

"Je ne déguste pas d'asperges, car je n'aime pas ça"

into

["Je", "ne", "déguste", "pas", "d", "asperges", "car", "je", "n", "aime", "pas", "ça"]

1 answer
str = "Je ne déguste pas d'asperges, car je n'aime pas ça"
# Split on runs of whitespace, commas, and apostrophes.
words = str.split(/[\s,']+/)
# Print one word per line.
words.each { |w| puts w }

Output:

Je
ne
déguste
pas
d
asperges
car
je
n
aime
pas
ça
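
Splitting on a fixed list of separators works for this sentence, but any punctuation that is not listed (a period or an exclamation mark, say) would stay attached to a word. An alternative sketch, which is not part of the answer above, is to match the words themselves with Unicode property classes: \p{L} for any letter and \p{N} for any number. The helper name split_words is hypothetical.

```ruby
# Hypothetical helper: collect runs of Unicode letters and digits.
# "_", apostrophes, and punctuation belong to neither class,
# so they act as separators.
def split_words(text)
  text.scan(/[\p{L}\p{N}]+/)
end

str = "Je ne déguste pas d'asperges, car je n'aime pas ça"
p split_words(str)
# => ["Je", "ne", "déguste", "pas", "d", "asperges", "car", "je", "n", "aime", "pas", "ça"]
```

This assumes the text uses precomposed characters. If accents may arrive as separate combining marks, adding \p{M} to the character class would probably be needed.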

Source: https://habr.com/ru/post/1779516/

