Regular expression does not work properly with Turkish characters

I am writing a regular expression that should match the following patterns:

  • "çççoookkk gggüüüzzzeeelll" (meaning vvveeerrryyy gggoooddd with Turkish characters "ç" and "ü").
  • "ccccoookkk ggguuuzzzeeelll" (this means the same, but with the English characters "c" and "u").

Here are the regular expressions I'm trying:

  • "\b[çc]+o+k+\sg+[üu]+z+e+l+\b" : this works with the English characters, but not with the Turkish ones.
  • "çok" : finds "çok", but "ç+o+k+" does not work for "çççoookkk"; it only finds "çoookkk".
  • "güzel" : finds "güzel", but "g+ü+z+e+l+" does not work for "gggüüüzzzeeelll".
  • "\b(c+o+k+)|(ç+o+k+)\s(g+u+z+e+l)|(g+ü+z+e+l+)\b" : does not work properly.
  • "[çc]ok\sg[uü]zel" : I also tried this to match plain "çok güzel", but it doesn't work either.

I think the problem may be with the use of regex operators with Turkish characters. I do not know how I can solve this.

I use http://www.myregextester.com to check the validity of my regular expressions.

I use PHP to extract a specific pattern from tweets fetched via the Twitter REST API.

Thanks,

1 answer

You did not specify which programming language you are using, but in many of them the \b word-boundary assertion only works with plain ASCII characters.

Internally, \b is treated as the boundary between the \w and \W sets.
In turn, \w is [a-zA-Z0-9_] in those engines, so "ç" and "ü" count as non-word characters.
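
To see the effect in isolation, here is a minimal sketch in Python (not the asker's environment; it is used only because its re module can switch between Unicode-aware and ASCII-only classes). The re.ASCII flag stands in for an engine whose \w is [a-zA-Z0-9_]:

```python
import re

text = "çok güzel"

# Unicode-aware \w (the default for str patterns in Python 3):
# "ç" is a word character, so a word boundary exists before it.
unicode_match = re.search(r"\bçok\b", text)

# ASCII-only \w, i.e. [a-zA-Z0-9_]: "ç" is a non-word character, so there is
# no boundary between the start of the string and "ç", and the search fails.
ascii_match = re.search(r"\bçok\b", text, re.ASCII)

print(unicode_match is not None)  # True
print(ascii_match is not None)    # False
```

The pattern and the input are identical in both calls; only the definition of \w changes, which is enough to make \b fail in front of "ç".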

If you are not using any fancy space markers (you shouldn't), consider using the ordinary whitespace class ( \s ) as the delimiter instead of \b.
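
One way to delimit on whitespace is with lookarounds on \S, sketched below in Python. The re.ASCII flag again imitates an engine without Unicode support, and the sample tweet text is made up for illustration:

```python
import re

# (?<!\S) means "preceded by whitespace or the start of the string";
# (?!\S) means "followed by whitespace or the end of the string".
# Neither depends on what the engine considers a word character.
pattern = re.compile(r"(?<!\S)[çc]+o+k+\sg+[üu]+z+e+l+(?!\S)", re.ASCII)

# Hypothetical sample inputs, one per spelling from the question.
turkish = pattern.search("bu film çççoookkk gggüüüzzzeeelll olmuş")
english = pattern.search("bu film ccccoookkk ggguuuzzzeeelll olmus")

print(turkish.group() if turkish else None)
print(english.group() if english else None)
```

The trade-off is that a match directly followed by punctuation (for example "güzel!") is rejected, so the lookarounds may need to be widened for real tweets.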

See this table (scroll down to Word Boundary) to check whether your language supports Unicode for \b. If it says "ascii", it does not.

As a side note, depending on your programming language, you can write the Unicode code points directly instead of the national characters.
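
As a rough illustration in Python, whose re module accepts \uXXXX escapes inside patterns (other engines use different escape syntax), the pattern source can stay pure ASCII. This assumes the input uses the precomposed characters U+00E7 ("ç") and U+00FC ("ü"):

```python
import re

# \u00e7 is "ç" and \u00fc is "ü"; no national characters appear in the pattern.
pattern = re.compile(r"[\u00e7c]+o+k+\sg+[\u00fcu]+z+e+l+")

# The two sample strings from the question.
samples = ["çççoookkk gggüüüzzzeeelll", "ccccoookkk ggguuuzzzeeelll"]
results = [pattern.fullmatch(s) is not None for s in samples]

print(results)  # [True, True]
```

Writing the code points explicitly also sidesteps any mismatch between the encoding of the source file and the encoding of the text being searched.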

See also: regex of utf-8 in javascript


Source: https://habr.com/ru/post/1482799/

