Evening
Does anyone know the fastest way to determine which Unicode range (script) the characters of a string belong to in PHP? I thought PHP would have something built in, but I can't find anything. Ideally, I want a function that reports that "John Jones" is 100% Latin, or that "Jones Gezik" is 50% Latin and 50% Cyrillic.
With a regex, you can do something like the following:
$strA = 'John Jones';
$strB = ' ј';
$strC = 'Հայաստանի Հանրապետություն';
preg_match( '~[\p{Cyrillic}\p{Common}]+~u', $strB, $res );
But this requires a separate check for every script, which does not seem like a good idea. Alternatively, you could get the Unicode code point of each character and check which range it falls in. But I assume someone has already done something like this.
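To make the goal concrete, here is a minimal sketch of the kind of function I have in mind. The name scriptBreakdown and the default list of scripts are my own assumptions, not an existing PHP API. It still checks each script in turn, just in a loop, and it ignores characters in the Common and Inherited scripts (spaces, digits, punctuation, combining marks) when computing percentages.

```php
<?php
// Hypothetical helper (not a built-in): percentage of characters per script.
// The default script list is an assumption; extend it as needed.
function scriptBreakdown(
    string $text,
    array $scripts = ['Latin', 'Cyrillic', 'Armenian', 'Greek']
): array {
    // Count characters that belong to a real script, i.e. everything
    // except Common (spaces, digits, punctuation) and Inherited (combining marks).
    $total = preg_match_all('/[^\p{Common}\p{Inherited}]/u', $text);
    if (!$total) {
        // false (invalid UTF-8) or 0 (nothing to classify)
        return [];
    }

    $result = [];
    foreach ($scripts as $script) {
        // preg_match_all returns the number of matches for this script.
        $count = preg_match_all('/\p{' . $script . '}/u', $text);
        if ($count > 0) {
            $result[$script] = round(100 * $count / $total, 1);
        }
    }
    return $result;
}

print_r(scriptBreakdown('John Jones'));       // Latin => 100
print_r(scriptBreakdown("\u{0421}roatia"));   // Latin => 85.7, Cyrillic => 14.3
```

Characters from scripts that are not in the list still count toward the total, so the percentages may add up to less than 100.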
EDIT
To give a little more information about why this might be useful: as noted in the comments, people sometimes mix visually identical Latin and Cyrillic characters. For example, this is a search for "Croatia" with a Cyrillic "С" as the first letter and the rest in Latin:
https://www.google.am/search?q=%22%D0%A1roatia%22&aq=f&oq=%22%D0%A1roatia%22
Repeat the search with the Latin letter, and you get about 100,000,000 results instead of 20,000. In such cases, it would be desirable to replace the characters with whichever script is appropriate in the context of the text. Another good example of where such detection is useful is people who use Cyrillic letters to bypass profanity filters.
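As a rough illustration of the replacement idea, the sketch below swaps Cyrillic lookalikes for Latin letters, but only inside words that mix both scripts and are mostly Latin. The function names and the lookalike map are assumptions made for this example, and the map is deliberately incomplete.

```php
<?php
// Illustrative and incomplete map of Cyrillic letters that resemble Latin ones.
const CYRILLIC_TO_LATIN = [
    "\u{0410}" => 'A', "\u{0430}" => 'a',
    "\u{0421}" => 'C', "\u{0441}" => 'c',
    "\u{0415}" => 'E', "\u{0435}" => 'e',
    "\u{041E}" => 'O', "\u{043E}" => 'o',
    "\u{0420}" => 'P', "\u{0440}" => 'p',
    "\u{0425}" => 'X', "\u{0445}" => 'x',
    "\u{0408}" => 'J', "\u{0458}" => 'j',
];

// Hypothetical helper: fix a single word if it mixes scripts and is mostly Latin.
function fixMixedScriptWord(string $word): string
{
    $latin    = preg_match_all('/\p{Latin}/u', $word);
    $cyrillic = preg_match_all('/\p{Cyrillic}/u', $word);

    // Leave pure-Latin, pure-Cyrillic, and mostly-Cyrillic words untouched.
    if ($latin > 0 && $cyrillic > 0 && $latin > $cyrillic) {
        return strtr($word, CYRILLIC_TO_LATIN);
    }
    return $word;
}

// Hypothetical helper: apply the fix to every whitespace-separated word.
function fixMixedScriptText(string $text): string
{
    $fixed = preg_replace_callback(
        '/\S+/u',
        function (array $m): string {
            return fixMixedScriptWord($m[0]);
        },
        $text
    );
    // preg_replace_callback returns null on error (e.g. invalid UTF-8).
    return $fixed ?? $text;
}

echo fixMixedScriptText("\u{0421}roatia"), "\n"; // prints "Croatia"
```

The "mostly Latin" rule is only a heuristic. The reverse direction (Latin lookalikes inside Cyrillic words) would need its own map and the same kind of check.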