Unicode character range detection in PHP

Evening

Does anyone know the fastest way to determine which Unicode ranges the characters of a string fall into in PHP? I thought PHP would have something built in, but I can't find anything. Ideally, I want a function that tells me that John Jones is 100% Latin, or that Jones Gezik is 50% Latin and 50% Cyrillic.

With a regex, you can do something like this:

$strA = 'John Jones';
$strB = ' ј';
$strC = 'Հայաստանի Հանրապետություն';
preg_match( '~[\p{Cyrillic}\p{Common}]+~u', $strB, $res );

But this would require checking every range, which does not seem like a good idea. Alternatively, you could get the Unicode value of each character and check which range it falls in. I would guess that someone has already done something like this.
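
To make the goal concrete, here is a minimal sketch of the percentage idea built on PCRE script properties. The function name scriptShares, the default script list, and the sample string 'Jones Гезик' are assumptions made for illustration; only letters are counted, so spaces and punctuation do not affect the shares.

```php
<?php
// Hypothetical helper (not a PHP built-in): percentage of each script
// among the letters of $str. Scripts not listed are reported as 'Other'.
function scriptShares(string $str, array $scripts = ['Latin', 'Cyrillic', 'Armenian']): array
{
    // Split into code points; the /u flag makes PCRE read the input as UTF-8.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // invalid UTF-8 input
    }

    $counts = [];
    $total = 0;
    foreach ($chars as $char) {
        // Skip anything that is not a letter (spaces, digits, punctuation).
        if (!preg_match('/^\p{L}$/u', $char)) {
            continue;
        }
        $total++;
        $name = 'Other';
        foreach ($scripts as $script) {
            if (preg_match('/^\p{' . $script . '}$/u', $char)) {
                $name = $script;
                break;
            }
        }
        $counts[$name] = ($counts[$name] ?? 0) + 1;
    }

    $shares = [];
    foreach ($counts as $name => $count) {
        $shares[$name] = round(100 * $count / $total, 1);
    }
    return $shares;
}

print_r(scriptShares('John Jones'));   // all letters are Latin
print_r(scriptShares('Jones Гезик'));  // five Latin and five Cyrillic letters
```

This still runs one regex per script per letter, so it shows the desired output rather than a fast solution.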

EDIT

To give a little more background on why this might be useful: as noted in the comments, people sometimes mix visually identical Latin and Cyrillic characters. For example, this is a search for Croatia with the Cyrillic letter "С" as the first character and the rest in Latin:

https://www.google.am/search?q=%22%D0%A1roatia%22&aq=f&oq=%22%D0%A1roatia%22

Repeat the search with the Latin letter and you get about 100,000,000 results instead of 20,000. In such cases it would be desirable to replace the characters with whatever fits the context of the text. Another good example of where such detection is useful is people who use Cyrillic letters to bypass profanity filters.
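
A rough sketch of how such mixed words could be flagged, assuming it is enough to test each word for the presence of both Latin and Cyrillic letters. The function name mixedScriptWords is hypothetical, and the "\u{0421}" escape (Cyrillic С) needs PHP 7 or later.

```php
<?php
// Hypothetical helper: return the words of $text that contain
// both Latin and Cyrillic letters.
function mixedScriptWords(string $text): array
{
    // Split on runs of non-letters so each piece is a single word.
    $words = preg_split('/[^\p{L}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return []; // invalid UTF-8 input
    }

    $flagged = [];
    foreach ($words as $word) {
        if (preg_match('/\p{Latin}/u', $word) && preg_match('/\p{Cyrillic}/u', $word)) {
            $flagged[] = $word;
        }
    }
    return $flagged;
}

// The first word starts with Cyrillic С (U+0421), the second with Latin C.
print_r(mixedScriptWords("\u{0421}roatia and Croatia"));
```

Deciding which script a flagged word should be normalized to is a separate problem that this sketch does not attempt.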

1 answer

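The detectRanges method below checks each character against a few hard-coded ranges. Extend it with the ranges you are interested in, taken from this list: http://jrgraphix.net/r/Unicode/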

mb_internal_encoding("UTF-8");
header( "Content-Type: text/html;charset=UTF-8", true );

class DetectUnicodeRanges
{
    function entityToUTF8( $number )
    {
        if( $number < 0 )
                return false;

        # Replace ASCII characters
        if( $number < 128 )
                return chr( $number );

        # Replace illegal Windows characters
        if( $number < 160 )
        {
            switch( $number )
            {
                case 128: $conversion = 8364; break;
                case 129: $conversion = 160; break;
                case 130: $conversion = 8218; break;
                case 131: $conversion = 402; break;
                case 132: $conversion = 8222; break;
                case 133: $conversion = 8230; break;
                case 134: $conversion = 8224; break;
                case 135: $conversion = 8225; break;
                case 136: $conversion = 710; break;
                case 137: $conversion = 8240; break;
                case 138: $conversion = 352; break;
                case 139: $conversion = 8249; break;
                case 140: $conversion = 338; break;
                case 141: $conversion = 160; break;
                case 142: $conversion = 381; break;
                case 143: $conversion = 160; break;
                case 144: $conversion = 160; break;
                case 145: $conversion = 8216; break;
                case 146: $conversion = 8217; break;
                case 147: $conversion = 8220; break;
                case 148: $conversion = 8221; break;
                case 149: $conversion = 8226; break;
                case 150: $conversion = 8211; break;
                case 151: $conversion = 8212; break;
                case 152: $conversion = 732; break;
                case 153: $conversion = 8482; break;
                case 154: $conversion = 353; break;
                case 155: $conversion = 8250; break;
                case 156: $conversion = 339; break;
                case 157: $conversion = 160; break;
                case 158: $conversion = 382; break;
                case 159: $conversion = 376; break;
            }

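            # Encode the mapped code point as UTF-8 instead of returning the bare integer
            return $this->entityToUTF8( $conversion );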
        }

        if ( $number < 2048 )
                return chr( ($number >> 6 ) + 192 ) . chr( ( $number & 63 ) + 128 );
        if ( $number < 65536 )
                return chr( ( $number >> 12 ) + 224 ) . chr( ( ( $number >> 6 ) & 63 ) + 128 ) . chr( ( $number & 63 ) + 128 );
        if ( $number < 2097152 )
                return chr( ( $number >> 18 ) + 240 ) . chr( ( ( $number >> 12 ) & 63 ) + 128 ) . chr( ( ( $number >> 6 ) & 63 ) + 128 ) . chr( ( $number & 63 ) + 128 );

        return false;
    }

    function MBStrToHexes( $str )
    {        
        $str = mb_convert_encoding( $str, 'UCS-4BE' );
        $hexs = array();
        for( $i = 0; $i < mb_strlen( $str, 'UCS-4BE' ); $i++ )
        {        
            $s2 = mb_substr( $str, $i, 1, 'UCS-4BE' );                    
            $val = unpack( 'N', $s2 );
            $hexs[] = str_pad( dechex( $val[1] ), 4, '0', STR_PAD_LEFT );
        }        
        return( $hexs );
    }

    function detectRanges( $str )
    {
        $hexes = $this->MBStrToHexes( $str );
        foreach( $hexes as $hex )
        {
            # Compare code points as integers: hex strings such as '00e9' are
            # numeric strings to PHP (0e9 == 0), so string comparison misfires
            $code = hexdec( $hex );
            if( ( $code >= 0x0041 ) && ( $code <= 0x024f ) )
                echo $this->entityToUTF8( $code ) . ' - Latin<br />';
            elseif( ( $code >= 0x0400 ) && ( $code <= 0x04ff ) )
                echo $this->entityToUTF8( $code ) . ' - Cyrillic<br />';
            elseif( ( $code >= 0x0530 ) && ( $code <= 0x058f ) )
                echo $this->entityToUTF8( $code ) . ' - Armenian<br />';
            else
                echo $this->entityToUTF8( $code ) . ' - Some Other Range<br />';
        }
    }

}

#$strB = 'Cornelius Trow';
$strB = 'Cornelius  Հայաստանի';
#$strB = 'Հայաստանի Հանրապետություն';
echo 'Testing String: ' . $strB . '<br />';
$dur = new DetectUnicodeRanges();
$dur->detectRanges( $strB );
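
As a possible variation, the same range check can be driven by a lookup table and return counts instead of echoing each character. This is only a sketch: the constant UNICODE_RANGES and the function countRanges are hypothetical names, the three ranges are the ones used above, and mb_ord requires PHP 7.2 or later.

```php
<?php
// Hypothetical range table; extend it with the blocks you need.
const UNICODE_RANGES = [
    'Latin'    => [0x0041, 0x024F],
    'Cyrillic' => [0x0400, 0x04FF],
    'Armenian' => [0x0530, 0x058F],
];

// Count how many characters of $str fall into each listed range.
// Characters outside every range are counted under 'Other'.
function countRanges(string $str): array
{
    // Split into code points; the /u flag makes PCRE read the input as UTF-8.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // invalid UTF-8 input
    }

    $counts = [];
    foreach ($chars as $char) {
        $code = mb_ord($char, 'UTF-8');
        $name = 'Other';
        foreach (UNICODE_RANGES as $range => [$first, $last]) {
            if ($code >= $first && $code <= $last) {
                $name = $range;
                break;
            }
        }
        $counts[$name] = ($counts[$name] ?? 0) + 1;
    }
    return $counts;
}

print_r(countRanges('Cornelius Հայաստանի'));
```

Dividing each count by the string length would give the percentages asked for in the question. Note that spaces and punctuation land in 'Other' here, and that the Latin range as written also covers a few non-letter symbols.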

Source: https://habr.com/ru/post/1702709/

