Language detection in PHP (UTF-8)

What code snippets are out there for detecting the language of a chunk of UTF-8 text? I basically need to filter out a large amount of spam that happens to be in Chinese and Arabic. There is a PECL extension for this, but I want to do it in pure PHP code. I think I need to loop through the Unicode string with a Unicode version of ord() and then build some kind of range table for the different languages.

+3
4 answers

Pass the text through Google's language detection. You can do this via AJAX. Here is the documentation / developer guide. For example:

<html>
  <head>
    <script type="text/javascript" src="http://www.google.com/jsapi"></script>
    <script type="text/javascript">

    google.load("language", "1");

    function initialize() {
      var text = document.getElementById("text").innerHTML;
      google.language.detect(text, function(result) {
        if (!result.error && result.language) {
          google.language.translate(text, result.language, "en",
                                    function(result) {
            var translated = document.getElementById("translation");
            if (result.translation) {
              translated.innerHTML = result.translation;
            }
          });
        }
      });
    }
    google.setOnLoadCallback(initialize);

    </script>
  </head>
  <body>
    <div id="text">你好,很高興見到你。</div>
    <div id="translation"></div>
  </body>
</html>
+4

You can convert the UTF-8 string into its "Unicode code points".

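// Decodes a (valid) UTF-8 string into an array of Unicode code points.
function utf8ToUnicode($utf8)
{
    if (!is_string($utf8)) {
        return false;
    }
    $unicode  = array(); // decoded code points
    $mbbytes  = array(); // bytes of the multibyte sequence being collected
    $mblength = 1;       // expected length of that sequence
    $strlen   = strlen($utf8);

    for ($i = 0; $i < $strlen; $i++) {
        // $utf8[$i] replaces the $utf8{$i} syntax, which was removed in PHP 8.
        $byte = ord($utf8[$i]);
        if ($byte < 128) {
            // Single-byte (ASCII) character.
            $unicode[] = $byte;
        } else {
            if (count($mbbytes) == 0) {
                // Lead byte: 110xxxxx = 2 bytes, 1110xxxx = 3 bytes, 11110xxx = 4 bytes.
                $mblength = ($byte < 224) ? 2 : (($byte < 240) ? 3 : 4);
            }
            $mbbytes[] = $byte;
            if (count($mbbytes) == $mblength) {
                if ($mblength == 4) {
                    $unicode[] = ($mbbytes[0] & 7) * 262144 + ($mbbytes[1] & 63) * 4096 + ($mbbytes[2] & 63) * 64 + ($mbbytes[3] & 63);
                } elseif ($mblength == 3) {
                    $unicode[] = ($mbbytes[0] & 15) * 4096 + ($mbbytes[1] & 63) * 64 + ($mbbytes[2] & 63);
                } else {
                    $unicode[] = ($mbbytes[0] & 31) * 64 + ($mbbytes[1] & 63);
                }
                $mbbytes = array();
                $mblength = 1;
            }
        }
    }
    return $unicode;
}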
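
If the mbstring extension is available, the same conversion can probably be done with built-ins instead. The following is only a sketch, assuming PHP 7.2 or later (for mb_ord()); codePoints() is a hypothetical helper name and not part of the answer above.

```php
<?php
// Hypothetical helper: returns the Unicode code points of a UTF-8 string.
// Assumes the mbstring extension (mb_ord, PHP >= 7.2) is loaded.
function codePoints(string $text): array
{
    // The /u modifier makes PCRE split on UTF-8 characters instead of bytes.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return array(); // preg_split() fails on invalid UTF-8
    }
    return array_map(function ($ch) {
        return mb_ord($ch, 'UTF-8');
    }, $chars);
}
```

For example, codePoints("A你") should give the two code points 65 (U+0041) and 20320 (U+4F60).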
+2


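Arabic characters are in the Unicode range 0600-06FF. In addition, the code range 0750-077F (Arabic Supplement) holds extra letters for other languages written in the Arabic script, and 08A0-08FF (Arabic Extended-A) holds further additions. Finally, there are the presentation-form ranges FB50-FDFF and FE70-FEFF, which contain contextual forms and ligatures of letters that are normally encoded in 0600-06FF.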
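
As a rough sketch, the ranges cited in this answer can be put into a small range table and checked per code point. The block names in the comments are the standard Unicode block names; ARABIC_RANGES and inRanges() are hypothetical names chosen for this example, and PHP 7.1 or later is assumed.

```php
<?php
// Range table for the Arabic script (hexadecimal code points).
const ARABIC_RANGES = [
    [0x0600, 0x06FF], // Arabic
    [0x0750, 0x077F], // Arabic Supplement
    [0x08A0, 0x08FF], // Arabic Extended-A
    [0xFB50, 0xFDFF], // Arabic Presentation Forms-A
    [0xFE70, 0xFEFF], // Arabic Presentation Forms-B
];

// Hypothetical helper: true if $codePoint lies in any [low, high] range.
function inRanges(int $codePoint, array $ranges): bool
{
    foreach ($ranges as [$low, $high]) {
        if ($codePoint >= $low && $codePoint <= $high) {
            return true;
        }
    }
    return false;
}
```

The same helper should work for any other script once its ranges are added as a second table.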

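Chinese (Han) characters are spread over several Unicode blocks (the CJK Unified Ideographs blocks). The main one is 4E00-9FD5 (the block itself is allocated up to 9FFF). Keep in mind that these ranges identify a script, not a language, so the same characters also appear in Japanese text, for example; for the complete list of blocks, see the code charts published by the Unicode Consortium.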
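
Instead of listing every block by hand, PCRE's Unicode script properties can do the matching. This is a minimal sketch, assuming PHP's PCRE was built with Unicode property support (the usual case); containsHan() is a hypothetical name. As with the ranges, \p{Han} matches the script and cannot tell Chinese from other languages that use Han characters.

```php
<?php
// Hypothetical helper: true if the UTF-8 text contains at least one Han character.
function containsHan(string $text): bool
{
    // \p{Han} matches any character of the Han script; /u enables UTF-8 mode.
    // preg_match() returns 1 on a match, 0 on no match, false on invalid UTF-8.
    return preg_match('/\p{Han}/u', $text) === 1;
}
```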

So, if you only need to filter out Arabic and Chinese scripts and don't want to use the approach suggested by troelskn (i.e., using lists of common words for the languages you want to identify, which does not scale too well to a large number of languages), it is enough to check which Unicode ranges the characters of your input fall into. An earlier StackOverflow question already covers methods for matching Unicode ranges in PHP.
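
One possible way to turn this into a spam filter is to measure what share of the letters in a message belongs to the unwanted scripts. The sketch below is an illustration only: scriptRatio(), looksLikeSpam() and the 0.5 threshold are assumptions made for this example and would need tuning on real data.

```php
<?php
// Hypothetical helper: fraction of letters that are Arabic or Han (0.0 to 1.0).
function scriptRatio(string $text): float
{
    // Count all letters in any script.
    $letters = preg_match_all('/\p{L}/u', $text);
    if ($letters === false || $letters === 0) {
        return 0.0; // invalid UTF-8 or no letters at all
    }
    // Count only letters (lookahead) that belong to the Arabic or Han script.
    $flagged = preg_match_all('/(?=\p{L})[\p{Arabic}\p{Han}]/u', $text);
    return $flagged / $letters;
}

// Hypothetical helper: flags text once the ratio reaches the (assumed) threshold.
function looksLikeSpam(string $text, float $threshold = 0.5): bool
{
    return scriptRatio($text) >= $threshold;
}
```

Using a ratio rather than a single match means a legitimate message that quotes one Chinese or Arabic word is not rejected outright.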

0

Source: https://habr.com/ru/post/1702708/

