Language detection in PHP (UTF-8)

Question

What code snippets are out there for detecting the language of a chunk of UTF-8 text? I basically need to filter out a lot of spam that happens to be in Chinese and Arabic. There is a PECL extension for that, but I want to do this in pure PHP code. I think I need to loop through the Unicode string with a Unicode version of ord() and then build some kind of range table for the different languages.

php internationalization

deadprogrammer Feb 04 '09 at 18:10

4 answers

cletus · Answer 1 · 2009-02-04T22:01:55+0000

Pass the text through Google's language detection. You can do this through AJAX. Here is the documentation / developer guide. For example:

<html>
  <head>
    <script type="text/javascript" src="http://www.google.com/jsapi"></script>
    <script type="text/javascript">

    google.load("language", "1");

    function initialize() {
      var text = document.getElementById("text").innerHTML;
      google.language.detect(text, function(result) {
        if (!result.error && result.language) {
          google.language.translate(text, result.language, "en",
                                    function(result) {
            var translated = document.getElementById("translation");
            if (result.translation) {
              translated.innerHTML = result.translation;
            }
          });
        }
      });
    }
    google.setOnLoadCallback(initialize);

    </script>
  </head>
  <body>
    <div id="text">你好，很高興見到你。</div>
    <div id="translation"></div>
  </body>
</html>

Gumbo · Answer 2 · 2009-02-04T18:14:57+0000

You could convert the UTF-8 string into its Unicode code points and look for "suspicious ranges".

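// Decodes a UTF-8 string into an array of Unicode code points.
// Assumes well-formed UTF-8; malformed input is not detected.
function utf8ToUnicode($utf8)
{
    if (!is_string($utf8)) {
        return false;
    }
    $unicode  = array();   // decoded code points
    $mbbytes  = array();   // bytes of the multi-byte sequence being collected
    $mblength = 1;         // expected length of the current sequence
    $strlen   = strlen($utf8);

    for ($i = 0; $i < $strlen; $i++) {
        // The $utf8{$i} offset syntax was removed in PHP 8; use square brackets.
        $byte = ord($utf8[$i]);
        if ($byte < 128) {
            // ASCII: the byte value is the code point
            $unicode[] = $byte;
        } else {
            if (count($mbbytes) == 0) {
                // Lead byte: 110xxxxx = 2 bytes, 1110xxxx = 3 bytes, 11110xxx = 4 bytes
                $mblength = ($byte < 224) ? 2 : (($byte < 240) ? 3 : 4);
            }
            $mbbytes[] = $byte;
            if (count($mbbytes) == $mblength) {
                if ($mblength == 4) {
                    // Code points above U+FFFF (the original handled at most 3 bytes)
                    $unicode[] = ($mbbytes[0] & 7) * 262144 + ($mbbytes[1] & 63) * 4096 + ($mbbytes[2] & 63) * 64 + ($mbbytes[3] & 63);
                } elseif ($mblength == 3) {
                    $unicode[] = ($mbbytes[0] & 15) * 4096 + ($mbbytes[1] & 63) * 64 + ($mbbytes[2] & 63);
                } else {
                    $unicode[] = ($mbbytes[0] & 31) * 64 + ($mbbytes[1] & 63);
                }
                $mbbytes = array();
                $mblength = 1;
            }
        }
    }
    return $unicode;
}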
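
Once the text is an array of code points, the "range table" idea from the question could be sketched roughly as follows. This is only an illustration: the names `$scriptRanges` and `scriptShare` are made up for this example, the ranges are the ones quoted elsewhere in this thread, and the input array stands in for whatever utf8ToUnicode() returns.

```php
<?php
// Hypothetical range table; each entry is array(first code point, last code point).
$scriptRanges = array(
    'arabic'  => array(
        array(0x0600, 0x06FF), array(0x0750, 0x077F), array(0x08A0, 0x08FF),
        array(0xFB50, 0xFDFF), array(0xFE70, 0xFEFF),
    ),
    'chinese' => array(array(0x4E00, 0x9FD5)),
);

// Returns the fraction (0 to 1) of code points that fall inside the given ranges.
function scriptShare(array $codePoints, array $ranges)
{
    if (count($codePoints) == 0) {
        return 0.0;
    }
    $hits = 0;
    foreach ($codePoints as $cp) {
        foreach ($ranges as $range) {
            if ($cp >= $range[0] && $cp <= $range[1]) {
                $hits++;
                break; // count each code point at most once
            }
        }
    }
    return $hits / count($codePoints);
}

// Code points for "你好" followed by the letter "A"
$codePoints = array(0x4F60, 0x597D, 0x41);
$share = scriptShare($codePoints, $scriptRanges['chinese']);
```

Here two of the three code points fall in the Chinese range. What share counts as spam is a judgment call that depends on your real traffic.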

troelskn · Answer 3 · 2009-02-04T20:14:35+0000

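You could keep a list of common words for each language you want to identify and check the text against those lists.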
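
A word-list approach (matching the input against common words of each candidate language) might look like the sketch below. The word lists are tiny placeholders and the function name `guessLanguageByWords` is invented for this example. Note that it relies on words being separated by non-letters, so it does not help with unsegmented Chinese text.

```php
<?php
// Hypothetical, deliberately tiny word lists; real lists would be much longer.
$commonWords = array(
    'en' => array('the', 'and', 'is', 'of', 'to'),
    'de' => array('der', 'die', 'und', 'ist', 'das'),
);

// Returns the language code whose list matches most often, or null if nothing matches.
function guessLanguageByWords($text, array $commonWords)
{
    $best = null;
    $bestCount = 0;
    foreach ($commonWords as $lang => $words) {
        $quoted = array();
        foreach ($words as $word) {
            $quoted[] = preg_quote($word, '/');
        }
        // Whole words only (no letter directly before or after), case-insensitive, UTF-8 aware.
        $pattern = '/(?<!\pL)(?:' . implode('|', $quoted) . ')(?!\pL)/iu';
        $count = preg_match_all($pattern, $text);
        if ($count > $bestCount) {
            $best = $lang;
            $bestCount = $count;
        }
    }
    return $best;
}

$guess = guessLanguageByWords('The cat is on the roof and the dog is out', $commonWords);
```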

Christophe Strobbe · Answer 4 · 2016-08-09T09:51:11+0000

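Arabic characters are mainly in the Unicode range 0600-06FF. Unicode also defines further Arabic blocks: Arabic Supplement (code range 0750-077F), Arabic Extended-A (08A0-08FF), and the Arabic presentation forms (FB50-FDFF and FE70-FEFF). Ordinary Arabic text, however, mostly falls within 0600-06FF.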

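Chinese (Han) characters are encoded in Unicode as CJK Unified Ideographs; the main range is 4E00-9FD5. Strictly speaking, this identifies a script rather than a language, since the same ideographs are also used in Japanese and Korean text; the Unicode Consortium documents the blocks in detail.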

So, if you only need to filter out Arabic and Chinese scripts and don't want to use the approach suggested by troelskn (using lists of common words for the languages you want to identify, which does not scale well to a large number of languages), it is enough to check which Unicode ranges the characters in your input fall into. An earlier Stack Overflow question already covers methods for checking Unicode ranges in PHP.
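
As a rough illustration of such a range check done directly on the string, PCRE's `/u` modifier lets a character class list the code point ranges mentioned above. The function name `looksArabicOrChinese` and the 0.3 threshold are assumptions made up for this sketch, not something taken from the linked question.

```php
<?php
// Hypothetical helper: true if more than $threshold of the characters
// fall into the Arabic or Chinese ranges listed in this answer.
function looksArabicOrChinese($text, $threshold = 0.3)
{
    // Count all characters; /u makes PCRE treat the input as UTF-8, /s lets "." match newlines.
    $total = preg_match_all('/./us', $text);
    if ($total === false || $total === 0) {
        return false; // empty input, or not valid UTF-8
    }
    $pattern = '/[\x{0600}-\x{06FF}\x{0750}-\x{077F}\x{08A0}-\x{08FF}'
             . '\x{FB50}-\x{FDFF}\x{FE70}-\x{FEFF}\x{4E00}-\x{9FD5}]/u';
    $hits = preg_match_all($pattern, $text);
    return ($hits / $total) > $threshold;
}

$isSpamCandidate = looksArabicOrChinese('你好，很高興見到你。');
```

Using a share of characters rather than a single match avoids flagging mostly-Latin text that merely quotes a foreign word; the right threshold would have to be tuned on real data.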
