How can I query text containing Asian characters in MySQL?

I have a MySQL table using a UTF-8 character set with a single column called WORDS of type longtext. The values in this column are entered by users and are several thousand characters long.

There are two types of rows in this table:

  • In some rows, the WORDS value is written in English and contains only characters used in ordinary English writing. (Not all of them are necessarily ASCII; for example, the euro symbol may appear in some cases.)

  • In other rows, the WORDS value is written by native speakers of Asian languages (Korean, Chinese, Japanese, and possibly others) and mixes English words with words written in those languages' own scripts (rather than, for example, Japanese romaji).

How can I write a query that returns all rows of type 2 and no rows of type 1? Alternatively, if that is difficult, is there a way to retrieve most of these rows (it is fine if I miss a few rows of type 2 or include some false positives of type 1)?

Update: The comments below suggest I might be better off avoiding the MySQL query engine altogether, because its regex support for Unicode is not very good. If that is true, I can extract the data to a file (using mysql -B -e "some SQL here" > extract.txt) and then run perl or a similar tool on the file. An answer using this method would be fine (but not as good as native MySQL!).
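For the extract-to-file route mentioned in the update, a minimal sketch in Python could look like the following. It assumes the extract holds one row per line and is UTF-8 encoded, and it uses U+3000 through U+9FFF as an assumed "Asian text" range; the function names are hypothetical and the range may need widening for other scripts.

```python
import re

# Assumed range: U+3000 through U+9FFF (CJK punctuation, kana, and the main
# CJK ideograph block). This is an illustrative choice, not a complete list.
CJK_PATTERN = re.compile(r"[\u3000-\u9fff]")


def contains_cjk(text):
    """Return True if text has at least one character in the assumed range."""
    return CJK_PATTERN.search(text) is not None


def filter_cjk_lines(lines):
    """Return only the lines that contain a character in the assumed range."""
    return [line for line in lines if contains_cjk(line)]


def filter_file(path):
    """Read a UTF-8 extract (assumed one row per line) and keep type-2 rows."""
    with open(path, encoding="utf-8") as handle:
        return filter_cjk_lines(handle)
```

Calling filter_file("extract.txt") would then return the type-2 rows from the file produced by the mysql command above.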

2 answers

In theory, you could do this:

  • Find the Unicode ranges you want to check.
  • Manually encode the start and end in UTF-8.
  • Use the first byte of each of the encoded start and end as a range for REGEXP.

I believe the CJK range is far enough removed from things like the euro symbol that there would be few or no false positives and false negatives.

Edit: We have now put theory into practice!

Step 1: Select a range of characters. I suggest \u3000-\u9fff; it is easy to test for and should give us almost perfect results.

Step 2: Encode the range in bytes. (See the Wikipedia UTF-8 page.)

In our selected range, UTF-8 encoded values will always be 3 bytes long, the first of which is 1110xxxx, where xxxx is the most significant four bits of the Unicode value.

Thus, we want to match bytes in the range 11100011 to 11101001, that is, 0xe3 to 0xe9.
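The byte arithmetic in this step can be double-checked outside MySQL. The snippet below is a small Python sketch (the helper name utf8_lead_byte is made up for illustration) that encodes both ends of the range and prints their bytes.

```python
def utf8_lead_byte(code_point):
    """Return the first byte of the UTF-8 encoding of a code point."""
    return chr(code_point).encode("utf-8")[0]


for cp in (0x3000, 0x9FFF):
    encoded = chr(cp).encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    # Show the full encoding plus the lead byte in binary (1110xxxx form).
    print(f"U+{cp:04X} -> {hex_bytes} (lead byte {encoded[0]:08b})")
# prints:
# U+3000 -> e3 80 80 (lead byte 11100011)
# U+9FFF -> e9 bf bf (lead byte 11101001)
```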

Step 3: Build our regular expression using the very convenient (and just discovered by me) UNHEX function.

 SELECT * FROM `mydata` WHERE `words` REGEXP CONCAT('[',UNHEX('e3'),'-',UNHEX('e9'),']') 

Just tried it. Works like a charm. :)
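To see which strings a byte class like this would catch, here is a rough Python emulation. It assumes the server's REGEXP compares byte by byte, which is what the technique relies on, and the helper name matches_like_query is hypothetical. In this sketch the euro symbol (lead byte 0xe2) is not matched, and neither are Hangul syllables (U+AC00 and up, lead bytes 0xea to 0xed), so Korean-only text would need a wider class.

```python
import re

# Byte-level equivalent of the character class built with UNHEX:
# any single byte from 0xE3 through 0xE9.
LEAD_BYTE_CLASS = re.compile(rb"[\xe3-\xe9]")


def matches_like_query(text):
    """Emulate a byte-wise REGEXP match against the UTF-8 bytes of text."""
    return LEAD_BYTE_CLASS.search(text.encode("utf-8")) is not None


# Sample strings chosen for illustration only.
for sample in ("plain English", "costs 5 €", "日本語", "ひらがな", "한국어"):
    print(repr(sample), matches_like_query(sample))
```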


You can also use the HEX value of the character: SELECT * FROM table WHERE <hex code>

Try SELECT HEX(column) FROM table
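As a rough guide to what that hex output looks like, the Python sketch below prints the uppercase hex of a string's UTF-8 bytes. It assumes the column stores its text as UTF-8, and the helper name utf8_hex is made up for illustration.

```python
def utf8_hex(text):
    """Return the uppercase hex of the UTF-8 bytes of text."""
    return text.encode("utf-8").hex().upper()


print(utf8_hex("日"))  # E697A5
print(utf8_hex("€"))   # E282AC
```

Three-byte sequences starting with E3 through E9 correspond to the range discussed in the other answer.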

This may also help: http://dev.mysql.com/doc/refman/5.0/en/faqs-cjk.html


Source: https://habr.com/ru/post/1340443/