In theory, you could do this:
- Find the Unicode ranges you want to check.
- Manually encode the start and end in UTF-8.
- Use the first byte of each of the encoded start and end as a range for REGEXP.
I believe that the CJK range is quite remote from things like the euro symbol, that there would be little or no false positives and false negatives.
Edit: We have now put theory into practice!
Step 1: Select a range of characters. I suggest \ u3000- \ u9fff; easy to test and give us almost perfect results.
Step 2: Encoding in bytes. (Wikipedia utf-8 page)
In our selected range, utf-8 encoded values ββwill always be 3 bytes, the first of which is 1110xxxx, where xxxx is the most significant four bits of the unicode value.
Thus, we want mach bytes in the range from 11100011 to 11101001, or from 0xe3 to 0xe9.
Step 3: Make our regular expression using the very convenient (and just opened by me) UNHEX function.
SELECT * FROM `mydata` WHERE `words` REGEXP CONCAT('[',UNHEX('e3'),'-',UNHEX('e9'),']')
Just tried it. Works like a charm. :)
source share