How to check whether a binary string is valid UTF-8 in MySQL?

I found a Perl regex that checks whether a string is valid UTF-8 (the regex is from the W3C site).

    $field =~ m/\A(
        [\x09\x0A\x0D\x20-\x7E]            # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z/x;
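For spot-checking a handful of sample values outside the database, the same pattern can be tried as a byte-level regex. This is only a sketch in Python, not a MySQL solution; the names `UTF8_PATTERN` and `is_valid_utf8` are made up for illustration.

```python
import re

# Byte-level port of the W3C pattern quoted above. re.VERBOSE permits the
# layout and the comments; \Z is Python's end-of-input anchor (Perl's \z).
UTF8_PATTERN = re.compile(rb"""
    \A(?:
        [\x09\x0A\x0D\x20-\x7E]            # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\Z
""", re.VERBOSE)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string is matched by the pattern."""
    return UTF8_PATTERN.match(data) is not None


# Hypothetical sample values: valid text, a lone latin1 byte, an overlong
# sequence, and an encoded surrogate.
samples = [b"plain ASCII", "caf\u00e9".encode("utf-8"), b"\xe9", b"\xc0\xaf", b"\xed\xa0\x80"]
for sample in samples:
    print(sample, is_valid_utf8(sample))
```

Note that the ASCII branch only admits tab, LF, CR and the printable range, so other control bytes such as `\x00` are rejected by this pattern as well.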

But I'm not sure how to port it to MySQL, since MySQL regular expressions don't seem to support hexadecimal escapes for characters (see this question).

Any thoughts on how to port the regexp to MySQL? Or maybe you know some other way to check whether a string is valid UTF-8?

UPDATE: I need this check to work in MySQL, because I have to run it on the server in order to fix broken tables. I cannot pass the data through a script, since the database is about 1 TB.

2 answers

I was able to repair my database using a test that only works if your data can be represented in a single-byte encoding; in my case that was latin1.

I used the fact that MySQL replaces bytes that are not valid UTF-8 with '?' when converting to latin1.

Here's what the check looks like:

 SELECT (CONVERT(CONVERT(potentially_broken_column USING latin1) USING utf8) != potentially_broken_column) AS INVALID ....
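To see why the round trip flags broken values, here is a small Python model of the idea as described in this answer. It is a sketch only: the function name is hypothetical, and Python's `latin-1` codec is merely an approximation of MySQL's latin1 (which is based on cp1252), so it illustrates the logic and does not reproduce the server's exact behavior.

```python
def survives_latin1_round_trip(raw: bytes) -> bool:
    """Model of the check: utf8 -> latin1 -> utf8, then compare with the input."""
    # Bytes that are not valid UTF-8 become U+FFFD when decoding.
    text = raw.decode("utf-8", errors="replace")
    # Anything latin-1 cannot represent (including U+FFFD) becomes b"?".
    as_latin1 = text.encode("latin-1", errors="replace")
    # Convert back to UTF-8 and compare with the original bytes.
    return as_latin1.decode("latin-1").encode("utf-8") == raw


print(survives_latin1_round_trip(b"caf\xc3\xa9"))   # valid UTF-8, fits in latin-1
print(survives_latin1_round_trip(b"caf\xe9"))       # lone latin1 byte, not UTF-8
print(survives_latin1_round_trip(b"\xe4\xb8\xad"))  # valid UTF-8, outside latin-1
```

The last sample shows the limitation mentioned at the start of this answer: a valid character that has no single-byte representation is also turned into '?', so it is reported as invalid.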

If you control both the input and output sides of this database, you should be able to validate your UTF-8 data on whichever side you prefer and enforce restrictions where necessary. If you do not control the input side, you will have to check the data after you pull it out, possibly converting it in your language of choice (Perl, it looks like).

A database is REALLY good at storage, but it should not be pushed hard into other jobs. I think this is one place where you should just let MySQL store the data until you need to do something further with it.

If you want to continue the path you are on, check out this MySQL manual page: http://dev.mysql.com/doc/refman/5.0/en/regexp.html

REGEX syntax is generally VERY similar between languages (in fact, I almost always copy patterns between JavaScript, PHP and Perl with only a few adjustments for each language's functions), so if it works as a REGEX, you should be able to port it easily.

GL!

EDIT: Look at this Stack Overflow post. Given that you cannot use scripts to process the data, you can use stored procedures: Regular expressions in stored procedures

Using stored procedures, you can loop through the data and do a lot of processing without leaving MySQL. That second post will direct you back to the manual page I listed, so I think you need to get your REGEX working first and then look into stored procedures.


Source: https://habr.com/ru/post/1300287/

