Page 1 of 1

Converting stored procedure to MySQL REGEXP

Posted: Wed Aug 20, 2008 3:36 am
by sirfragalot
Hi,

In a latin1 MySQL database where I have UTF-8 characters stored in the columns, I occasionally have stored text that is not UTF-8 (due to various HTML FORM input issues) and that, when attempting to convert the database to UTF-8, causes the MySQL UTF-8 converter to truncate the data at the point it finds non-UTF-8 characters.

The thread detailing all this is here: viewtopic.php?f=2&t=86713&st=0&sk=t&sd=a and thanks to dml for the help.

dml came up with a stored procedure that would test each field for UTF-8ness which then allows me to handle the issue realtively gracefully when upgrading the database to UTF-8.

Code: Select all

('CREATE FUNCTION is_invalid_utf8(s BLOB)
        RETURNS int
        DETERMINISTIC
              BEGIN
                DECLARE i INT DEFAULT 1;
                DECLARE c INT;
                DECLARE num_expected_highs INT DEFAULT 0;
                WHILE (i <= length(s)) DO
                  SET c = ord(mid(s, i, 1));
                  IF num_expected_highs > 0 THEN
                    IF (c & 192)<>128 THEN  
                      RETURN 1;
                    END IF;
                    SET num_expected_highs = num_expected_highs - 1;
                  ELSE
                    IF (c & 224) = 192 THEN  
                      SET num_expected_highs = 1;
                    ELSEIF (c & 240) = 224 THEN  
                      SET num_expected_highs = 2;
                    ELSEIF (c & 248) = 240 THEN  
                      SET num_expected_highs = 3;
                    ELSEIF (c & 128 = 128) THEN
                      RETURN 1;
                    END IF;
                  END IF;
                  SET i = (i+1);
                END WHILE;
                RETURN num_expected_highs;
              END')
However, the stored procedure takes a significant time over the number of rows I have to test and the suggestion is that MySQL REGEXP would be quicker at detecting rows with non-UTF-8 characters. I've been trying to use the following (adapted from some PHP code) but without success. In other words, where the stored procedure finds my test-case non-UTF-8 character database field, the REGEXP does not (the result for each field is '0').

Code: Select all

SELECT `text` FROM `WKX_resource_note` WHERE `text` REGEXP
'%(\\?:
[\xC2-\xDF][\x80-\xBF]
|\xE0[\xA0-\xBF][\x80-\xBF]
|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
|\xED[\x80-\x9F][\x80-\xBF]
|\xF0[\x90-\xBF][\x80-\xBF]{2}
|[\xF1-\xF3][\x80-\xBF]{3}
|\xF4[\x80-\x8F][\x80-\xBF]{2}
)+%xs'
Any ideas where I've gone wrong?

Re: Converting stored procedure to MySQL REGEXP

Posted: Wed Aug 20, 2008 7:58 am
by dml
I believe the problem is with the character escape sequences - that "\x80" isn't recognised by mysql as anything other than "\"+"8"+"0" - but I can't find anywhere in the documentation to show how to get those to work as intended in mysql, or if it's possible at all.

Re: Converting stored procedure to MySQL REGEXP

Posted: Wed Aug 20, 2008 11:25 pm
by sirfragalot
I've posted this at mysql.com:
http://forums.mysql.com/read.php?10,223346

I'll respond here if I get an answer.