In a latin1 MySQL database where I have UTF-8 characters stored in the columns, I occasionally have stored text that is not UTF-8 (due to various HTML FORM input issues) and that, when attempting to convert the database to UTF-8, causes the MySQL UTF-8 converter to truncate the data at the point it finds non-UTF-8 characters.
The thread detailing all this is here: viewtopic.php?f=2&t=86713&st=0&sk=t&sd=a and thanks to dml for the help.
dml came up with a stored procedure that would test each field for UTF-8ness which then allows me to handle the issue realtively gracefully when upgrading the database to UTF-8.
Code: Select all
('CREATE FUNCTION is_invalid_utf8(s BLOB)
RETURNS int
DETERMINISTIC
BEGIN
DECLARE i INT DEFAULT 1;
DECLARE c INT;
DECLARE num_expected_highs INT DEFAULT 0;
WHILE (i <= length(s)) DO
SET c = ord(mid(s, i, 1));
IF num_expected_highs > 0 THEN
IF (c & 192)<>128 THEN
RETURN 1;
END IF;
SET num_expected_highs = num_expected_highs - 1;
ELSE
IF (c & 224) = 192 THEN
SET num_expected_highs = 1;
ELSEIF (c & 240) = 224 THEN
SET num_expected_highs = 2;
ELSEIF (c & 248) = 240 THEN
SET num_expected_highs = 3;
ELSEIF (c & 128 = 128) THEN
RETURN 1;
END IF;
END IF;
SET i = (i+1);
END WHILE;
RETURN num_expected_highs;
END')Code: Select all
SELECT `text` FROM `WKX_resource_note` WHERE `text` REGEXP
'%(\\?:
[\xC2-\xDF][\x80-\xBF]
|\xE0[\xA0-\xBF][\x80-\xBF]
|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
|\xED[\x80-\x9F][\x80-\xBF]
|\xF0[\x90-\xBF][\x80-\xBF]{2}
|[\xF1-\xF3][\x80-\xBF]{3}
|\xF4[\x80-\x8F][\x80-\xBF]{2}
)+%xs'