unicode woes, need help displaying text from a file
Posted: Sat Sep 20, 2008 10:23 pm
Hi there,
I'm having some issues trying to read a file with zero-padded text.
mb_detect_encoding() reports the encoding scheme as UTF-8.
I'm not exactly a Unicode expert, so this has been pretty much trial and error for the past few days.
The function unicode_test() simply returns some text that I am attempting to display from a file containing Unicode.
Code:
function unicode_test() {
  ob_start();

  // Bad file
  $content = file_get_contents('C:\Documents and Settings\twig\Desktop\blah.vmg');

  // This displays: UTF-8
  drupal_set_message("content encoding = " . mb_detect_encoding($content));

  // This displays: r?e?j?e?c?t?e?d? ?t???c?k?e?t?
  echo $content;

  // Test code that strips the zero padding to leave plain ASCII text.
  // It works fine unless the text contains accented characters (such as the
  // grave-accented 'ì', U+00EC / decimal 236, in "tìcket"), which halt my PHP
  // script and cause weird issues.
  // This displays: rejected t?cket
  $content = str_replace("\0", '', $content);

  // Works but deletes accented character
  // This displays: rejected t
  echo mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content)) . '<br />';

  // Stops converting when it hits an accented character.
  // This displays: rejected t?cket
  echo drupal_convert_to_utf8($content, mb_detect_encoding($content)) . '<br />';

  // Using the Drupal string functions
  /*
  This displays:
  # 0 = r
  # ... (trimmed 'reject' output, displayed as expected)
  # 9 = t
  # 10 = ?ck
  # 11 = e
  # 12 = t
  */
  for ($i = 0; $i < drupal_strlen($content); $i++) {
    $c = drupal_substr($content, $i, 1);
    drupal_set_message("$i = $c");
  }

  // Using the normal PHP functions
  /*
  This displays:
  # 0 = r
  ...
  # 9 = t
  # 10 = ?
  # 11 = c
  # 12 = k
  # 13 = e
  # 14 = t
  */
  for ($i = 0; $i < strlen($content); $i++) {
    $c = substr($content, $i, 1);
    drupal_set_message("$i = $c");
  }

  // This displays '0'. Strange, because I removed all "\0" characters!
  echo intval(substr($content, 10, 1));

  return ob_get_clean();
}
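A note on that last check, offered as a sketch rather than a diagnosis: intval() parses a decimal number from the start of a string and returns 0 when there is none, so a 0 here does not have to mean a leftover "\0". ord() reports the actual byte value. The variable $byte is a hypothetical stand-in for substr($content, 10, 1), assuming that byte is 0xEC.

```php
<?php
// Hypothetical stand-in for the byte at offset 10 once the NULs are stripped.
$byte = "\xEC";

// intval() finds no leading digits in this string, so it returns 0.
var_dump(intval($byte));    // int(0)

// The byte is not a NUL character.
var_dump($byte === "\0");   // bool(false)

// ord() returns the numeric value of the byte itself.
var_dump(ord($byte));       // int(236)
```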
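Curious to see what that magical "ì" character really was, I trimmed $content down to just the "ticket" part and ran it through unpack() to see its true value. I was surprised to find this.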
Code:
$r = unpack('c*', $content);
drupal_set_message('<pre>' . print_r($r, true) . '</pre>');

Array(
[1] => 116
[2] => -20
[3] => 99
[4] => 107
[5] => 101
[6] => 116
)
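One way to read the negative number: the 'c' format code in unpack() means signed char, so any byte above 127 comes back negative, while 'C' (unsigned char) returns the same byte in the 0..255 range. A minimal sketch, where $ticket is a hand-written stand-in for the trimmed $content and is an assumption about its bytes:

```php
<?php
// Hypothetical reconstruction of the six bytes in the dump: t, 0xEC, c, k, e, t.
$ticket = "t\xECcket";

$signed   = unpack('c*', $ticket);  // 'c' = signed char, range -128..127
$unsigned = unpack('C*', $ticket);  // 'C' = unsigned char, range 0..255

echo $signed[2], "\n";              // -20
echo $unsigned[2], "\n";            // 236
echo dechex($unsigned[2]), "\n";    // ec
```

Read as unsigned, the byte is 236 (0xEC), which is the code point of "ì" (U+00EC).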
The value unpack() reports for the "ì" is -20.
Now I am truly baffled.
If anyone can help, or has any suggestions, please let me know!
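A possible direction, sketched under an assumption the post does not confirm: text with a zero byte after every character looks like UTF-16LE, and in that case converting from 'UTF-16LE' may work better than stripping the zeros. Stripping leaves a lone 0xEC byte, which is not valid UTF-8 on its own. The sample string $raw is hypothetical and is built by hand instead of being read from the .vmg file; no byte order mark is assumed.

```php
<?php
// Hypothetical sample: "rejected tìcket" as UTF-16LE, i.e. each character
// followed by a zero byte, with "ì" stored as 0xEC 0x00.
$raw = '';
foreach (str_split("rejected t\xECcket") as $ch) {
    $raw .= $ch . "\0";
}

// Convert from the assumed source encoding instead of removing the zero bytes.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

echo $utf8, "\n";                        // rejected tìcket
echo mb_strlen($utf8, 'UTF-8'), "\n";    // 15 characters
echo strlen($utf8), "\n";                // 16 bytes ("ì" takes two in UTF-8)
```

If the real file starts with a byte order mark (0xFF 0xFE), that would need to be removed or handled separately.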