unicode woes, need help displaying text from a file

koonkii
Forum Newbie
Posts: 2
Joined: Sat Sep 20, 2008 10:16 pm

Post by koonkii »

hi there,

i'm having some issues trying to read a file with zero-padded text.
mb_detect_encoding() reports the encoding as UTF-8.
i'm not exactly a unicode expert, so this has pretty much been trial and error for me for the past few days.
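A note on the detection result: mb_detect_encoding() only picks among a list of candidate encodings, so its answer is best treated as a guess. A minimal sketch of a stricter check follows, assuming the mbstring extension is loaded. The `$sample` bytes are a hypothetical stand-in for the file contents (the word "tìcket" with a zero byte after every character), not data taken from the actual file.

```php
<?php
// Hypothetical sample: "tìcket" with each character followed by a zero byte,
// which is how UTF-16LE stores these characters.
$sample = "t\x00\xEC\x00c\x00k\x00e\x00t\x00";

// Validate instead of detect: in UTF-8 the byte 0xEC must be followed by
// continuation bytes (0x80-0xBF), and 0x00 is not one of them.
$isUtf8 = mb_check_encoding($sample, 'UTF-8');

// One zero byte per character is a strong hint that the data is UTF-16.
$nulCount = substr_count($sample, "\0");

var_dump($isUtf8);     // bool(false)
echo $nulCount, "\n";  // 6
```

If the check fails like this, the "UTF-8" label is not reliable and the source encoding has to be named explicitly.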

the function unicode_test() simply returns some text which i am attempting to display from a file containing unicode.

Code: Select all

 
function unicode_test() {
  ob_start();
 
  // Bad file
  $content = file_get_contents('C:\Documents and Settings\twig\Desktop\blah.vmg');
 
  // This displays: UTF-8
  drupal_set_message("content encoding = " . mb_detect_encoding($content));
 
  // This displays: r?e?j?e?c?t?e?d? ?t???c?k?e?t?
  echo $content;
 
  // This is some test code that removes the zero-padding and turns the text into normal ASCII.
  // It works fine unless the text contains accented characters (such as the grave-accented 'ì', U+00EC, decimal 236, in "tìcket"), which halt my PHP script and cause weird issues.
  // This displays: rejected t?cket
  $content = str_replace("\0", '', $content);
 
  // Works but deletes accented character
  // This displays: rejected t
  echo mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content)) . '<br />';
 
  // Stops converting when it hits an accented character.
  // This displays: rejected t?cket
  echo drupal_convert_to_utf8($content, mb_detect_encoding($content)) . '<br />';
 
  /*
  This displays:
# 0 = r
# ... (trimmed 'reject' output, displayed as expected)
# 9 = t
# 10 = ?ck
# 11 = e
# 12 = t
  */
  for ($i = 0; $i < drupal_strlen($content); $i++) {
    $c = drupal_substr($content, $i, 1);
    drupal_set_message("$i = $c");
  }
 
  // Using the normal PHP functions
  /*
  This displays:
# 0 = r
...
# 9 = t
# 10 = ?
# 11 = c
# 12 = k
# 13 = e
# 14 = t
  */
  for ($i = 0; $i < strlen($content); $i++) {
    $c = substr($content, $i, 1);
    drupal_set_message("$i = $c");
  }
 
  // This displays '0'. Strange, because I removed all "\0" characters!
  echo intval(substr($content, 10, 1));
 
  return ob_get_clean();
}
 
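Zero-padded text usually points to UTF-16 rather than UTF-8. Below is a minimal sketch of converting the bytes before doing anything else with them, assuming the file is UTF-16LE (the post does not confirm this) and that mbstring is available. The helper name `utf16le_to_utf8` and the `$raw` sample are hypothetical; in the real function `$raw` would come from file_get_contents().

```php
<?php
// Hypothetical helper: convert UTF-16LE bytes to UTF-8.
function utf16le_to_utf8($raw) {
  // Drop a leading byte-order mark (FF FE) if the file starts with one.
  if (substr($raw, 0, 2) === "\xFF\xFE") {
    $raw = substr($raw, 2);
  }
  // Name the source encoding explicitly instead of relying on detection.
  return mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');
}

// Stand-in for the file contents: BOM followed by "tìcket" in UTF-16LE.
$raw  = "\xFF\xFE" . "t\x00\xEC\x00c\x00k\x00e\x00t\x00";
$utf8 = utf16le_to_utf8($raw);

echo bin2hex($utf8), "\n"; // 74c3ac636b6574
echo strlen($utf8), "\n";  // 7
```

The "ì" becomes the two bytes C3 AC, so the six-character word takes seven bytes in UTF-8. Stripping the zero bytes with str_replace() instead leaves a lone 0xEC, which is not valid UTF-8 on its own.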
Curious to see what that magical "ì" character really was, i used unpack() to look at its raw byte value.
I trimmed $content down to be just the "ticket" part, and was surprised to find this.

Code: Select all

 
  $r = unpack('c*', $content);
  drupal_set_message('<pre>' . print_r($r, true) . '</pre>');
 
Array(
[1] => 116
[2] => -20
[3] => 99
[4] => 107
[5] => 101
[6] => 116
)

The unicode "ì" value is -20.
Now I am truly baffled.
If anyone can help, or has any suggestions, please let me know!
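A note on the -20: the 'c' format of unpack() reads each byte as a signed char, so the byte 0xEC shows up as 236 - 256 = -20. Read unsigned with 'C', the same byte is 236, which matches the code point of "ì" (U+00EC). A small sketch, where `$bytes` is a hypothetical reconstruction of the trimmed "ticket" string from the unpack() output above:

```php
<?php
// Hypothetical reconstruction of the trimmed string: 0xEC is the single
// byte left over for "ì" once the zero bytes were stripped.
$bytes = "t\xECcket";

$signed   = unpack('c*', $bytes); // 'c' = signed char
$unsigned = unpack('C*', $bytes); // 'C' = unsigned char

echo $signed[2], "\n";           // -20
echo $unsigned[2], "\n";         // 236
echo dechex($unsigned[2]), "\n"; // ec

// intval() parses a string as a number, so a non-numeric byte gives 0.
// ord() is what returns the byte value.
echo intval($bytes[1]), "\n";    // 0
echo ord($bytes[1]), "\n";       // 236
```

So the '0' printed by intval(substr($content, 10, 1)) does not mean a "\0" byte survived; it only means the character is not a digit.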
Punkis
Forum Newbie
Posts: 12
Joined: Sat Sep 20, 2008 4:07 pm

Re: unicode woes, need help displaying text from a file

Post by Punkis »

i see two errors in your code!
koonkii
Forum Newbie
Posts: 2
Joined: Sat Sep 20, 2008 10:16 pm

Re: unicode woes, need help displaying text from a file

Post by koonkii »

hmm, would you mind being a bit more descriptive and say what they are?