Unicode Disk Reading

Posted: Thu Oct 18, 2007 5:23 am
by mafro
Hey all,

I have a problem with reading from the local disk in a PHP jukebox app I've been working on for a couple of months. I'm posting on this forum in the hope that some reader has encountered this issue before and can resolve it once and for all! Searching the web only returned one answer, and it was not a desirable one: wait for PHP 6.

This issue only applies to Linux (Debian etch) and OS X, running PHP 5.2.4. The issue does not appear on Windows.

I'm reading a local file structure which contains MP3s and then displaying them in the browser, where the user can click each track to add it to a playlist. There's a Java MP3 player, part of the app, which runs on the same server to handle the audio.

The problem is that directories/files which include non-ASCII characters aren't read correctly by PHP. Here is an example. The two paths below may look the same; if you copy-paste them into a text editor you will see the difference. The second line denotes the Unicode decimal values for the string, and the third line is built from HTML entities using those values. This lets your browser display a representation of what PHP reads from my disk.

/mp3/Björk/
66 106 246 114 107

Code: Select all

B j ö r k
Is read as:

/mp3/Björk/

Code: Select all

B j o ̈ r k
It was my understanding that PHP would directly read binary data off the disk, which I could then interpret as whichever charset I desire. I believe OS X and Debian use UTF-8 as their underlying charset. Converting this using mbstring in PHP doesn't work, and I am at a loss as to how this can be dealt with correctly.
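
For anyone who wants to reproduce the difference: going by the code points above, the two names look like the same text in two Unicode normalization forms, precomposed (a single "ö", decimal 246) and decomposed ("o" followed by the combining diaeresis U+0308). Both are valid UTF-8, which would explain why a charset conversion changes nothing. Below is a minimal sketch, not a confirmed fix. It assumes the `Normalizer` class from the intl extension is installed, which is not a given on PHP 5.2.4, and `to_nfc()` is a helper name made up for this post.

```php
<?php
// Two spellings of the same visible name. Byte escapes keep this file ASCII-only.
$precomposed = "Bj\xC3\xB6rk";   // "ö" as the single code point U+00F6 (decimal 246)
$decomposed  = "Bjo\xCC\x88rk";  // "o" followed by combining diaeresis U+0308

// They render alike but are different byte strings.
var_dump($precomposed === $decomposed);   // bool(false)
echo bin2hex($precomposed), "\n";         // 426ac3b6726b
echo bin2hex($decomposed), "\n";          // 426a6fcc88726b

// Hypothetical helper: convert a name to precomposed form (NFC).
// Assumption: the intl extension provides Normalizer; without it the
// name is returned unchanged.
function to_nfc($name)
{
    if (!class_exists('Normalizer')) {
        return $name;
    }
    $nfc = Normalizer::normalize($name, Normalizer::FORM_C);
    // normalize() returns false on invalid input; fall back to the raw name.
    return ($nfc === false) ? $name : $nfc;
}

var_dump(to_nfc($decomposed) === $precomposed);
```

With intl loaded, the last comparison should come out true; without it, the helper leaves the string alone and the comparison stays false.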

I am using UTF-8 encoding throughout my application; this problem only exists when reading from the disk!

I can also provide some of my PHP test scripts should anyone like to attempt to solve this issue on their local machine.

Thanks for any/all help or suggestions.
mafro

Posted: Thu Oct 18, 2007 8:07 am
by s.dot
Are the titles being stored in a database and retrieved to show on the page?

Posted: Thu Oct 18, 2007 8:39 am
by mafro
Nope. As described, I'm reading the data from the disk.

I've covered storing UTF-8 in my database and I can insert, retrieve and display this with no trouble.

Edit: just did another quick test. I can fopen() the path in question, so it appears PHP can handle this. Therefore, I could load UTF-8 names from the database and open them. But this is not my problem: scandir() / readdir() return the incorrect characters.
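
To make the scandir() case concrete, here is a rough sketch of a listing that keeps two versions of every name: the raw bytes as scandir() returned them (for fopen() and friends) and a normalized copy for display. `list_dir_normalized()` is a name invented for this example, and the normalization step assumes the intl extension's `Normalizer` class is available; where it is not, the display name simply equals the raw name.

```php
<?php
// Hypothetical helper: list a directory, pairing each raw on-disk name
// with a display name in precomposed form (NFC).
function list_dir_normalized($dir)
{
    $entries = scandir($dir);
    if ($entries === false) {
        return array();
    }

    $result = array();
    foreach ($entries as $raw) {
        if ($raw === '.' || $raw === '..') {
            continue;
        }
        $display = $raw;
        // Assumption: intl is installed. Otherwise the raw name is shown as-is.
        if (class_exists('Normalizer')) {
            $nfc = Normalizer::normalize($raw, Normalizer::FORM_C);
            if ($nfc !== false) {
                $display = $nfc;
            }
        }
        // Keep the raw name for file access; only the display name is altered.
        $result[] = array('raw' => $raw, 'display' => $display);
    }
    return $result;
}
```

Usage would be along the lines of `foreach (list_dir_normalized('/mp3') as $e) { echo $e['display']; }`, while still passing `$e['raw']` to anything that opens the file.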

Thanks for your input.

mafro

Posted: Thu Oct 18, 2007 10:47 pm
by Kieran Huggins
I would store each file under the MD5 hash of its contents; that way there's no charset problem. Also, duplicate entries wouldn't take up extra room, and duplicate filenames could exist with different data (like different versions of the same song).
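
A rough sketch of what I mean, with made-up names (`store_by_md5()`, the store directory, and the `.mp3` suffix are all assumptions). The real title would have to live somewhere else, such as the database, keyed by the hash.

```php
<?php
// Hypothetical helper: copy a file into $storeDir under the MD5 of its
// contents and return that hash, or false on failure.
function store_by_md5($source, $storeDir)
{
    $hash = md5_file($source);
    if ($hash === false) {
        return false;
    }

    // The on-disk name is plain ASCII hex, so no charset issues arise.
    $target = $storeDir . '/' . $hash . '.mp3';

    // Identical content maps to the same target, so duplicates are stored once.
    if (!file_exists($target) && !copy($source, $target)) {
        return false;
    }
    return $hash;
}
```

Two uploads with the same bytes return the same hash and share one file; two different files with the same title get different hashes and can coexist.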