Unicode Disk Reading


Post by mafro »

Hey all,

I have a problem with reading from the local disk in a PHP jukebox app I've been working on for a couple of months. I'm posting on this forum in the hope that some reader has encountered this issue before and can resolve it once and for all! Searching the web returned only one answer, and not a desirable one: wait for PHP 6.

This issue only applies to Linux (Debian etch) and OS X, running PHP 5.2.4. The issue does not appear on Windows.

I'm reading a local file structure which contains MP3s and then displaying them in the browser, where the user can click each track to add it to a playlist. A Java MP3 player, part of the app, runs on the same server to handle the audio.

The problem is that directories and files whose names include non-ASCII characters aren't read correctly by PHP. Here is an example. The two paths may look the same, but if you copy and paste them into a text editor you will see the difference. The line of numbers gives the Unicode code points in decimal, and the spaced-out line is built from HTML entities using those values, so your browser can display a representation of what PHP reads from my disk.

/mp3/Björk/
66 106 246 114 107

B j ö r k
Is read as:

/mp3/Björk/

B j o ̈ r k
It was my understanding that PHP reads raw bytes off the disk, which I could then interpret in whichever charset I like. I believe OS X and Debian both use UTF-8 as their underlying charset. Converting the names with PHP's mbstring functions doesn't work, and I am at a loss as to how to deal with this correctly.
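For readers comparing the two strings above: the first uses a single precomposed ö (decimal 246), while the second uses a plain o followed by a combining diaeresis. Unicode calls these the composed (NFC) and decomposed (NFD) forms of the same text. Both are valid UTF-8, which would explain why a charset conversion changes nothing. A minimal sketch of recomposing such a name, assuming the intl extension (which provides the Normalizer class) is available on the PHP build in use:

```php
<?php
// Sketch only. Normalizer comes from the intl extension, which is an
// assumption here: it may not be installed on a given PHP build.

// "Björk" with a precomposed ö (U+00F6, decimal 246), as raw UTF-8 bytes.
$composed = "Bj\xC3\xB6rk";
// "Björk" as a plain "o" plus a combining diaeresis (U+0308, decimal 776).
$decomposed = "Bjo\xCC\x88rk";

// The strings render alike but are different byte sequences.
var_dump($composed === $decomposed);

if (class_exists('Normalizer')) {
    // Recompose to NFC so the name matches text typed or stored elsewhere.
    $normalized = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    var_dump($normalized === $composed);
}
```

Whether this matches the actual cause on a given machine depends on how the files were named when they were written to disk, so treat it as something to test rather than a confirmed diagnosis.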

I am using UTF-8 encoding throughout my application; this problem only exists when reading from the disk!

I can also provide some of my PHP test scripts should anyone like to attempt to solve this issue on their local machine.

Thanks for any/all help or suggestions.
mafro

Post by s.dot »

Are the titles being stored in a database and retrieved to show on the page?

Post by mafro »

Nope. As described, I'm reading the data from the disk.

I've covered storing UTF-8 in my database, and I can insert, retrieve and display this with no trouble.

Edit: I just did another quick test. I can fopen() the path in question, so it appears PHP can handle this. I could therefore load UTF-8 names from the database and open them, but that is not my problem: scandir() and readdir() return the incorrect characters.
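One way to work with what scandir() returns is to keep two versions of each name: the raw bytes for opening the file, and a normalized copy for display and for comparing against database values. The sketch below is only an illustration of that idea. The function name list_normalized() and the 'display'/'raw' keys are made up for this example, and it assumes UTF-8 file names and, optionally, the intl extension.

```php
<?php
// Hypothetical helper: list a directory, pairing each raw on-disk name
// with an NFC-normalized copy that is safer to display and compare.
function list_normalized($dir)
{
    $entries = array();
    $names = scandir($dir);
    if ($names === false) {
        return $entries; // directory could not be read
    }
    foreach ($names as $raw) {
        if ($raw === '.' || $raw === '..') {
            continue;
        }
        $display = $raw;
        if (class_exists('Normalizer')) {
            $nfc = Normalizer::normalize($raw, Normalizer::FORM_C);
            // normalize() returns false on invalid input; keep the raw name then.
            if ($nfc !== false) {
                $display = $nfc;
            }
        }
        // Keep the raw bytes: on Linux the file must be opened by its exact name.
        $entries[] = array('display' => $display, 'raw' => $raw);
    }
    return $entries;
}
```

The 'raw' value is what would go back into fopen(), and the 'display' value is what would be sent to the browser or matched against names stored elsewhere.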

Thanks for your input.

mafro

Post by Kieran Huggins »

I would store each file under the MD5 hash of its own contents; that way there's no charset problem. Also, duplicate entries wouldn't take up extra room, and duplicate filenames could exist with different data (like different versions of the same song).
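A rough sketch of that idea, assuming the original UTF-8 title is kept in the database alongside the hash. The function name store_by_hash() and the ".mp3" suffix are made up for illustration:

```php
<?php
// Hypothetical helper: copy a track into $storeDir, named by the MD5 of
// its contents. Returns the hash, or false if hashing or copying fails.
function store_by_hash($sourcePath, $storeDir)
{
    $hash = md5_file($sourcePath);
    if ($hash === false) {
        return false;
    }
    // The stored name is plain ASCII hex, so no charset issue can arise.
    $target = rtrim($storeDir, '/') . '/' . $hash . '.mp3';
    // Identical content hashes to the same name, so duplicates are skipped.
    if (!file_exists($target) && !copy($sourcePath, $target)) {
        return false;
    }
    return $hash;
}
```

The returned hash would be saved next to the human-readable name, so the browser shows the title from the database while the player opens the hashed file.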