The author says that viewing the page source will show the &uuml; entity that was used to preserve the special character, but my source shows: [text]Einstürzende Neubauten[/text]
The source file preserves the umlaut on the two text editors I've viewed it in so it is being saved to the file correctly. Can anyone explain this?
I'm running Firefox 8.0 and PHP 5.3 on Ubuntu 11.04.
I found an option in the save dialog to save the file as '(Western) ISO 8859-1' and now it works, so it was the file encoding causing the issue. Odd that it displayed correctly in the editor each time and not in the browser. I don't consider this solved until I know why this was an issue.
Is it because the PHP engine didn't know how to read the file? Or is it because the browser wasn't being sent the correct info for the charset it was expecting?
htmlentities() only encodes what is passed to it, based on the character encoding it assumes. If you don't specify an encoding in the call or in php.ini, PHP falls back to a default. Your problem is probably in how the text is stored on the file system when PHP reads it and then passes it to htmlentities(). Just from looking at your example, I would guess your test string (which I'm assuming was in a file) was stored as UTF-8, and the ü was garbled because its two UTF-8 bytes were treated as two separate single-byte characters.
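To make that concrete, here is a minimal sketch of what happens when the charset given to htmlentities() does or doesn't match the bytes in the string. The variable names are made up for illustration, and the ü is written as explicit byte escapes so the result doesn't depend on how the script file itself is saved.

```php
<?php
// "Einstürzende" with the ü spelled out as raw bytes.
$utf8   = "Einst\xC3\xBCrzende"; // ü as two UTF-8 bytes
$latin1 = "Einst\xFCrzende";     // ü as one ISO-8859-1 byte

// Charset matches the bytes: ü becomes &uuml; in both cases.
$ok_utf8   = htmlentities($utf8, ENT_QUOTES, 'UTF-8');
$ok_latin1 = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');

// Charset does not match: each UTF-8 byte is read as its own
// ISO-8859-1 character (0xC3 is Ã, 0xBC is ¼).
$mismatch = htmlentities($utf8, ENT_QUOTES, 'ISO-8859-1');

echo $ok_utf8, "\n";   // Einst&uuml;rzende
echo $ok_latin1, "\n"; // Einst&uuml;rzende
echo $mismatch, "\n";  // Einst&Atilde;&frac14;rzende
```

Passing the third argument explicitly means you don't depend on whatever default your PHP version uses.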
One other thing about PHP encoding from the manual that might help if your string is in the script file itself:
Given that PHP does not dictate a specific encoding for strings, one might wonder how string literals are encoded. For instance, is the string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8, C form), "\x61\xCC\x81" (UTF-8, D form) or any other possible representation? The answer is that string will be encoded in whatever fashion it is encoded in the script file. Thus, if the script is written in ISO-8859-1, the string will be encoded in ISO-8859-1 and so on. However, this does not apply if Zend Multibyte is enabled; in that case, the script may be written in an arbitrary encoding (which is explicitly declared or is detected) and then converted to a certain internal encoding, which is then the encoding that will be used for the string literals. Note that there are some constraints on the encoding of the script (or on the internal encoding, should Zend Multibyte be enabled) – this almost always means that this encoding should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1. Note, however, that state-dependent encodings where the same byte values can be used in initial and non-initial shift states may be problematic.
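If it helps, the three representations of "á" from that passage can be dumped as bytes. This is only a small sketch (the $forms array and its labels are made up here), but it shows that PHP just sees byte sequences of different lengths.

```php
<?php
// The three byte sequences for "á" named in the quoted passage.
$forms = array(
    'ISO-8859-1'  => "\xE1",
    'UTF-8 (NFC)' => "\xC3\xA1",
    'UTF-8 (NFD)' => "\x61\xCC\x81",
);

foreach ($forms as $label => $bytes) {
    // strlen() counts bytes, not characters; bin2hex() shows them.
    printf("%-12s %d byte(s): %s\n", $label, strlen($bytes), bin2hex($bytes));
}
```

Running bin2hex() on a literal copied from your own script is a quick way to see which of these your editor actually saved.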
Actually, the only file was the source script itself, which began to behave as expected when I explicitly set the encoding to ISO 8859-1. So although it displayed correctly in the editor (which knew how the file was encoded), the PHP engine didn't know the encoding, so it stumbled on that special character.
From your second post, it seems that the default decoding for the function is ISO 8859-1 which wasn't compatible with how I was saving the file.
I used the mb_detect_encoding function you've linked to on the old version of the file (no encoding) and the new one (ISO-8859-1) to see how both strings were being encoded. Interestingly, they were both encoded as UTF-8. I had expected a difference.
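mb_detect_encoding() only guesses from a list of candidates, and what it reports can depend on the PHP version and on its strict flag, so it may not be the most reliable test here. A more direct check, sketched below, is to ask whether the bytes are valid in a given encoding with mb_check_encoding(). This assumes the mbstring extension is loaded, and the two test strings are written as byte escapes rather than read from the actual files.

```php
<?php
// Assumption: the mbstring extension is available.
$utf8   = "Einst\xC3\xBCrzende"; // ü saved as UTF-8
$latin1 = "Einst\xFCrzende";     // ü saved as ISO-8859-1

$utf8_valid   = mb_check_encoding($utf8, 'UTF-8');   // true
$latin1_valid = mb_check_encoding($latin1, 'UTF-8'); // false: a lone 0xFC is not valid UTF-8

// Every byte value is a valid ISO-8859-1 character, so checking
// against ISO-8859-1 cannot tell the two strings apart.
$utf8_as_latin1 = mb_check_encoding($utf8, 'ISO-8859-1'); // true

var_dump($utf8_valid, $latin1_valid, $utf8_as_latin1);
```

So a string that fails the UTF-8 check is definitely not UTF-8, while passing the ISO-8859-1 check tells you nothing.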
I don't understand the last sentence of the paragraph you quote, but from what you've posted I understand what the problem was, or at least how to resolve it.