Convert encoding where output cannot represent characters

This forum is not for 'how-to' coding questions but for PHP theory; it is here for those of us who wish to learn about the design aspects of programming with PHP.

Moderator: General Moderators

Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Opcode caches only speed up parse times: the engine still has to execute that code and build the in-memory representation of the lookup table. I'm going off observations by MediaWiki developers that unserialize() is extremely fast.
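For what it's worth, the approach being alluded to (caching a generated lookup table with serialize()/unserialize() rather than as includable PHP source) might look roughly like this. The function name, cache file handling, and table contents are all made up for illustration:

```php
<?php
// Cache an expensive-to-build lookup table on disk via serialize().
// Hypothetical helper; the build step is a stand-in for real work.
function getLookupTable(string $cacheFile): array
{
    if (is_file($cacheFile)) {
        // unserialize() rebuilds the in-memory array directly,
        // with no PHP parsing or opcode execution for the data itself.
        return unserialize(file_get_contents($cacheFile));
    }
    // Pretend this is an expensive table-generation step.
    $table = ['é' => '&#233;', '€' => '&#8364;'];
    file_put_contents($cacheFile, serialize($table));
    return $table;
}
```

The second call onward reads the serialized blob instead of regenerating the table, which is the trade-off the MediaWiki observation is about.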
wei
Forum Contributor
Posts: 140
Joined: Wed Jul 12, 2006 12:18 am

Post by wei »

The usual bottlenecks are disk I/O and network delays, so a large performance gain can often be had by looking at these two problems first. Tools such as ptrace and strace are very useful for finding where the bottlenecks might be.

http://www.schlossnagle.org/~george/tal ... e%20pdf%22

There was a PDF slide deck about using ptrace a few weeks back, but I've lost the link.
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

Well, you could convert your string from the source to the target charset using the //IGNORE option, then convert back and look for differences. Every character missing from the double-converted string should be encoded as an HTML entity.

Not an elegant solution, of course :) I would prefer to have the ability to set a //TRANSLIT callback.
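A minimal sketch of this round-trip idea, with a hypothetical helper name. It assumes ext/iconv and ext/mbstring are available, that //IGNORE behaves as documented (some iconv implementations have been buggy here), and it uses the modern mb_ord() for the code point:

```php
<?php
// Convert a UTF-8 string to a target charset, replacing any character
// the target cannot represent with a numeric HTML entity.
// Hypothetical helper; walks the string one character at a time.
function convert_with_entities(string $text, string $target): string
{
    $out = '';
    $len = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i < $len; $i++) {
        $char = mb_substr($text, $i, 1, 'UTF-8');
        // Round-trip: source -> target (//IGNORE drops unmappable
        // characters) -> back to source.
        $roundTrip = iconv($target, 'UTF-8',
                           iconv('UTF-8', "$target//IGNORE", $char));
        if ($roundTrip === $char) {
            // Representable: emit it in the target charset.
            $out .= iconv('UTF-8', $target, $char);
        } else {
            // Lost in the double conversion: emit &#NNNN; instead.
            $out .= '&#' . mb_ord($char, 'UTF-8') . ';';
        }
    }
    return $out;
}
```

So, for example, converting a string containing the euro sign to ISO-8859-1 would leave the ASCII intact and turn the euro sign into its numeric entity.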
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

Those of you who think you know about performance might be able to help Astions.
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Actually, Weirdan, that's quite an interesting solution. It avoids the need to build lookup tables, although you still need to be able to parse the UTF-8 to figure out precisely what to put into the HTML entity.
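That parsing step doesn't need a lookup table either: the UTF-8 byte layout encodes the code point directly. A sketch (the helper name is mine, and it does no validation of continuation bytes or malformed input):

```php
<?php
// Decode the first UTF-8 character of $s into its Unicode code point.
// Hypothetical helper; handles 1- to 4-byte sequences, no error checking.
function utf8_codepoint(string $s): int
{
    $b0 = ord($s[0]);
    if ($b0 < 0x80) {        // 0xxxxxxx: plain ASCII
        return $b0;
    } elseif ($b0 < 0xE0) {  // 110xxxxx 10xxxxxx
        return (($b0 & 0x1F) << 6) | (ord($s[1]) & 0x3F);
    } elseif ($b0 < 0xF0) {  // 1110xxxx 10xxxxxx 10xxxxxx
        return (($b0 & 0x0F) << 12)
             | ((ord($s[1]) & 0x3F) << 6)
             |  (ord($s[2]) & 0x3F);
    }                        // 11110xxx plus three continuation bytes
    return (($b0 & 0x07) << 18)
         | ((ord($s[1]) & 0x3F) << 12)
         | ((ord($s[2]) & 0x3F) << 6)
         |  (ord($s[3]) & 0x3F);
}
```

The returned number is exactly what goes between the `&#` and the `;` of the entity.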
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I did more thinking about the technique, and it only works nicely for fixed-length encodings (especially 8-bit, ASCII-compatible ones). For everything else, you have to implement a character gobbler (I'm sure that's not the term for it) for each encoding you want to support, which is almost as bad as having to set up lookup tables.
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

Usually there's one 'main' encoding with the widest character range, with multiple input and output encodings (handled via iconv or something). So there's a need for only one 'gobbler' (still not sure what you meant by that).
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

You're right, brainfart. :-P Actually, it's a pretty elegant solution. I think I'll do that.