Looking for papers or algorithms regarding character sets


Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Looking for papers or algorithms regarding character sets

Post by Chris Corbyn »

I have the potential to be misunderstood here, but here goes.

I want to understand *how* string libraries that respect character sets work. In particular, I'm interested in the algorithms that are involved with finding substrings and string lengths in terms of *characters* rather than bytes.

I'm not particularly interested in character set detection itself. Based on the assumption that you know what the name of the character set is, what's the most efficient way of finding each individual character in what is essentially a stream of bytes with no meaning?

Please no responses that tell me I should be using the multibyte string library etc... I'm looking to research the actual algorithms involved here.

One key thing I'm wondering about: what data structure is actually used to store a character map, and how do you *efficiently* read a stream of bytes and group those bytes into characters based on that map?

Any good articles/papers on this subject that people can point me to? :)

EDIT | I'm going to get a head start by delving into the source code of Java's character stream handling (InputStreamReader). No doubt that should provide some insight, even if it's likely to be overwhelming as I browse through each of the dependencies.
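To make the question concrete, here's a sketch (in Python, just for illustration) of the kind of grouping loop I mean, assuming UTF-8, where the lead byte of each sequence signals its length. The length table here is the bit a real library would presumably drive from a per-charset character map:

```python
# Sketch only: group a UTF-8 byte stream into characters by reading
# the sequence length from each lead byte. No validation of the
# continuation bytes is done, so this is not production code.

def utf8_groups(data: bytes):
    """Yield each character's bytes as a separate bytes object."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:             # 0xxxxxxx: 1-byte (ASCII) character
            n = 1
        elif b >> 5 == 0b110:    # 110xxxxx: lead byte of a 2-byte sequence
            n = 2
        elif b >> 4 == 0b1110:   # 1110xxxx: lead byte of a 3-byte sequence
            n = 3
        elif b >> 3 == 0b11110:  # 11110xxx: lead byte of a 4-byte sequence
            n = 4
        else:                    # 10xxxxxx here means we're mid-sequence
            raise ValueError(f"invalid lead byte at offset {i}")
        yield data[i:i + n]
        i += n

print(list(utf8_groups("aé€".encode("utf-8"))))
# one item per *character*, regardless of how many bytes each takes
```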
xdecock
Forum Commoner
Posts: 37
Joined: Tue Mar 18, 2008 8:16 am

Re: Looking for papers or algorithms regarding character sets

Post by xdecock »

The first thing to know is the different kinds of character sets and the way each one encodes its characters.

We have:
* ASCII-type character sets (1 byte -> one char) -- self-synchronizing: a lost byte loses only that one character, not the rest of the string.
* Legacy multibyte character sets (x bytes -> one char, e.g. Shift-JIS) -- NOT self-synchronizing: a lost byte can cause the rest of the string to be corrupted.
* UTF-* character sets (groups of 1-4 bytes, with the sequence length signalled in the lead byte) -- self-synchronizing: if a byte is lost, you just skip the corrupted sequence (in UTF-8, the continuation bytes, which all look like 10xxxxxx) and the rest of the string is still decodable.
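To illustrate the UTF-8 case: because every continuation byte matches 10xxxxxx, you can count characters (or resynchronize after corruption) without decoding anything, just by looking at the top two bits of each byte. A minimal sketch in Python:

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters by counting the bytes that are NOT
    continuation bytes. A continuation byte looks like 10xxxxxx,
    i.e. (b & 0xC0) == 0x80; everything else starts a character."""
    return sum(1 for b in data if b & 0xC0 != 0x80)

s = "café".encode("utf-8")
print(utf8_char_count(s), len(s))  # character count vs. byte count
```

The same test ((b & 0xC0) == 0x80) is what lets you resync: after a corrupted byte, skip forward until the test fails and you're at the start of the next character.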

If I remember correctly, we also have:
* XXX-type sets, where certain control characters change the way the following bytes are handled (quite old, probably not used much anymore). I don't know the details of how the East Asian character sets work, however.
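The "control characters change what follows" idea is the shift-state family: for example, the ASCII Shift-Out (0x0E) and Shift-In (0x0F) control bytes are used to switch between character sets mid-stream. A toy sketch of the state machine (the "alternate set" here is a made-up lowercase-to-uppercase mapping, purely to show the statefulness, not any real charset):

```python
SO, SI = 0x0E, 0x0F  # ASCII Shift-Out / Shift-In control bytes

def decode_shifted(data: bytes) -> str:
    """Toy stateful decoder: SO switches to an 'alternate' set
    (illustrated here as uppercasing), SI switches back. The point
    is that the meaning of each byte depends on bytes seen earlier,
    which is exactly why such encodings can't resynchronize."""
    shifted = False
    out = []
    for b in data:
        if b == SO:
            shifted = True
        elif b == SI:
            shifted = False
        else:
            ch = chr(b)
            out.append(ch.upper() if shifted else ch)
    return "".join(out)

print(decode_shifted(b"ab\x0ecd\x0fef"))  # prints "abCDef"
```

Notice that if the SO or SI byte is lost, every byte after it is interpreted in the wrong state, which is the worst case for the "lost byte" problem above.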