I want to understand *how* string libraries that respect character sets work. In particular, I'm interested in the algorithms involved in finding substrings and computing string lengths in terms of *characters* rather than bytes.
I'm not particularly interested in character set detection itself. Assuming you already know the name of the character set, what's the most efficient way to find each individual character in what is otherwise just a meaningless stream of bytes?
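To make the question concrete, here is a minimal sketch of the kind of thing I have in mind. It assumes the encoding is known to be UTF-8, that the input is well-formed, and that "character" means code point. The function names are my own and hypothetical, not taken from any library. In UTF-8, continuation bytes always match the bit pattern `10xxxxxx`, so every other byte starts a new character:

```python
def utf8_char_count(data: bytes) -> int:
    """Count code points in well-formed UTF-8 without decoding them."""
    # Any byte that is NOT a continuation byte (10xxxxxx) starts a character.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_char_offsets(data: bytes) -> list[int]:
    """Return the byte offset at which each code point starts."""
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


if __name__ == "__main__":
    sample = "a\u00e9\u20ac".encode("utf-8")  # 1-byte, 2-byte and 3-byte characters
    print(len(sample), utf8_char_count(sample), utf8_char_offsets(sample))
```

This does no validation at all, and it only works because UTF-8 is self-synchronizing. I suspect other multi-byte character sets need something more involved, which is really what I'm asking about.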
Please, no responses telling me I should just use a multibyte string library, etc. I'm looking to research the actual algorithms involved.
One key thing I'm wondering about is the actual data structure used to store a character map, and how you *efficiently* read a stream of bytes and group those bytes into characters based on that map.
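As a guess at what such a structure might look like, here is a rough sketch of a 256-entry table indexed by lead byte that gives the length of the sequence that byte starts. The Shift_JIS lead-byte ranges used here (0x81 to 0x9F and 0xE0 to 0xFC start a two-byte character, everything else is a single byte) are a simplification on my part, and `build_length_table` and `split_chars` are hypothetical names. It does not validate trail bytes or map sequences to code points:

```python
def build_length_table(two_byte_lead_ranges):
    """Build a 256-entry table: lead byte -> number of bytes in the character."""
    table = [1] * 256  # default: single-byte character
    for lo, hi in two_byte_lead_ranges:
        for b in range(lo, hi + 1):
            table[b] = 2
    return table


# Assumption: simplified Shift_JIS lead-byte ranges, trail bytes are not checked.
SJIS_LENGTHS = build_length_table([(0x81, 0x9F), (0xE0, 0xFC)])


def split_chars(data: bytes, lengths) -> list[bytes]:
    """Group a byte string into per-character byte sequences using the table."""
    chars = []
    i = 0
    while i < len(data):
        n = lengths[data[i]]  # one table lookup decides how many bytes to take
        if i + n > len(data):
            raise ValueError(f"truncated sequence at byte offset {i}")
        chars.append(data[i:i + n])
        i += n
    return chars


if __name__ == "__main__":
    # 'a', HIRAGANA LETTER A (0x82 0xA0 in Shift_JIS), 'b'
    print(split_chars(b"a\x82\xa0b", SJIS_LENGTHS))
```

With a table like this, the character length of a string is just the number of groups, but every lookup still requires scanning from the start. What I don't know is whether real libraries use a flat table like this, a state machine, or something else entirely.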
Any good articles/papers on this subject that people can point me to?
EDIT | I'm going to get off to a head start by delving into the source code of Java's character stream handling (InputStreamReader). No doubt that should provide some insight, even if it's likely to overwhelm as I browse through each of the dependencies.