I'd like to ask a general question about the efficiency of processing a utf-8 string, where each character in the string has to be processed in turn.
Am I right in assuming that
mb_substr($s, $i, 1, 'UTF-8')
can only get to the $ith character by working through from the start of $s?
Whereas
mb_substr($s, $i, 1, 'UTF-16')
will calculate where the bits for the $ith character are, and go directly to them?
If so, then it would seem better to make a once-off mb_convert_encoding of the string from utf-8 to utf-16, before repeatedly examining individual characters?
Is that reasoning valid, or am I missing something?
Thanks for advice.
Processing utf-8 string
Re: Processing utf-8 string
Why would mb_substr work any differently (iterating vs. jumping) for UTF-8 and UTF-16, when the only difference is the number of bytes per character?
Re: Processing utf-8 string
giomach, UTF-16 is a variable-length encoding (just like UTF-8), so it would have to scan the string just the same. You might have confused it with UCS-2, which is a 2-byte fixed-width encoding.
Re: Processing utf-8 string
That was exactly my mistake, Weirdan, thinking utf-16 was fixed-length.
So my question should be: is it worthwhile using mb_convert_encoding to do a once-off conversion of a string from utf-8 to utf-32 (where every character is 32 bits), before repeatedly accessing individual characters of the string using mb_substr?
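To make the idea concrete, here is a minimal sketch of the one-off conversion, assuming the mbstring extension is loaded. It uses plain substr with a byte offset of $i * 4 rather than mb_substr, since every UTF-32 code point is exactly 4 bytes. The helper names toUtf32 and charAt are hypothetical, not part of mbstring.

```php
<?php
// Sketch only: toUtf32() and charAt() are hypothetical helper names.

// Convert once to UTF-32BE, where every code point takes exactly 4 bytes.
function toUtf32(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
}

// Fetch the $i-th code point by byte offset and return it as UTF-8.
function charAt(string $utf32, int $i): string
{
    $unit = substr($utf32, $i * 4, 4);
    return mb_convert_encoding($unit, 'UTF-8', 'UTF-32BE');
}

$s = "na\u{00EF}ve caf\u{00E9}";      // 10 code points, 12 bytes in UTF-8
$u = toUtf32($s);
$count = intdiv(strlen($u), 4);       // number of code points
$third = charAt($u, 2);               // the i with diaeresis
```

Note that the converted string takes 4 bytes per code point, so a 1 MB file of mostly ASCII text would grow to roughly 4 MB in memory.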
Re: Processing utf-8 string
I'd imagine that would be beneficial under some circumstances, especially when the string is large and is accessed very often. I'd advise you to benchmark it yourself, though.
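A rough way to run such a benchmark might look like the following. This is only a sketch: walkUtf8, walkUtf32 and timeIt are hypothetical names, the sample text is made up, and real timings will depend on the PHP version and the input.

```php
<?php
// Hypothetical micro-benchmark, not a definitive measurement.

// Visit every character with mb_substr directly on the UTF-8 string.
function walkUtf8(string $s): int
{
    $n = mb_strlen($s, 'UTF-8');
    $seen = 0;
    for ($i = 0; $i < $n; $i++) {
        $ch = mb_substr($s, $i, 1, 'UTF-8');
        $seen += strlen($ch) > 0 ? 1 : 0;
    }
    return $seen;
}

// Convert once to UTF-32BE, then visit every 4-byte unit with substr.
function walkUtf32(string $s): int
{
    $u = mb_convert_encoding($s, 'UTF-32BE', 'UTF-8');
    $n = intdiv(strlen($u), 4);
    $seen = 0;
    for ($i = 0; $i < $n; $i++) {
        $unit = substr($u, $i * 4, 4);
        $seen += strlen($unit) === 4 ? 1 : 0;
    }
    return $seen;
}

// Run $fn on $s and return [result, elapsed seconds].
function timeIt(callable $fn, string $s): array
{
    $t0 = microtime(true);
    $result = $fn($s);
    return [$result, microtime(true) - $t0];
}

$sample = str_repeat("caf\u{00E9} ", 2000); // 10,000 code points
[$a, $tA] = timeIt('walkUtf8', $sample);
[$b, $tB] = timeIt('walkUtf32', $sample);
printf("mb_substr on UTF-8: %.4fs, substr on UTF-32BE: %.4fs\n", $tA, $tB);
```

Both functions should report the same character count; only the elapsed times are of interest, and they are worth checking on strings of the size you actually expect.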
What kind of task would require such processing?
Re: Processing utf-8 string
Weirdan wrote: What kind of task would require such processing?
For example, searching a UTF-8 text file of perhaps 1 MB for occurrences of a particular word.
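For this particular task, per-character access may not be needed at all. In valid UTF-8, the byte sequence of one character never appears inside the encoding of another, so a byte-level strpos on the original string finds only real matches. The sketch below assumes both the text and the word are valid UTF-8; findWord is a hypothetical helper name, and it matches substrings rather than whole words.

```php
<?php
// Hypothetical helper: character offsets of every occurrence of $word.
function findWord(string $text, string $word): array
{
    if ($word === '') {
        return []; // avoid looping forever on an empty needle
    }
    $hits = [];
    $bytePos = 0;
    while (($bytePos = strpos($text, $word, $bytePos)) !== false) {
        // Convert the byte offset to a character offset only for actual hits.
        $hits[] = mb_strlen(substr($text, 0, $bytePos), 'UTF-8');
        $bytePos += strlen($word);
    }
    return $hits;
}

$text = "caf\u{00E9} au lait, caf\u{00E9} noir";
$hits = findWord($text, "caf\u{00E9}");
```

If only the number of occurrences is needed, substr_count($text, $word) does the same byte-level search without computing any offsets.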