Processing utf-8 string

giomach
Forum Newbie
Posts: 18
Joined: Wed Jun 29, 2011 6:52 pm

Processing utf-8 string

Post by giomach »

I'd like to ask a general question about the efficiency of processing a utf-8 string, where each character in the string has to be processed in turn.

Am I right in assuming that
mb_substr($s, $i, 1, 'UTF-8')
can only get to the $ith character by working through from the start of $s?

Whereas
mb_substr($s, $i, 1, 'UTF-16')
will calculate where the bits for the $ith character are, and go directly to them?

If so, then it would seem better to make a once-off mb_convert_encoding of the string from utf-8 to utf-16, before repeatedly examining individual characters?

Is that reasoning valid, or am I missing something?

Thanks for advice.
maxx99
Forum Contributor
Posts: 142
Joined: Mon Nov 21, 2011 3:40 am

Re: Processing utf-8 string

Post by maxx99 »

Why would mb_substr work any differently (iterate vs. jump) for UTF-8 and UTF-16, when the only difference is the number of bytes per character?
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: Processing utf-8 string

Post by Weirdan »

giomach, UTF-16 is a variable-length encoding (just like UTF-8), so it would have to scan the string just the same. You might have confused it with UCS-2, which is a 2-byte fixed-width encoding.
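
To see the variable width for yourself, here is a minimal sketch, assuming PHP 7 or later with the mbstring extension loaded. The variable names are just for illustration. It encodes one character from the Basic Multilingual Plane and one from outside it, then prints their byte lengths in UTF-16.

```php
<?php
// U+00E9 (e with acute accent) lies inside the Basic Multilingual Plane.
$bmp = mb_convert_encoding("\u{00E9}", 'UTF-16BE', 'UTF-8');

// U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
$astral = mb_convert_encoding("\u{1F600}", 'UTF-16BE', 'UTF-8');

echo strlen($bmp), "\n";    // 2 bytes: one 16-bit code unit
echo strlen($astral), "\n"; // 4 bytes: two 16-bit code units
```

Because one character can take either 2 or 4 bytes, the byte offset of the $ith character cannot be computed by multiplication alone.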
giomach
Forum Newbie
Posts: 18
Joined: Wed Jun 29, 2011 6:52 pm

Re: Processing utf-8 string

Post by giomach »

That was exactly my mistake, Weirdan, thinking utf-16 was fixed-length.

So my question should be: is it worthwhile using mb_convert_encoding to do a once-off conversion of a string from utf-8 to utf-32 (where every character is 32 bits), before repeatedly accessing individual characters of the string using mb_substr?
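
A minimal sketch of that idea, assuming PHP 7 or later with mbstring. The helper name charAt and the sample string are hypothetical. Rather than relying on how mb_substr treats UTF-32 internally, it uses plain substr with a byte offset of $i * 4, which is a direct jump by construction.

```php
<?php
// Sample text written with escapes so the source file encoding does not matter.
$utf8 = "na\u{00EF}ve caf\u{00E9}";

// One-off conversion; UTF-32BE is used so that no byte order mark is added.
$utf32 = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');

// Every character occupies exactly 4 bytes, so the count is a division.
$count = intdiv(strlen($utf32), 4);

// Hypothetical helper: fetch the $i-th character and return it as UTF-8.
function charAt(string $utf32, int $i): string
{
    return mb_convert_encoding(substr($utf32, $i * 4, 4), 'UTF-8', 'UTF-32BE');
}

echo $count, "\n";            // 10
echo charAt($utf32, 2), "\n"; // the i with diaeresis from "naive"
```

The trade-off is memory: the UTF-32 copy takes 4 bytes per character, so mostly-ASCII text grows to roughly four times its UTF-8 size.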
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: Processing utf-8 string

Post by Weirdan »

I'd imagine that would be beneficial under some circumstances, especially when the string you have is large and is accessed very often. I'd advise you to benchmark it yourself, though.
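
A rough sketch of such a benchmark, assuming PHP 7 or later with mbstring. The function names walkUtf8 and walkUtf32 and the sample text are made up for illustration. Both functions count occurrences of the letter "e" by visiting every character in turn, so their results can be checked against each other. The timings will vary with PHP version and hardware, so treat them as indicative only.

```php
<?php
// Hypothetical helper: per-character access with mb_substr on the UTF-8 string.
function walkUtf8(string $s): int
{
    $n = mb_strlen($s, 'UTF-8');
    $count = 0;
    for ($i = 0; $i < $n; $i++) {
        if (mb_substr($s, $i, 1, 'UTF-8') === 'e') {
            $count++;
        }
    }
    return $count;
}

// Hypothetical helper: convert once to UTF-32BE, then step 4 bytes at a time.
function walkUtf32(string $s): int
{
    $s32 = mb_convert_encoding($s, 'UTF-32BE', 'UTF-8');
    $target = mb_convert_encoding('e', 'UTF-32BE', 'UTF-8');
    $count = 0;
    for ($off = 0, $len = strlen($s32); $off < $len; $off += 4) {
        if (substr($s32, $off, 4) === $target) {
            $count++;
        }
    }
    return $count;
}

// Small sample; increase the repeat count to approximate a real workload.
$sample = str_repeat("na\u{00EF}ve caf\u{00E9} ", 500);

foreach (['walkUtf8', 'walkUtf32'] as $fn) {
    $start = microtime(true);
    $result = $fn($sample);
    printf("%s: %d matches in %.4f s\n", $fn, $result, microtime(true) - $start);
}
```

The conversion cost is included in the walkUtf32 timing on purpose, since the question is whether the one-off conversion pays for itself.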

What kind of task would require such processing?
giomach
Forum Newbie
Posts: 18
Joined: Wed Jun 29, 2011 6:52 pm

Re: Processing utf-8 string

Post by giomach »

Weirdan wrote: What kind of task would require such processing?

For example, searching a UTF-8 text file of perhaps 1 MB for occurrences of a particular word.
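
For that particular task, per-character access may not be needed at all. UTF-8 is self-synchronizing, so a byte-wise search for a valid UTF-8 word can only match at character boundaries. Below is a minimal sketch, assuming PHP 7 or later with mbstring, and assuming the text and the word are both valid UTF-8 in the same normalization form. The function name findWordOffsets is hypothetical, and the search matches substrings without checking word boundaries.

```php
<?php
// Hypothetical helper: return the byte offsets of every occurrence of $word.
function findWordOffsets(string $text, string $word): array
{
    if ($word === '') {
        return []; // an empty needle would otherwise never advance
    }
    $offsets = [];
    $pos = 0;
    while (($pos = strpos($text, $word, $pos)) !== false) {
        $offsets[] = $pos;
        $pos += strlen($word); // continue after this match
    }
    return $offsets;
}

$text = "caf\u{00E9} au lait, caf\u{00E9} noir";
$hits = findWordOffsets($text, "caf\u{00E9}");
print_r($hits); // byte offsets 0 and 15

// Convert a byte offset to a character offset only when it is needed.
$charOffset = mb_strlen(substr($text, 0, $hits[1]), 'UTF-8');
echo $charOffset, "\n"; // 14
```

The byte offsets differ from character offsets wherever multi-byte characters come before the match, which is why the last step counts characters in the prefix.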