Multi language rolodex style indexing

Posted: Mon Oct 15, 2007 4:12 pm
by alex.barylski
I have a rolodex-style table of contents. It's basically A-Z, generated by iterating from 65 up to (but not including) 65+26 and converting each value to a letter via chr().

Code: Select all

// One <li> per letter: ASCII codes 65..90 map to 'A'..'Z' via chr().
for ($i = 65; $i < 65 + 26; $i++) {
  echo '<li style="text-align: center; margin: 1px; background-color: #eeaaee; float: left; width: '.(int)(100/26).'%"><a href="">'.chr($i).'</a></li>';
}
How do I ensure that if the user sets the language to Chinese the quick index is still properly generated? Does PHP handle this automatically if its locale is set to "chinese" or other? I imagine I would have to remove the hardcoded values of 26 and 65. I assume many languages have their own virtual keycodes?

Posted: Mon Oct 15, 2007 4:21 pm
by Kieran Huggins
Ugh... Chinese especially would be a problem. There's no "alphabet" like there is in Western languages. Also, there are several different versions of their characters.

As for non-Roman alphabets (Greek, Cyrillic) I would suggest grabbing the first characters of each existing listing and merging them with a default alphabet.
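A minimal sketch of that merge, assuming the listing names are UTF-8 strings and the mbstring extension is available. The function name `build_index_letters` and the sample names are hypothetical, and the result is left in encounter order because letter order is a separate problem:

```php
<?php
// Hypothetical helper: start from a default alphabet and add the first
// character of every listing that is not already in it.
// Assumes UTF-8 input and the mbstring extension.
function build_index_letters(array $names, array $defaultAlphabet): array
{
    $letters = $defaultAlphabet;
    foreach ($names as $name) {
        $name = trim($name);
        if ($name === '') {
            continue; // nothing to index
        }
        // First character (not first byte), upper-cased so that
        // "alice" and "Alice" land under the same letter.
        $letters[] = mb_strtoupper(mb_substr($name, 0, 1, 'UTF-8'), 'UTF-8');
    }
    // Drop duplicates; array_values() reindexes the result from 0.
    return array_values(array_unique($letters));
}

// Sample data, made up for illustration.
$letters = build_index_letters(['alice', 'Ωmega', 'Борис', 'bob'], range('A', 'Z'));
```

Here `$letters` holds A-Z followed by the two extra initials found in the data. Sorting them sensibly would need a locale-aware collation, which this sketch does not attempt.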

Letter order is another issue altogether... should be one hell of an interesting can of worms!

Also, unless you're storing all your text in lowercase, your above method won't match many names. Case should be transparent.

Posted: Mon Oct 15, 2007 4:32 pm
by alex.barylski
Hmmm...I was thinking maybe just store the table of characters in the language tables instead of relying on the internal chr().

Code: Select all

VKEY_0 = "A"
VKEY_1 = "B"
...
Now I can keep the list length arbitrary (26 for English, X number for any others, etc) by just iterating the array of language codes until no more VKEY_xxx codes are found, and I can use the number of iterations for calculating the percentage width required to fill the screen width...
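Roughly, that could look like the sketch below. The `$lang` array stands in for whatever the language tables actually return, and `index_keys` is a made-up name; only the VKEY_n naming comes from the idea above:

```php
<?php
// Hypothetical language table; in practice this would be loaded from the
// language file for the active locale.
$lang = [
    'VKEY_0' => 'A',
    'VKEY_1' => 'B',
    'VKEY_2' => 'C',
];

// Collect VKEY_0, VKEY_1, ... and stop at the first missing key.
function index_keys(array $lang): array
{
    $keys = [];
    for ($i = 0; isset($lang["VKEY_$i"]); $i++) {
        $keys[] = $lang["VKEY_$i"];
    }
    return $keys;
}

$keys = index_keys($lang);

// Same width rule as the chr() loop, but driven by the table size.
$width = count($keys) > 0 ? (int) (100 / count($keys)) : 100;

foreach ($keys as $key) {
    echo '<li style="float: left; width: ' . $width . '%"><a href="">'
        . htmlspecialchars($key, ENT_QUOTES, 'UTF-8') . '</a></li>' . "\n";
}
```

With the three sample entries the width works out to 33%. A table with 26 entries gives the same 3% as the hardcoded version.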

Not sure if that would work though, as all the index is doing is passing the single character to the script to filter results based on first character...

Code: Select all

index.php?list=2&vkey=A
Would that still work for Chinese when querying the database?
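On the URL side at least, a multi-byte key can be carried as long as it is percent-encoded. A small sketch, assuming the page and the source file are UTF-8 (the `index_link` helper is hypothetical); whether the database query then matches depends on the connection charset and column collation, which this does not cover:

```php
<?php
// Build the index link for one key. http_build_query() percent-encodes the
// raw bytes of each value, so a multi-byte key survives the trip in the URL.
function index_link(int $list, string $vkey): string
{
    return 'index.php?' . http_build_query(['list' => $list, 'vkey' => $vkey], '', '&');
}

$ascii   = index_link(2, 'A');   // index.php?list=2&vkey=A
$chinese = index_link(2, '猫');  // index.php?list=2&vkey=%E7%8C%AB
```

On the receiving end PHP decodes the value before it reaches `$_GET['vkey']`, so the script sees the original UTF-8 bytes. When the link is printed inside HTML, run it through htmlspecialchars() so the `&` is escaped.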

Times like this I wish I knew more about other languages. :?

Posted: Mon Oct 15, 2007 8:01 pm
by shannah
A lot of your problems can be solved if you use MySQL to manage the data. Version 4.1 and higher allows you to specify an encoding and a collation for each table/field. The collation deals with the order of data (i.e. which letters come first), and the encoding deals with the actual characters that can be supported.

Another issue you'll be facing is that non-Latin languages (e.g. Chinese) need more than one byte to store each character. Unicode (UTF-8) is probably the best encoding to work with, but PHP's core string functions work on bytes and are not multi-byte aware. You can still pass multi-byte encoded strings around in variables and print them to the screen, and everything will work nicely. However, if you call a function like substr() on such a string, PHP will give some unexpected results.

For example, suppose the variable $str contains a Chinese string.
Then substr($str,0,1) will return only the first byte of that string (which is just a fragment of the first character), so the results would be bizarre.

There is a solution. You need to have either the iconv or mbstring extensions installed in PHP, which come with versions of the PHP string functions that work on multi-byte strings.
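A short sketch of the difference, assuming the file is saved as UTF-8 and mbstring is loaded. The byte count depends on the encoding; in UTF-8 the three characters below take three bytes each:

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$str = '猫头鹰';

$bytes = strlen($str);               // counts bytes
$chars = mb_strlen($str, 'UTF-8');   // counts characters

// Byte-oriented: returns one byte, which is not valid UTF-8 on its own.
$firstByte = substr($str, 0, 1);

// Multi-byte aware: returns the whole first character.
$firstChar = mb_substr($str, 0, 1, 'UTF-8');
```

Here `$bytes` is 9 and `$chars` is 3. iconv_substr() and iconv_strlen() offer the same kind of character-based behaviour if iconv is the extension you have.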

In short, you've opened a can of worms as soon as you want to get into multi-byte languages like Chinese... but with some careful design you can still get what you want.

The practice of building sites that work in multiple languages is called internationalization, usually abbreviated i18n. You can search on Google and you'll find some information.

I have blogged about some of the issues with multilanguage sites at http://www.phpi18n.com/.

Best regards
steve

Posted: Tue Oct 16, 2007 1:36 am
by alex.barylski
I don't really want to rely on MySQL to return the appropriate symbols/characters.

I think my idea of a look up table makes the most sense across the board - and it avoids using chr()

My problem now is that, to my understanding, the Chinese language has hundreds and hundreds of symbols which are not characters like in English, but pronounceable symbols.

How the hell that translates into designing a rolodex system is beyond me. Hahaha.

It's easy enough to disable the rolodex for unsupported languages I guess, but I'm wondering if a rolodex would be possible for Chinese, etc.

Is an email address (in Chinese) simply a series of symbols which, when pronounced, form a word of sorts?

Edit: If Chinese consists solely of, say, 2000 symbols and those symbols are used for communicating EVERYTHING... I'm so confused as to how that is possible. In English there are literally an infinite number of character combinations which form words. But in Chinese there is a fixed number of symbols which are analogous to words???

p.s-If anyone can answer my P.S I'd appreciate it. :)

Cheers :)

Posted: Tue Oct 16, 2007 3:52 am
by Kieran Huggins
From what I understand (e.g. my last girlfriend was Chinese):

Chinese uses a base of a few thousand characters. Each one has a simple meaning and is, for all intents and purposes, a word. Many Chinese words are expressed as a combination of these base symbols. As an example, "owl" is "猫头鹰", which essentially means "cat head bird"*, or "bird with the head of a cat". Most words in Chinese follow a similar pattern, as compounds of simpler concepts.

With that in mind, the reason a rolodex doesn't work so well for languages like this becomes obvious.

Email addresses, being based on the Latin charset (roman characters only) are inherently not compatible with either Traditional or Simplified Chinese.

More Chinese facts: the written languages (there are several variants), while mapping to the spoken language in one sense, are not phonetic. The two major spoken varieties (Mandarin and Cantonese) are usually described as dialects of the same language, but the pronunciation differences are severe enough that you can speak one and not the other. Japanese is a separate language, though its writing system borrows many Chinese characters (kanji). Add on to that several other dialects (Taiwanese, Shanghainese, etc.) and you start to appreciate the complexity to an even greater degree.

Chinese, when written phonetically using Roman characters, is a very poor representation of the spoken language. Wang, Wong, and Huang are all the same Chinese name, simply interpreted differently, as there is no real match in the western sound set. The same is also true for the ch/zh/x variants of many other words.

*Incidentally, read character by character, the Chinese word for "owl" above is 猫 (cat), 头 (tóu, head), 鹰 (eagle or hawk), so "cat-headed eagle" is closer than "cat head bird". Note that 头 (tóu) is not Tao (道, "way" or "path"), which is a different character. This gap between a character-by-character reading and the natural translation is yet another barrier in simple translation.

You could fish an ocean dry with that can of worms!

Posted: Tue Oct 16, 2007 2:41 pm
by alex.barylski
Hey Kieran,

Thanks for that. That was interesting. Language and NLP have always been an interest to me (although admittedly I've studied mostly English in regard to NLP).

One thing you said really caught my interest:
Email addresses, being based on the Latin charset (roman characters only) are inherently not compatible with either Traditional or Simplified Chinese
Does this mean a rolodex would work for email addresses, as they are all essentially forced into using Latin character sets?

Posted: Tue Oct 16, 2007 6:05 pm
by Kieran Huggins
It sure would! Why didn't I think of that? *smacks head*

All URI based information (domain names, email addresses, etc...) must follow the URI spec, which allows ASCII characters only. ICANN is testing the extension to this restriction now, but only for characters like é and ü. They claim to have no plans to extend it any further either now or at any point in the future. Bad move, IMO, since it's inevitable and will only serve to fragment organization/authority of domains.