Word counting for Japanese characters?

requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: Word counting for Japanese characters?

Post by requinix »

That's going to be hard. Written Japanese leans on meaning a lot more than ours does: a sequence of characters doesn't necessarily represent a "word".

Unless you redefine what a word is...
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Word counting for Japanese characters?

Post by Eran »

This is a complicated subject since, as you said, Japanese doesn't use spaces as separators. There are other separators, though. Check out this discussion, from a forum for a PHP web-spidering tool, about tokenizing Japanese text into words:
http://www.phpdig.net/forum/archive/ind ... t-355.html
The other approach would be to brute-force it using a dictionary.
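
To make the dictionary idea concrete, here is a rough sketch of greedy longest-match segmentation. Everything in it is an assumption for illustration: the function name segment_japanese() is made up, the four-entry word list stands in for a real lexicon, and the text is assumed to be UTF-8 with the mbstring extension available. Characters that match nothing in the dictionary are counted as one-character words, which will overcount on real text.

```php
<?php
// Hypothetical helper: greedy longest-match segmentation against a word list.
// $dictionary is an illustrative array of known words, not a real lexicon.
function segment_japanese(string $text, array $dictionary): array
{
    // Index the dictionary for fast lookups and find its longest entry.
    $lookup = array_fill_keys($dictionary, true);
    $maxLen = 1;
    foreach ($dictionary as $word) {
        $maxLen = max($maxLen, mb_strlen($word, 'UTF-8'));
    }

    $words  = [];
    $pos    = 0;
    $length = mb_strlen($text, 'UTF-8');

    while ($pos < $length) {
        // Try the longest candidate first, then shrink towards one character.
        $take = min($maxLen, $length - $pos);
        for (; $take > 1; $take--) {
            $candidate = mb_substr($text, $pos, $take, 'UTF-8');
            if (isset($lookup[$candidate])) {
                break;
            }
        }
        // If nothing matched, $take is 1 and a single character is emitted.
        $words[] = mb_substr($text, $pos, $take, 'UTF-8');
        $pos += $take;
    }

    return $words;
}

// Toy dictionary, for illustration only.
$dictionary = ['私', 'は', '学生', 'です'];
$words = segment_japanese('私は学生です', $dictionary);
echo count($words), "\n"; // prints 4
```

Greedy matching picks the longest entry at each position, so it can split ambiguous text the wrong way. A morphological analyzer would do better, but that goes beyond a brute-force dictionary.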
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: Word counting for Japanese characters?

Post by Weirdan »

gromlok wrote: The problem is that Japanese does not have a word delimiter like the Western alphabet does, that is, the white space.
Then how do you define 'word' in Japanese?
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Word counting for Japanese characters?

Post by Eran »

The same way you do in English; they just use spaces very sparingly in sentences. IfIwrotethesameinEnglish, you would still be able to read it, since you know the words. Sometimes they put half-spaces (very small) between them. It might seem confusing, but it gets less so once you get used to it.
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Word counting for Japanese characters?

Post by Eran »

"First I must standardize the encoding of Japanese"
Actually, this is much less of a problem than they made it out to be in that forum post. First of all, if you own the content, then you should already know its encoding, so no standardization is required. If you are scraping other sites (as they do on that forum), you can use the iconv() function to convert everything to UTF-8. I've used it successfully with most Japanese encodings in the past.
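
A minimal sketch of that conversion, assuming you already know (or have detected) the source encoding. The helper name to_utf8() is made up for illustration, and the encoding names accepted depend on the iconv library PHP was built against; EUC-JP is used here only as an example.

```php
<?php
// Hypothetical helper: convert text from a known source encoding to UTF-8.
function to_utf8(string $text, string $sourceEncoding): string
{
    // iconv() returns false (and raises a notice) when conversion fails.
    $converted = iconv($sourceEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $sourceEncoding to UTF-8");
    }
    return $converted;
}

// Build an EUC-JP sample by converting a UTF-8 literal, then convert it back.
$eucJp = iconv('UTF-8', 'EUC-JP', '日本語');
$utf8  = to_utf8($eucJp, 'EUC-JP');
var_dump($utf8 === '日本語');
```

Once everything is in UTF-8, the mb_* string functions can work on characters rather than raw bytes.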