That's gonna be hard. Japanese writing leans on meaning-bearing characters a lot more than ours does: a sequence of characters doesn't necessarily represent a "word".
Unless you redefine what a word is...
Word counting for Japanese characters?
Re: Word counting for Japanese characters?
This is a complicated subject since, as you said, Japanese doesn't use spaces as separators. There are other separators, though. Check out this discussion, from a forum for a web-spidering tool written in PHP, about tokenizing Japanese text into words:
http://www.phpdig.net/forum/archive/ind ... t-355.html
The other approach would be brute-forcing it using a dictionary.
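To make the dictionary idea concrete, here is a minimal sketch of greedy longest-match segmentation. It is written in Python for brevity rather than PHP, and the `segment` function and the tiny `TOY_DICTIONARY` are hypothetical names made up for illustration. A real dictionary would be far larger, and greedy matching can pick the wrong split on ambiguous text, so treat this as one possible approach rather than the method used by any particular tool.

```python
def segment(text, dictionary):
    """Split text into words by greedy longest match against a dictionary.

    At each position the longest dictionary entry that matches is taken.
    If nothing matches, the single character is emitted as its own token.
    """
    # Longest entry length bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"私", "は", "日本語", "日本", "を", "勉強", "します"}

tokens = segment("私は日本語を勉強します", TOY_DICTIONARY)
print(tokens)
print("word count:", len(tokens))
```

The word count is then just the length of the returned list. Note that "日本語" wins over the shorter entry "日本" because longer matches are tried first.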
Re: Word counting for Japanese characters?
gromlok wrote: The problem is that Japanese does not have a word delimiter like the Western alphabet has, that is, the white space.
Then how do you define 'word' in Japanese?
Re: Word counting for Japanese characters?
The same way you do in English; they just use spaces very sparingly in sentences. IfIwrotethesameinEnglish, you would be able to read it since you know the words. Sometimes they put half-spaces (very small) between them. It might seem confusing, but it gets less so when you get used to it.
Re: Word counting for Japanese characters?
"First I must standardize the encoding of Japanese"
Actually, this is much less of a problem than they made of it in that forum post. First of all, if you own the content then you should know the encoding, so no standardization is required. If you are scraping other sites (as they do on that forum), you can use the iconv() function to convert everything to UTF-8. I've used it successfully with most Japanese encodings in the past.
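As a rough illustration of that normalization step, here is a Python sketch that does the same kind of conversion iconv() does in PHP: decode the bytes from a known source encoding and re-encode them as UTF-8. The `to_utf8` helper is a hypothetical name, and the sketch assumes you already know the source encoding (Shift_JIS in this example); detecting an unknown encoding is a separate problem not covered here.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Convert bytes in a known source encoding to UTF-8 bytes.

    Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    """
    text = raw.decode(source_encoding)  # bytes -> Unicode string
    return text.encode("utf-8")         # Unicode string -> UTF-8 bytes


# Simulate scraped content that arrived as Shift_JIS.
sjis_bytes = "日本語のテキスト".encode("shift_jis")
utf8_bytes = to_utf8(sjis_bytes, "shift_jis")

# Once everything is UTF-8, counting characters no longer depends
# on which encoding the page originally used.
print(len(utf8_bytes.decode("utf-8")), "characters")
```

Python's codec names for the other common Japanese encodings include "euc_jp" and "iso2022_jp", so the same helper would cover those as well.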