language detection?
Posted: Mon Oct 29, 2007 12:57 pm
Evening,
I'm here to pick the brains of all you gurus out there.
The Data: all UTF-8, but it covers many languages. As is often the way, the text is a mix of the native language and English (e.g. Russian with English nouns, German with English nouns, and so on).
Needed: a way to detect which language the content is in. We have no info to work with other than the characters themselves (all UTF-8); it's just string content, no HTML, no metadata, etc.
Any ideas? Any solution considered, in any language available on Linux, or even an API.
Many Thanks in advance.
nath