Evening,
I'm here to pick the brains of all you gurus out there.
The data: all UTF-8, but it covers many languages, and as is often the way, the text is a mix of the native language and English (i.e. Russian with English nouns, German with English nouns, and so on).
Needed: a way to detect which language the content is in. We have no info to work with other than the characters themselves (which are all UTF-8); it's all just string content, no HTML, no metadata, etc.
Any ideas? Any solution considered: any language available on Linux, or even an API.
Many Thanks in advance.
nath
language detection?
Just a quick update: I've yet to write a full class for this, but I've found the following most useful.
It turns all UTF-8 characters into their numeric HTML entities, which lets you use simple preg_match calls to detect which character range (script) the content falls in. It's hardly infallible; perhaps something along the lines of: if X% of the content is in script X, then the content is language X.
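<?php
// Convert every character in $content (assumed to be a UTF-8 string) to a numeric HTML entity.
// The convmap is [start, end, offset, mask]; the mask has to cover the whole range,
// otherwise code points above U+FFFF are truncated.
$entitycontent = mb_encode_numericentity($content, array(0x0, 0x10FFFF, 0, 0x1FFFFF), 'UTF-8');
// Simple preg_match example: the entities &#1070; to &#1089; are all Cyrillic letters.
$hasCyrillic = preg_match('/&#10[78]\d;/', $entitycontent);
?>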
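For what it's worth, the "X% of content is in script X" idea could be sketched roughly like this. It is only a sketch under a few assumptions: it skips the entity step and counts letters directly with PCRE's Unicode script properties (`\p{Cyrillic}` and friends with the `/u` flag), it assumes PHP 7.3+ for `array_key_first`, and the function names `script_shares` and `dominant_script`, the script list, and the 0.5 threshold are all made up for illustration.

```php
<?php
// Hypothetical helper: share of letters per script, highest share first.
function script_shares(string $text): array
{
    // Illustrative list only; extend it with whatever scripts your data contains.
    $scripts = ['Latin', 'Cyrillic', 'Greek', 'Arabic', 'Hebrew', 'Han'];
    $counts = [];
    foreach ($scripts as $script) {
        // With the /u flag, \p{Name} matches UTF-8 characters by Unicode script.
        // preg_match_all returns false on invalid UTF-8; the cast turns that into 0.
        $counts[$script] = (int) preg_match_all('/\p{' . $script . '}/u', $text);
    }
    $total = array_sum($counts);
    if ($total === 0) {
        return []; // no letters from any listed script
    }
    $shares = [];
    foreach ($counts as $script => $n) {
        $shares[$script] = $n / $total;
    }
    arsort($shares); // sort by share, descending, keeping the script names as keys
    return $shares;
}

// Hypothetical helper: the top script if it reaches the threshold, otherwise null.
function dominant_script(string $text, float $threshold = 0.5): ?string
{
    $shares = script_shares($text);
    if ($shares === []) {
        return null;
    }
    $top = array_key_first($shares);
    return $shares[$top] >= $threshold ? $top : null;
}
```

So `dominant_script('Привет, это мой новый laptop')` should come back as 'Cyrillic' even with the English noun mixed in. Bear in mind this only tells you the script, not the language: German and English are both Latin, so telling those apart would need something extra, such as word or n-gram statistics.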