
language detection?

Posted: Mon Oct 29, 2007 12:57 pm
by nathanr
Evening,

I'm here to pick the brains of all you gurus out there.

The data: all UTF-8, but it covers many languages. As is often the way, the language is a mix of native and English (i.e. Russian with English nouns, German with English nouns, and so on).

Needed: a way to detect which language the content is in. We have no info to play with other than the characters (which are all UTF-8); it's all just string content, no HTML, no metadata, etc.

Any ideas? Any solution considered: any language available on Linux, or even an API.

Many Thanks in advance.

nath

Posted: Mon Oct 29, 2007 1:27 pm
by feyd
If memory serves, Unicode (which UTF-8 encodes) assigns each script its own block of code points. You could write something which looks up which blocks are used, thereby telling you which scripts, and so roughly which languages, are used.
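
A rough sketch of that block lookup, in case it helps. The function name `countUnicodeBlocks` is made up for illustration, the block table is only a small subset of the real Unicode block list, and it assumes PHP 7.2+ with the mbstring extension (for `mb_ord`):

```php
<?php
// Hypothetical helper: tally which Unicode blocks the letters of a UTF-8 string fall into.
function countUnicodeBlocks(string $text): array
{
    // Illustrative subset of Unicode blocks: name => [first, last] code point.
    $blocks = [
        'Basic Latin'        => [0x0000, 0x007F],
        'Latin-1 Supplement' => [0x0080, 0x00FF],
        'Greek and Coptic'   => [0x0370, 0x03FF],
        'Cyrillic'           => [0x0400, 0x04FF],
        'Hebrew'             => [0x0590, 0x05FF],
        'Arabic'             => [0x0600, 0x06FF],
    ];

    // Keep letters only, so spaces, digits and punctuation do not skew the tally.
    preg_match_all('/\p{L}/u', $text, $matches);

    $counts = [];
    foreach ($matches[0] as $char) {
        $codePoint = mb_ord($char, 'UTF-8');
        $name = 'Other'; // anything outside the subset above
        foreach ($blocks as $blockName => [$first, $last]) {
            if ($codePoint >= $first && $codePoint <= $last) {
                $name = $blockName;
                break;
            }
        }
        $counts[$name] = ($counts[$name] ?? 0) + 1;
    }

    arsort($counts); // most frequent block first
    return $counts;
}

print_r(countUnicodeBlocks('Привет, это server'));
```

Bear in mind this only tells you the script: Russian and Bulgarian both land in Cyrillic, and German and English both land in the Latin blocks.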

I don't use third party libraries often, so I can't recommend any to use, sorry.

Posted: Mon Oct 29, 2007 1:30 pm
by nathanr
Cheers feyd, that's what I thought as well, to be honest. I was just hoping that you might know some quicker way. Thanks again for the response / confirmation.

Nathan

Posted: Mon Oct 29, 2007 1:33 pm
by feyd
It would run pretty quick, but getting such a function (or class) written will take a bit of time.

Posted: Sat Nov 03, 2007 4:47 pm
by nathanr
Just a quick update: I've yet to write a full class for this, but I've found the following most useful:

Code:

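<?php
// $content is the UTF-8 input string.
// Convmap is array(start, end, offset, mask); this one covers every code point up to U+10FFFF.
$entitycontent = mb_encode_numericentity($content, array(0x0, 0x10FFFF, 0, 0x1FFFFF), 'UTF-8');

// Simple preg_match example: entities 1070-1089 are Cyrillic letters.
// preg_match('/&#10[78]\d;/', $entitycontent)
?>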
This turns all the UTF-8 characters into their numeric HTML entities, which lets you use simple preg_match calls to detect which script (Cyrillic, Latin and so on) the content is in. It's hardly infallible; perhaps something like: if X% of the content is in script X, then the content is language X.
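
The "X% of content" idea could look something like the sketch below. It swaps the entity trick for PCRE's Unicode script properties (`\p{Cyrillic}` and friends), which count the same thing without the encoding step. The function name `dominantScript`, the script list and the 60% default are all assumptions for illustration:

```php
<?php
// Hypothetical helper: return the script that makes up at least $threshold
// of the letters in $text, or null when no script dominates.
function dominantScript(string $text, float $threshold = 0.6): ?string
{
    // Illustrative list of PCRE script names; extend as needed.
    $scripts = ['Latin', 'Cyrillic', 'Greek', 'Arabic', 'Hebrew', 'Han'];

    // Total number of letters, ignoring spaces, digits and punctuation.
    $total = preg_match_all('/\p{L}/u', $text);
    if ($total === false || $total === 0) {
        return null;
    }

    foreach ($scripts as $script) {
        $count = preg_match_all('/\p{' . $script . '}/u', $text);
        if ($count !== false && $count / $total >= $threshold) {
            return $script;
        }
    }

    return null;
}

var_dump(dominantScript('Привет мир, это server'));
```

With a threshold above 0.5 at most one script can qualify, so the order of the list does not matter. As with the entity approach, this only separates scripts, so mixed Russian/English content is handled but German vs English is not.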

Posted: Sat Nov 03, 2007 5:03 pm
by bokehman
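If you have a block of text of reasonable size, the easiest way to recognize the language is a chi-squared test: compare the letter frequencies of the text against the expected frequencies for each candidate language and pick the closest fit.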
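
A minimal sketch of that comparison, under a few assumptions: the function names are made up, the expected frequencies come from a reference sample you supply per language (the two sentences below are toy samples, real use wants a large corpus for each), add-one smoothing is used so a letter missing from the reference never gives an expected count of zero, and the lowest statistic simply wins rather than being checked against a critical value:

```php
<?php
// Count lowercase letters in a UTF-8 string, e.g. ['e' => 12, 't' => 9, ...].
function letterCounts(string $text): array
{
    preg_match_all('/\p{L}/u', mb_strtolower($text, 'UTF-8'), $matches);
    return array_count_values($matches[0]);
}

// Chi-squared statistic of $text against a reference sample of one language.
// Lower means the letter distribution is closer to the reference.
function chiSquared(string $text, string $reference): float
{
    $observed  = letterCounts($text);
    $refCounts = letterCounts($reference);
    $alphabet  = array_keys($observed + $refCounts);
    $n         = array_sum($observed);
    if ($n === 0) {
        return INF; // no letters to compare
    }

    // Add-one smoothing: every letter of the alphabet gets one extra reference count.
    $refTotal = array_sum($refCounts) + count($alphabet);

    $chi = 0.0;
    foreach ($alphabet as $letter) {
        $expected = $n * (($refCounts[$letter] ?? 0) + 1) / $refTotal;
        $diff     = ($observed[$letter] ?? 0) - $expected;
        $chi     += $diff * $diff / $expected;
    }
    return $chi;
}

// Return the key of the reference sample with the lowest chi-squared statistic.
function guessLanguage(string $text, array $references): ?string
{
    $best = null;
    $bestScore = INF;
    foreach ($references as $language => $reference) {
        $score = chiSquared($text, $reference);
        if ($score < $bestScore) {
            $bestScore = $score;
            $best = $language;
        }
    }
    return $best;
}

// Toy reference samples, for illustration only.
$references = [
    'en' => 'the quick brown fox jumps over the lazy dog and then it runs away from the house',
    'de' => 'der schnelle braune fuchs springt über den faulen hund und läuft dann vom haus weg',
];
echo guessLanguage('the dog runs over the house', $references), PHP_EOL;
```

Unlike the script-based checks, this can in principle tell apart languages that share an alphabet, but it needs a reasonable amount of text and decent reference samples before the result means much.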