language detection?

nathanr
Forum Contributor
Posts: 200
Joined: Wed Jun 07, 2006 5:46 pm

Post by nathanr »

Evening,

I'm here to pick the brains of all you gurus out there..

The data: all UTF-8, but covering many languages. As is often the way, the language is a mix of native and English (i.e. Russian with English nouns, German with English nouns, and so on).

Needed: a way to detect which language the content is in. We have no info to play with other than the characters (which are all UTF-8); it's all just string content, no HTML, no metadata, etc.

Any ideas? Any solution considered, in any language available on Linux, or even an API.

Many Thanks in advance.

nath
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

If memory serves, Unicode assigns each script (rather than each language) its own block of code points. You could write something which looks up which blocks are used, thereby telling you which scripts are used and narrowing down the languages.

I don't use third party libraries often, so I can't recommend any to use, sorry.
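
A rough sketch of the block lookup described above, assuming the text is valid UTF-8 and PCRE was built with Unicode support. The function name `countUnicodeBlocks` and the short list of ranges are only illustrative; you would extend the map for whatever scripts you expect.

```php
<?php
// Hypothetical helper: count how many characters of $text fall into a few
// Unicode code point ranges. The list is deliberately short and incomplete.
function countUnicodeBlocks(string $text): array
{
    $blocks = [
        'Basic Latin letters'    => '[A-Za-z]',
        'Latin-1 letters'        => '[\x{00C0}-\x{00FF}]',
        'Greek and Coptic'       => '[\x{0370}-\x{03FF}]',
        'Cyrillic'               => '[\x{0400}-\x{04FF}]',
        'Hebrew'                 => '[\x{0590}-\x{05FF}]',
        'Arabic'                 => '[\x{0600}-\x{06FF}]',
        'CJK Unified Ideographs' => '[\x{4E00}-\x{9FFF}]',
    ];

    $counts = [];
    foreach ($blocks as $name => $class) {
        // The /u modifier makes PCRE treat pattern and subject as UTF-8.
        $counts[$name] = (int) preg_match_all('/' . $class . '/u', $text);
    }

    arsort($counts); // most frequent block first
    return $counts;
}

print_r(countUnicodeBlocks('Привет world'));
```

This only tells you the script, not the language: Russian and Ukrainian both land in Cyrillic, and German and English both land in the Latin ranges.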

Post by nathanr »

Cheers feyd, that's what I thought as well, to be honest. I was just hoping that you might know some quicker way. Thanks again for the response / confirmation.

Nathan

Post by feyd »

It would run pretty quick, but getting such a function (or class) written will take a bit of time.

Post by nathanr »

Just a quick update: I've yet to write a full class for this, however I've found the following most useful:

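<?php
// Convert every character to its decimal numeric entity (e.g. "л" becomes "&#1083;").
// The convmap is array(start, end, offset, mask); the mask has to cover the whole
// range, otherwise code points above 0xFFFF are truncated.
$entitycontent = mb_encode_numericentity($content, array(0x0, 0x10FFFF, 0, 0x1FFFFF), 'UTF-8');

// Simple preg_match example (code points 1070-1089, part of the Cyrillic block):
// preg_match('/&#10[78]\d;/', $entitycontent)
?>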
This will turn all UTF-8 chars into their numeric HTML entities, thus allowing you to do simple preg_match calls to detect which character range the content falls in. It's hardly infallible (perhaps something where if X% of the content is in charset X, then the content is language X).
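
The "X% of content" idea above could look something like this. It is only a sketch: the function name `dominantScript`, the list of scripts and the 60% default threshold are my own assumptions, and it uses PCRE's `\p{...}` script properties instead of the entity trick, so it needs valid UTF-8 input.

```php
<?php
// Hypothetical helper: return the script that makes up at least $threshold of
// the letters in $text, or null when no single script is dominant.
// With a threshold at or below 0.5 the first matching script in the list wins.
function dominantScript(string $text, float $threshold = 0.6): ?string
{
    $scripts = ['Latin', 'Cyrillic', 'Greek', 'Arabic', 'Hebrew', 'Han'];

    // Count letters only, so digits, spaces and punctuation do not skew the ratio.
    $total = preg_match_all('/\p{L}/u', $text);
    if (!$total) {
        return null; // no letters at all, or invalid UTF-8
    }

    foreach ($scripts as $script) {
        $count = preg_match_all('/\p{' . $script . '}/u', $text);
        if ($count / $total >= $threshold) {
            return $script;
        }
    }
    return null;
}

// Russian sentence with one English noun mixed in.
var_dump(dominantScript('Я купил новый laptop вчера'));
```

A threshold like this copes with the mixed content (Russian with English nouns), but it still cannot tell German from English, since both are Latin script.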
bokehman
Forum Regular
Posts: 509
Joined: Wed May 11, 2005 2:33 am
Location: Alicante (Spain)

Post by bokehman »

If you have a block of text of reasonable size the easiest way to recognize the language is through a chi-squared distribution test.
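
One way to read this, as a sketch only: compare the letter counts in the text against an expected letter-frequency profile per language and pick the language with the smallest chi-squared statistic. The function names `chiSquared` and `detectLanguage` are made up for illustration, the profiles are assumed to be something you build yourself from sample text in each language, and `mb_strtolower` assumes the mbstring extension is loaded.

```php
<?php
// Chi-squared statistic comparing the letters of $text with one language profile.
// $profile maps a lowercase letter to its expected relative frequency.
// Lower values mean a closer fit; characters not in the profile are ignored.
function chiSquared(string $text, array $profile): float
{
    $observed = array_fill_keys(array_keys($profile), 0);
    $total = 0;
    $chars = preg_split('//u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    foreach ($chars as $char) {
        if (isset($observed[$char])) {
            $observed[$char]++;
            $total++;
        }
    }
    if ($total === 0) {
        return INF; // nothing to compare against this profile
    }

    $profileSum = array_sum($profile); // normalise in case it does not sum to 1
    $chi2 = 0.0;
    foreach ($profile as $letter => $frequency) {
        $expected = $total * $frequency / $profileSum;
        if ($expected > 0) {
            $chi2 += ($observed[$letter] - $expected) ** 2 / $expected;
        }
    }
    return $chi2;
}

// Return the key of the best-fitting profile, or null if nothing matched.
function detectLanguage(string $text, array $profiles): ?string
{
    $best = null;
    $bestScore = INF;
    foreach ($profiles as $language => $profile) {
        $score = chiSquared($text, $profile);
        if ($score < $bestScore) {
            $best = $language;
            $bestScore = $score;
        }
    }
    return $best;
}

// Toy profiles with made-up numbers, NOT real language statistics.
$toyProfiles = [
    'X' => ['a' => 0.7, 'b' => 0.3],
    'Y' => ['a' => 0.2, 'b' => 0.8],
];
var_dump(detectLanguage('aaaaaaabbb', $toyProfiles));
```

As the post says, this needs a block of text of reasonable size; on a short string the counts are too small for the statistic to mean much.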