Analyse and categorize information

imac
Forum Newbie
Posts: 6
Joined: Wed Jan 11, 2012 12:57 pm

Analyse and categorize information

Post by imac »

Hello guys,

I am trying to sort some texts into categories (sports, economy, politics, technology, etc.).
I want the system to do it by itself, without any human interaction.

I was wondering if you have any ideas about how I could deal with this problem.

I was thinking about looking for words or groups of words that define each category, but sometimes these words might not appear in the texts...
So, any other ideas?

I know this has been implemented on some websites, such as http://paper.li (it reads tweets and categorizes them).

Thanks.
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Analyse and categorize information

Post by Christopher »

You probably need to compare each text to a large number of words and phrases and give it a score based on the number of matches, and probably also on combinations of terms used together. For example, the word "compete" might appear in articles about sports, economy and politics, whereas other terms would be specific to only one area.

Perhaps seeding the system from the other direction would make sense: do word counts on many articles in many subjects to find the unique or rare words for each subject.
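
The word-count idea can be sketched roughly as follows. This is only a minimal illustration in Python, not anyone's actual implementation: the function names (`tokenize`, `distinctive_words`, `score`) and the seed articles are made up for the example, and "unique" is taken to mean "appears in no other subject", which is an assumption.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def distinctive_words(articles_by_subject):
    """Map each subject to the words that occur in no other subject."""
    counts = {
        subject: Counter(w for article in articles for w in tokenize(article))
        for subject, articles in articles_by_subject.items()
    }
    distinctive = {}
    for subject, counter in counts.items():
        seen_elsewhere = set()
        for other, other_counter in counts.items():
            if other != subject:
                seen_elsewhere.update(other_counter)
        distinctive[subject] = set(counter) - seen_elsewhere
    return distinctive

def score(text, distinctive):
    """Count how many words of the text are distinctive for each subject."""
    words = tokenize(text)
    return {
        subject: sum(1 for w in words if w in vocab)
        for subject, vocab in distinctive.items()
    }

# Hypothetical seed articles, far too few for real use.
seed = {
    "sports": ["The team will compete in the final match",
               "The striker scored a goal"],
    "economy": ["Banks compete for customers as inflation rises",
                "The market fell on weak earnings"],
}
vocab = distinctive_words(seed)
scores = score("Inflation worries hit the market", vocab)
best = max(scores, key=scores.get)
print(scores, best)
```

Note that "compete" drops out of both word lists because it shows up in both subjects, which is the behaviour described above. A real system would need many more articles and some weighting for rare words.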
imac
Forum Newbie
Posts: 6
Joined: Wed Jan 11, 2012 12:57 pm

Re: Analyse and categorize information

Post by imac »

Thanks for the answer, Christopher. What you say is completely correct.
I think I found a solution for this. It uses probability, as you said, and it has to be trained beforehand on some example texts.

It's called a "naive Bayes" classifier.
I am going to use this implementation I found in PHP:
http://www.phpclasses.org/package/5072- ... ethod.html

It just needs a few small changes.

Anyway, if you guys are interested in it, there's another PHP implementation here:
http://www.xhtml.net/php/PHPNaiveBayesianFilter
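
For readers who want to see the idea without opening the links, here is a minimal sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing. It is written in Python rather than PHP and is not the code from either package linked above; the class name, method names and training texts are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training texts
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, category):
        """Record the words of one labeled example text."""
        words = self.tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the category with the highest log-probability."""
        words = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Prior: share of training texts in this category.
            log_prob = math.log(n_docs / total_docs)
            for w in words:
                # Add-one smoothing so unseen words do not zero out the product.
                log_prob += math.log(
                    (counts[w] + 1) / (total_words + len(self.vocabulary))
                )
            if log_prob > best_score:
                best, best_score = category, log_prob
        return best

# Made-up training texts, two per category.
clf = NaiveBayes()
clf.train("the team won the match", "sports")
clf.train("the striker scored a late goal", "sports")
clf.train("the senate passed the bill", "politics")
clf.train("the president signed a new law", "politics")
clf.train("the market fell as inflation rose", "economy")
clf.train("banks raised interest rates", "economy")

print(clf.classify("a goal in the final match"))
```

Working with log-probabilities instead of multiplying raw probabilities avoids numeric underflow on longer texts.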

Thanks.
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Analyse and categorize information

Post by Christopher »

Thanks for posting the information you found back to the forum. Please let us know how the project goes. Good luck!
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Re: Analyse and categorize information

Post by Mordred »

Naive Bayes classifiers do not work without human input though (you need a labeled training set), and the general accuracy is not very good (although given how easy they are to implement and how little training they need, the results are quite good).

If you really want 100% unsupervised classification, k-means clustering might be a better choice (well, to use it as a classifier you'll need K examples so you can properly label the K clusters, but apart from that it's really unsupervised machine learning).
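
A rough sketch of what k-means on texts could look like, assuming plain word-count vectors and squared Euclidean distance. The deterministic "farthest point" start is a choice made only for this example (random starting points are the usual approach), and the function names and sample texts are made up.

```python
import re
from collections import Counter

def vectorize(texts):
    """Turn texts into word-count vectors over a shared, sorted vocabulary."""
    tokenized = [re.findall(r"[a-z']+", t.lower()) for t in texts]
    vocab = sorted({w for words in tokenized for w in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([counts[w] for w in vocab])
    return vectors

def distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Return one cluster index per vector."""
    # Start from the first vector, then repeatedly add the vector that is
    # farthest from all centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(distance(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda i: distance(v, centroids[i]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up texts: two about sports, two about the economy.
texts = [
    "goal match team goal",
    "team match striker goal",
    "market inflation banks market",
    "banks inflation rates market",
]
labels = kmeans(vectorize(texts), k=2)
print(labels)
```

The cluster numbers carry no meaning on their own, which is why you still need one known example per cluster to attach a name like "sports" to it. On real text, raw counts are usually replaced by something like TF-IDF weights.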
imac
Forum Newbie
Posts: 6
Joined: Wed Jan 11, 2012 12:57 pm

Re: Analyse and categorize information

Post by imac »

What do you mean by a cluster? I don't get it.

Thanks.
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Re: Analyse and categorize information

Post by Mordred »

imac
Forum Newbie
Posts: 6
Joined: Wed Jan 11, 2012 12:57 pm

Re: Analyse and categorize information

Post by imac »

OK, I see.

But... if I have some categories that I defined myself (let's say politics, sports, economy), and I have some texts which I know fit into one of those categories, isn't it simpler to use the Bayesian method, since I already have the categories and example texts?

It won't be completely automatic, as I have to train it first, but after that it will be. And maybe it will be more accurate than with clusters. Right?

Thanks.
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Re: Analyse and categorize information

Post by Mordred »

Accuracy in machine learning largely depends on the algorithm and the training set.

I've implemented a naive Bayes classifier before, and for topic classification it has ~85% accuracy on some newsgroup data I took from here: http://www.cs.cmu.edu/~tom/book.html (with 2/3 of the data used for training and 1/3 for testing). For "harder" tasks (authorship determination for forum posts) the accuracy was ~53%, way too low for real-world use.
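
The 2/3 training, 1/3 testing measurement can be sketched like this. It is a generic hold-out helper, not the code used for the numbers above: `holdout_accuracy` and `fit_majority` are hypothetical names, the data is made up, and the majority-label baseline only stands in for a real classifier.

```python
import random
from collections import Counter

def holdout_accuracy(examples, fit, seed=0):
    """Train on 2/3 of the labeled examples and measure accuracy on the rest.

    examples: list of (text, label) pairs.
    fit: callable that takes training pairs and returns predict(text) -> label.
    """
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 2) // 3
    train, test = shuffled[:cut], shuffled[cut:]
    if not test:
        raise ValueError("need at least one example left over for testing")
    predict = fit(train)
    correct = sum(1 for text, label in test if predict(text) == label)
    return correct / len(test)

def fit_majority(train):
    """Stand-in baseline: always predict the most common training label."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: most_common

# Made-up labeled data: 6 sports texts and 3 economy texts.
examples = [("sports text %d" % i, "sports") for i in range(6)]
examples += [("economy text %d" % i, "economy") for i in range(3)]

print(holdout_accuracy(examples, fit_majority))
```

Any classifier can be plugged in through `fit`, and comparing its score against a dumb baseline like this one shows how much it has actually learned.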

I haven't tried clustering for text classification (and I'm too lazy to search for other people's results), but I believe it will be better than that. Naive Bayes is called "naive" for a reason :).