
Analyse and categorize information

Posted: Wed Jan 11, 2012 1:03 pm
by imac
Hello guys,

I am trying to sort some texts into different categories (sports, economy, politics, technology, etc.).
I want the system to do it by itself, without any human interaction.

I was wondering if you have any ideas about how I could deal with this problem.

I was thinking about looking for some words or groups of words that would define each category, but sometimes these words might not appear in the texts...
So, any other ideas?

I know that it has been implemented on some websites, such as http://paper.li (it reads tweets and categorizes them).

Thanks.

Re: Analyse and categorize information

Posted: Wed Jan 11, 2012 5:32 pm
by Christopher
You probably need to compare the texts to a large number of words and phrases and give each text a score per category based on the number of matches, and probably on combinations of terms used together. For example, the word "compete" might be in articles about sports, economy and politics, whereas other terms would be more specific to only one area.
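A minimal sketch of that scoring idea, in Python for brevity. The keyword lists and weights below are made up for illustration; a real system would need far larger lists, and this sketch ignores combinations of terms.

```python
import re
from collections import Counter

# Hypothetical keyword weights: a higher weight means the term is more
# specific to that category. "compete" appears everywhere with a low weight.
KEYWORDS = {
    "sports": {"match": 2, "goal": 2, "league": 3, "compete": 1},
    "economy": {"market": 2, "inflation": 3, "bank": 2, "compete": 1},
    "politics": {"election": 3, "parliament": 3, "vote": 2, "compete": 1},
}

def score_text(text):
    """Return a {category: score} dict based on weighted keyword matches."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        category: sum(weight * words[term] for term, weight in terms.items())
        for category, terms in KEYWORDS.items()
    }

def best_category(text):
    """Return the highest-scoring category, or None if nothing matched."""
    scores = score_text(text)
    top = max(scores, key=scores.get)
    return top if scores[top] > 0 else None
```

Calling best_category() on a text with no known keywords returns None, which is exactly the weakness mentioned in the first post: the defining words might not appear at all.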

Perhaps seeding the system from the other direction might make sense. Do word counts on many articles in many subjects to find the unique/rare words for each subject.
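One way to read "seeding from the other direction", sketched in Python: count words per subject over articles whose subject is already known, then keep the words that show up in no other subject. Treating "unique" as "absent from every other subject" is an assumption of this sketch; a softer relative-frequency cutoff would also fit the description.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def distinctive_words(articles_by_subject, top_n=3):
    """For each subject, return its most frequent words that occur in no other subject."""
    counts = {
        subject: Counter(word for article in articles for word in tokenize(article))
        for subject, articles in articles_by_subject.items()
    }
    result = {}
    for subject, counter in counts.items():
        # Collect every word seen in any other subject.
        seen_elsewhere = set()
        for other, other_counter in counts.items():
            if other != subject:
                seen_elsewhere.update(other_counter)
        unique = {w: n for w, n in counter.items() if w not in seen_elsewhere}
        # Most frequent first; ties broken alphabetically for a stable order.
        result[subject] = sorted(unique, key=lambda w: (-unique[w], w))[:top_n]
    return result
```

Common words such as "the" drop out on their own because they occur in every subject.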

Re: Analyse and categorize information

Posted: Thu Jan 12, 2012 3:31 am
by imac
Thanks for the answer, Christopher. What you say is completely correct.
I think I found a solution for this. It uses probability, as you said, and it works after being trained beforehand on some sample texts.

It's called a "Naive Bayesian" classifier.
I am going to use this one I found implemented in PHP:
http://www.phpclasses.org/package/5072- ... ethod.html

It just needs a few small changes.

Anyway, if you guys are interested in it, there's another PHP implementation here:
http://www.xhtml.net/php/PHPNaiveBayesianFilter
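For anyone who wants to see the idea without reading either package, here is a minimal Python sketch of a multinomial naive Bayes text classifier with add-one (Laplace) smoothing. It is not the code from the linked PHP classes; the class and method names are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.doc_counts = Counter()               # training documents per category
        self.word_counts = defaultdict(Counter)   # word frequencies per category
        self.vocabulary = set()                   # every word seen in training

    def train(self, text, category):
        """Add one labeled example to the model."""
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        """Return the category with the highest log-probability for the text."""
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            # Log prior: share of training documents in this category.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            denom = total_words + len(self.vocabulary)
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the product.
                score += math.log((self.word_counts[category][word] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best
```

Usage is two steps: call train(text, category) once per labeled example, then classify(text) on new texts. Working in log space avoids underflow when multiplying many small probabilities.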

Thanks.

Re: Analyse and categorize information

Posted: Thu Jan 12, 2012 4:03 pm
by Christopher
Thanks for posting the information you found back to the forum. Please let us know how the project goes. Good luck!

Re: Analyse and categorize information

Posted: Mon Jan 16, 2012 9:32 am
by Mordred
Naive Bayes classifiers do not work without human input though (you need a training set), and the general accuracy is not very good (although for how easy they are to implement and how much training they need, the results are quite good).

If you really want 100% unsupervised classification, k-means clustering might be a better choice (well, to use it as a classifier you'll need K examples so you can properly label the K clusters, but apart from that it's really unsupervised machine learning).
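A rough Python sketch of k-means on word-count vectors, to make the idea concrete. The caller passes the starting centroids explicitly so the run is repeatable; that is a simplification of this sketch, since real implementations usually pick them at random or with a seeding scheme such as k-means++.

```python
import re

def vectorize(texts):
    """Turn each text into a list of word counts over a shared, sorted vocabulary."""
    tokenized = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = sorted({w for words in tokenized for w in words})
    return [[words.count(w) for w in vocab] for words in tokenized]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, initial_centroids, max_iter=20):
    """Return a cluster index for each vector, given K starting centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach every vector to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda i: sq_dist(v, centroids[i]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

The output is only cluster numbers (0, 1, ...), not names. That is where the K labeled examples come in: you look at which cluster each example lands in and name the cluster after it.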

Re: Analyse and categorize information

Posted: Mon Jan 16, 2012 9:52 am
by imac
What do you mean by a cluster? I don't get it.

Thanks.

Re: Analyse and categorize information

Posted: Mon Jan 16, 2012 9:55 am
by Mordred

Re: Analyse and categorize information

Posted: Mon Jan 16, 2012 11:36 am
by imac
OK, I see.

But... if I have some categories defined by me (let's say politics, sports, economy), and I have some texts which I know fit in one of those categories, isn't it simpler to use the Bayesian method, since I already have the categories and text examples defined?

It won't be completely automatic, as I have to train it at first, but after that... it will be. And maybe more accurate than with clusters. Right?

Thanks.

Re: Analyse and categorize information

Posted: Mon Jan 16, 2012 11:47 am
by Mordred
Accuracy in machine learning largely depends on the algorithm and training set.

I've implemented a naive Bayes classifier before, and for topic classification it has ~85% accuracy on some newsgroup data I took from here: http://www.cs.cmu.edu/~tom/book.html (with 2/3 of the data used for training and 1/3 for testing). For "harder" tasks (authorship determination for forum posts) the accuracy was ~53%, way too low for real-world use.
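The 2/3 training, 1/3 testing split can be sketched like this in Python. It is a generic hold-out helper, not the code behind the numbers above; train_fn and predict_fn are hypothetical callbacks standing in for whatever classifier is being measured.

```python
import random

def holdout_accuracy(samples, train_fn, predict_fn, train_fraction=2 / 3, seed=0):
    """Train on part of the labeled samples and return accuracy on the rest.

    samples    -- list of (text, label) pairs
    train_fn   -- hypothetical callback: takes the training pairs, returns a model
    predict_fn -- hypothetical callback: takes (model, text), returns a label
    """
    shuffled = samples[:]
    # Shuffle with a fixed seed so the same split is produced on every run.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    model = train_fn(train)
    correct = sum(1 for text, label in test if predict_fn(model, text) == label)
    return correct / len(test)
```

The important part is that the test texts are never shown to the classifier during training; otherwise the accuracy figure is inflated. The helper needs enough samples that the test part is not empty.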

I haven't tried clustering for text classification (and I'm too lazy to search for other people's results), but I believe it will be better than that. Naive Bayes is called "naive" for a reason :).