Beyes filter - spam filtering
Posted: Wed Jan 28, 2009 10:49 am
by Wolfie
Hello all!
It's my first post here.
I am trying to figure out how Bayesian spam filters work, and I want to program one in PHP.
The thing is that I can't find any articles more detailed than these:
Implement Bayesian Inference
A plan for spam
The thing is, I don't know how to calculate the probability for each token, both new ones and ones the filter has already seen. Anyone who knows the topic will know what I am talking about.
I will be grateful for any help.
Re: Beyes filter - spam filtering
Posted: Thu Jan 29, 2009 5:54 pm
by josh
You don't calculate the probability in real time. As you train the filter, it learns new logical branches in the domain of knowledge. For instance, if you started with an empty knowledge base (KB) and told the KB that online casino = spam, the probabilities would be:
online casino => spam 100%
online => spam 100%
casino => spam 100%
Then, if you told the KB online !=> spam ("!=>" meaning "does not imply") and casino !=> spam, the table would look more like:
online casino => spam 100%
online => spam 50%
casino => spam 50%
Notice how online and casino in their unary forms now carry a 50% probability even though you told the KB they did not mean spam: each is "absorbing" weighting from the other term. Now if you "asked" the KB whether "online casino" was spam, it should return 100%; if you asked whether "online gambling" was spam, it should return 50%. Let's say your threshold was 75%; then this message would be classified as nonspam. If the user then told the KB that online gambling WAS spam, the KB would probably be updated as follows:
online casino => spam 100%
online gambling => spam 100%
online => spam 60%
casino => spam 50%
gambling => spam 100%
Over time your filter would "learn" the logical branches of your domain (what makes something spam vs. what makes it not spam). Features that occur in both spam and non-spam will count for less, because of the probabilistic reasoning. Hope this helps.
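A minimal sketch of the counting behind the tables above, written in Python for brevity (the thread targets PHP, and the same logic ports directly). It assumes the knowledge base is nothing more than per-feature spam and non-spam counters, and that the features of a message are its single words plus the whole phrase. The names `kb`, `features`, `train` and `spam_probability` are hypothetical. Under this plain counting, "online" ends up at about 67% after the last training step rather than the 60% estimated above.

```python
from collections import defaultdict

# Hypothetical knowledge base: feature -> [spam_count, non_spam_count].
kb = defaultdict(lambda: [0, 0])

def features(text):
    """Single words plus the full phrase (an assumption of this sketch)."""
    words = text.lower().split()
    feats = set(words)
    if len(words) > 1:
        feats.add(" ".join(words))
    return feats

def train(text, is_spam):
    """Record one labelled example in the knowledge base."""
    for feature in features(text):
        kb[feature][0 if is_spam else 1] += 1

def spam_probability(feature):
    """Fraction of training examples containing the feature that were spam."""
    spam, ham = kb.get(feature, (0, 0))
    total = spam + ham
    return spam / total if total else None  # None: feature never seen

# The training steps described in the post above.
train("online casino", True)
train("online", False)
train("casino", False)
train("online gambling", True)

for name in ("online casino", "online gambling", "online", "casino", "gambling"):
    print(name, "=> spam", spam_probability(name))
```

A real filter would also need a rule for features it has never seen; here they simply return `None`.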
Re: Beyes filter - spam filtering
Posted: Fri Jan 30, 2009 9:02 am
by Wolfie
Sure, this is clear.
After studying some math I made an example:
A - the email is spam
B - the word "free" occurred
Now the probability that the email is spam given that the word "free" occurred is given by the formula:
P(A | B) = {A & B} / {B}
So let's say that the word "free" occurred in 20 of my emails ({B}), and 15 of those emails were spam ({A & B}); then the calculation looks like this:
P(A | B) = 15/20 = 0.75
It means that the probability that an email is spam when the word "free" occurs is 75%.
Now the challenge is how to design the knowledge base.
And will this kind of calculation be enough to build a Bayes filter? Is it possible that it is so simple?!
And another question:
What should I do if a spam word occurs, let's say, 3 times in one email?
And if I have lots of different spam words in an email, should I calculate the average of their probabilities and then decide whether the message is spam, or should I calculate a probability from the probabilities?
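On the last question: the "A plan for spam" article linked in the first post does not average the per-word probabilities. It combines them as p1*p2*...*pn / (p1*p2*...*pn + (1-p1)*(1-p2)*...*(1-pn)), which treats the words as independent pieces of evidence. A small Python sketch of that formula follows; the function name `combine` and the example probabilities are made up for illustration.

```python
from math import prod

def combine(probs, low=0.01, high=0.99):
    """Combine per-word spam probabilities, assuming the words are independent.

    Values are clamped to [low, high] so that a single 0 or 1 cannot
    force the result or cause a division by zero.
    """
    clamped = [min(max(p, low), high) for p in probs]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)

# Hypothetical per-word probabilities for one message.
word_probs = [0.75, 0.9, 0.2]
print("combined:", round(combine(word_probs), 3))
print("average: ", round(sum(word_probs) / len(word_probs), 3))
```

With these numbers the combined value comes out noticeably higher than the plain average, because two strong spam indicators outweigh one weak non-spam indicator.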

Re: Beyes filter - spam filtering
Posted: Fri Jan 30, 2009 1:43 pm
by josh
The way SpamAssassin works, it extracts "features" from the document and uses them as the rows of the Bayesian "matrix" of probabilities. A feature could be an atomic (single) word, or something like the fact that more than 50% of the message is in caps lock. Once you have your probabilities, you use Bayes' theorem to derive P(A | B) from P(B | A).
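A sketch of that last step in Python, reusing the "free" example from the previous post. Only the 20 emails containing "free" and the 15 spam emails among them come from the thread; the inbox totals (100 messages, 40 of them spam) and the function name `bayes` are invented for illustration.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

total_emails = 100      # assumed for illustration
spam_emails = 40        # assumed for illustration
emails_with_free = 20   # from the thread: {B}
spam_with_free = 15     # from the thread: {A & B}

p_spam = spam_emails / total_emails               # P(A)
p_free = emails_with_free / total_emails          # P(B)
p_free_given_spam = spam_with_free / spam_emails  # P(B | A)

p_spam_given_free = bayes(p_free_given_spam, p_spam, p_free)
print(round(p_spam_given_free, 2))
```

The result agrees with the direct count 15/20 whatever totals are assumed, because the totals cancel out in the formula.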
Knowledge representation and inference need propositional logic or first-order logic. That's probably not what you want. I recommend this book (co-authored by the guys in charge of this stuff at NASA and Google): "Artificial Intelligence: A Modern Approach".
Re: Beyes filter - spam filtering
Posted: Sun Feb 01, 2009 8:21 am
by Wolfie
If you are telling me that it's not what I want, why did you recommend this book to me?
I got the book and I have read about propositional logic and first-order logic, but what do they have in common with the probabilistic methods of Bayes?
Should I decide whether the mail is spam using true or false statements?
I don't get it.
Re: Beyes filter - spam filtering
Posted: Sun Feb 01, 2009 11:05 am
by josh
Wolfie wrote: If you are telling me that it's not what I want, why did you recommend this book to me?
I got the book
Apparently you must have overlooked sections V and VI, for instance chapter 14, entitled "Probabilistic Reasoning", chapter 20, "Statistical Learning Methods", or chapter 19, "Knowledge in Learning". You were the one who brought up knowledge representation; I said that if you want to make a basic spam filter, that is not what you want. I'm not going to sit here and teach thousands of years' worth of CS, so if you don't want to read the book, that's your prerogative.
Wolfie wrote: and I have read about propositional logic and first-order logic, but what do they have in common with the probabilistic methods of Bayes?
They are all types of constraint satisfaction.
Wolfie wrote: Should I decide whether the mail is spam using true or false statements?
I don't get it.
Um, well, like I said, that's probably not what you want, but it would look something like +online ^ +casino => spam, -online ^ +casino => not spam.
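Read as hard true/false rules, the two examples in that notation might be encoded as follows. This is only a Python sketch: the `RULES` table and the `classify` function are hypothetical, `+word` is taken to mean "the word is present" and `-word` "the word is absent", and a message matching no rule is left undecided.

```python
# Each rule: ({word: must_be_present}, verdict).
RULES = [
    ({"online": True, "casino": True}, "spam"),       # +online ^ +casino => spam
    ({"online": False, "casino": True}, "not spam"),  # -online ^ +casino => not spam
]

def classify(text):
    """Return the verdict of the first rule whose conditions all hold, else None."""
    words = set(text.lower().split())
    for conditions, verdict in RULES:
        if all((word in words) == present for word, present in conditions.items()):
            return verdict
    return None  # no rule applies

print(classify("Best online casino bonus"))
print(classify("casino night at the club"))
print(classify("meeting at noon"))
```

Unlike the counting approach, rules like these give no graded answer: a message either matches a rule or it does not.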
Thing is, you asked specifically about probabilistic reasoning. For that you want fuzzy logic, not first-order logic; perhaps you need to reread the book in more detail and decide these things for yourself. I actually don't believe that you read it. Maybe you read OF it, or perhaps even skimmed some sections?
Re: Beyes filter - spam filtering
Posted: Sun Feb 01, 2009 12:32 pm
by Wolfie
Of course I didn't read the whole book.
Just chapter 3. It is all about propositional logic, and I have reviewed first-order logic, also on Wikipedia.
Now that you have told me which chapters to pay attention to, it will be easier to catch the point.
josh wrote: Um well like I said thats probably not what you want, but it would look something like +online ^ +casino => spam, -online ^ +casino => not spam
OK. Now it is clearer to me what you mean.
And of course I want to read the book, but it's quite large, and I don't have time to read all of it.
One more thing. My English is not at a high level, and sometimes I have to read your sentences a couple of times; maybe that's why we have small problems with communication.

Re: Beyes filter - spam filtering
Posted: Sun Feb 01, 2009 4:33 pm
by josh
Yeah, the language barrier, I know. You found it on Google Books or something? Just wondering, because that was fast. I had to wait like a week for mine. There is also a more concise book, "Collective Intelligence", which touches on the subject of neural networks and Bayes. I would definitely only concern yourself with the relevant chapters. I've read the first 10 chapters, but I'm reading it front to back. So far everything builds on other stuff. The Bayes concept is pretty basic, but I think if you took a few months to really read through the book you would get a clearer picture of how to use this stuff. They go into excruciating detail, but if you have the patience your work will pay off.
Re: Beyes filter - spam filtering
Posted: Sat Feb 07, 2009 11:26 pm
by Wolfie
Well... yes, I got it through the internet.

Moreover, at the beginning it is written that you can get it only in the US and Canada.
Actually, I stopped reading this book... I have found some information about implementing Bayes in the SpamAssassin documentation. The information is focused on spam filtering, so it is more convenient for me under time pressure.
Thanks for the help.
Re: Beyes filter - spam filtering
Posted: Sat Feb 07, 2009 11:44 pm
by josh
Umm, if I had known your end goal was a working spam filter and not AI research, I would have told you that originally.

procmail + SpamAssassin + sendmail!! (if you can deal with m4)
Re: Beyes filter - spam filtering
Posted: Mon Feb 09, 2009 6:31 am
by Wolfie
Yes... I need a spam filter, but not a ready-made one, because I am doing my master's thesis on spam filtering. The topic is "Web-based mail client with adaptive anti-spam filter".
So I am working on it.

Re: Beyes filter - spam filtering
Posted: Mon Feb 09, 2009 7:36 pm
by josh
You're definitely correct in looking into Bayes. Good luck.