How to exclude some words from badwords filter?
Posted: Thu Dec 15, 2011 3:51 pm
by MicroBoy
I have a function that filters bad words, and I do not use word boundaries in it. The problem is that some harmless words contain a bad word, for example Essex. I do not want to filter Essex and block the content from being submitted; I want to block every other word that contains se*, excluding Essex.
Re: How to exclude some words from badwords filter?
Posted: Thu Dec 15, 2011 4:43 pm
by ragax
You say you don't use word boundaries. Do you mean that you don't want to use \bsax\b, which would match sax, but not Essax, saxual etc?
Maybe I am misunderstanding the purpose of your question.
If you can clarify (maybe what regex flavor you are using, and why not word boundaries), I'll be delighted to take a crack at it.

Re: How to exclude some words from badwords filter?
Posted: Mon Dec 19, 2011 4:04 pm
by MicroBoy
Yep, I do not want to use \bsax\b, because then users can write, for example, saxx and the system will not block it. I want to block the submission if the content contains any word with sax in it, like sax, saxx, newsaxx, saxest etc., but not, for example, essax. I hope I was clear.
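To make the difference concrete, here is a minimal PHP sketch that runs the bounded and unbounded patterns over the sample words from this post. The helper name compare_patterns is made up for illustration, and "sax" stands in for the real bad word.

```php
<?php
// Hypothetical helper: reports whether each of the two patterns
// discussed above matches the given word.
function compare_patterns(string $word): array
{
    return [
        'bounded'   => preg_match('/\bsax\b/i', $word) === 1, // whole word only
        'unbounded' => preg_match('/sax/i', $word) === 1,     // any substring
    ];
}

foreach (['sax', 'saxx', 'newsaxx', 'saxest', 'essax'] as $word) {
    $r = compare_patterns($word);
    printf(
        "%-8s bounded=%s unbounded=%s\n",
        $word,
        var_export($r['bounded'], true),
        var_export($r['unbounded'], true)
    );
}
```

The bounded pattern only catches "sax" itself, while the unbounded one catches every word in the list, including "essax", which is exactly the false positive in question.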
Re: How to exclude some words from badwords filter?
Posted: Mon Dec 19, 2011 4:19 pm
by ragax
Ah, I see... So you want to explicitly allow Essax.
(Do you have a list of words you want to explicitly allow?)
If I understood, this expression should do the trick: (?i)(?<!es)sax
It contains a negative lookbehind that checks for the letters "es" before "sax".
So "saxual" and "saxx" will be matched, but not "essax".
In case you are not familiar with it, (?i) is an inline flag that turns on case-insensitive mode. Of course, in a preg application of the regex you could also turn it on for the whole expression by appending "i" to the right of the closing delimiter, as in: /(?<!es)sax/i
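A quick way to try the lookbehind in PHP might look like the sketch below. The pattern is assumed from the description above ("sax" not immediately preceded by "es"), and is_blocked is a hypothetical helper name.

```php
<?php
// Assumed pattern: "sax" not immediately preceded by "es".
$inline   = '/(?i)(?<!es)sax/'; // case-insensitive via the inline flag
$modifier = '/(?<!es)sax/i';    // same thing via the trailing "i" modifier

// Hypothetical helper: true when the pattern finds a bad word in the text.
function is_blocked(string $pattern, string $text): bool
{
    return preg_match($pattern, $text) === 1;
}

foreach (['saxual', 'saxx', 'newsaxx', 'Essax', 'ESSAX'] as $word) {
    printf(
        "%-8s inline=%s modifier=%s\n",
        $word,
        var_export(is_blocked($inline, $word), true),
        var_export(is_blocked($modifier, $word), true)
    );
}
```

Both forms should behave the same: saxual, saxx and newsaxx are blocked, Essax is not. Note that the lookbehind only inspects the two characters before "sax", so any other word ending in "...essax" would slip through as well.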
Are we getting close?
I am quite curious to see how this expression develops. In particular what tickles me is the question of the whitelist. Because there is no easy rule to distinguish "saxy" from "saxophone".
Looking forward to your next message.
Re: How to exclude some words from badwords filter?
Posted: Tue Dec 20, 2011 9:42 am
by MicroBoy
I have a list of white words, like Essax, benedick etc., so they are specific words in a list.
Thanks a lot for your time.
Re: How to exclude some words from badwords filter?
Posted: Tue Dec 20, 2011 1:53 pm
by ragax
It's a pleasure. Fun to see what you're using regex for.
Re: How to exclude some words from badwords filter?
Posted: Tue Dec 20, 2011 3:15 pm
by MicroBoy
Yes, but do you know how to use it with a white list of exact words?
This is what I use now, but I need something that finds bad words while excluding white words, so that my client can put all the white words in a variable $whitewords:
Code:
preg_match_all("/(" . $badWords . ")/i", $content, $matches_a);
Re: How to exclude some words from badwords filter?
Posted: Tue Dec 20, 2011 3:54 pm
by ragax
Hi MicroBoy,
For me, at this stage it's no longer a regex question, but a coding question. But I might be wrong.
Let's say your client gives you "Essax" and "Middlesax" as words for the white list. And a hundred others. Now to apply the regex I gave you for Essax, you're going to need another regex to help you parse the white list and write the appropriate lookaheads and lookbehinds into an expression.
And you will need one expression for "sax", one for "duck", and so on.
So you have an array of bad words. For each bad word, an array of white words. For instance:
Code:
$bad[0][0]='sax';
$bad[0][1]='essax';
$bad[0][2]='middlesax';
$bad[1][0]='duck';
$bad[1][1]='donaldduck';
At some stage (hopefully during construction, not at run time), you prepare a list of expressions to dump in your code. For that, you write a function that iterates over $bad and builds you a series of regular expressions to use in your final code.
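As a rough sketch of such a builder in PHP (not a tested production filter): it takes the $bad layout from the post and produces one pattern per bad word. The names build_patterns and find_bad_words are hypothetical. Instead of a bare lookbehind, it wraps each allowed word in a negative lookahead that contains the lookbehind, so the prefix and the rest of the allowed word have to match together before a hit is suppressed.

```php
<?php
// Layout from the post: $bad[n][0] is the bad word,
// the remaining entries are its allowed words.
$bad = [
    ['sax', 'essax', 'middlesax'],
    ['duck', 'donaldduck'],
];

// Hypothetical builder: one pattern per bad word, with one guard per allowed word.
function build_patterns(array $bad): array
{
    $patterns = [];
    foreach ($bad as $group) {
        $word   = strtolower($group[0]);
        $guards = '';
        foreach (array_slice($group, 1) as $allowed) {
            $allowed = strtolower($allowed);
            $pos     = strpos($allowed, $word);
            if ($pos === false) {
                continue; // allowed word does not contain the bad word
            }
            $prefix = substr($allowed, 0, $pos); // text before the bad word
            $rest   = substr($allowed, $pos);    // bad word plus any suffix
            $behind = $prefix === '' ? '' : '(?<=' . preg_quote($prefix, '/') . ')';
            $guards .= '(?!' . $behind . preg_quote($rest, '/') . ')';
        }
        $patterns[$word] = '/' . $guards . preg_quote($word, '/') . '/i';
    }
    return $patterns;
}

// Hypothetical scanner: returns the bad words found in the content.
function find_bad_words(array $patterns, string $content): array
{
    $found = [];
    foreach ($patterns as $word => $pattern) {
        if (preg_match($pattern, $content) === 1) {
            $found[] = $word;
        }
    }
    return $found;
}

$patterns = build_patterns($bad);
print_r($patterns);
print_r(find_bad_words($patterns, 'Donaldduck said saxx'));
```

The sketch only handles the first occurrence of the bad word inside each allowed word, and it does not use word boundaries, so "essaxx" would be let through along with "essax".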
That's one way to do it. But I haven't done it before, and I don't know how this kind of feature is implemented in real life. That's why I was suggesting that it might be a coding question at this stage. A few other ideas:
- for inspiration, you could look at SMF (the Simple Machines Forum), which this very forum looks to be built on. It is open source and has a bad word filter that you can probably dig up.
- you may find a script already made, maybe on hostscripts.
- or even a PEAR library with the perfect function, to which you can pass your array.
- bad word filter, google is your friend.
Wishing you a beautiful day,
-A
Re: How to exclude some words from badwords filter?
Posted: Wed Dec 21, 2011 12:17 pm
by MicroBoy
Just wanted to let others with the same issue know that I found a solution. Using preg_replace, I first replaced every white-list word found in the content with #$$$#, and then checked the content for bad words. At that point the content was free of white-list words, so they were not flagged as bad words, because they had been replaced with #$$$#. I hope I was clear.
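For anyone who wants to try it, the approach could be sketched roughly like this. This is not the exact code from my project: the function name contains_bad_word, the use of arrays for both lists, and the \b boundaries around the white words are assumptions made for the example.

```php
<?php
// Hypothetical helper: mask white-list words first, then look for bad words
// in whatever is left.
function contains_bad_word(string $content, array $whiteWords, array $badWords): bool
{
    if ($badWords === []) {
        return false; // nothing to look for
    }
    $quote = function ($w) {
        return preg_quote($w, '/'); // escape regex metacharacters in each word
    };
    if ($whiteWords !== []) {
        $white = implode('|', array_map($quote, $whiteWords));
        // Single quotes keep PHP from reading "$" as the start of a variable.
        // The \b boundaries (an assumption) mask only exact white words.
        $content = preg_replace('/\b(?:' . $white . ')\b/i', '#$$$#', $content);
    }
    $bad = implode('|', array_map($quote, $badWords));
    return preg_match('/(?:' . $bad . ')/i', $content) === 1;
}

$whiteWords = ['Essax', 'donaldduck'];
$badWords   = ['sax', 'duck'];

var_dump(contains_bad_word('I live in Essax', $whiteWords, $badWords));
var_dump(contains_bad_word('this is saxx', $whiteWords, $badWords));
```

With the boundaries in place, "Essax" is masked and passes, while "Essaxx" is not an exact white word and is still caught.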
Best wishes.
Re: How to exclude some words from badwords filter?
Posted: Wed Dec 21, 2011 12:56 pm
by ragax
Sweet. So simple. Thanks for sharing.