How to exclude some words from badwords filter?

MicroBoy
Forum Contributor
Posts: 112
Joined: Sat Mar 14, 2009 5:16 pm

Post by MicroBoy »

I have a function that filters bad words. I do not use word boundaries in it, but some harmless words contain a bad word as a substring, for example Essex. I do not want to filter Essex and block the content from being submitted; I want to block every other word that contains se*, excluding Essex.
ragax
Forum Commoner
Posts: 85
Joined: Thu Dec 15, 2011 1:40 pm
Location: Nelson, NZ

Re: How to exclude some words from badwords filter?

Post by ragax »

You say you don't use word boundaries. Do you mean that you don't want to use:

Code: Select all

\bsax\b
which would match sax, but not Essax, saxual etc?

Maybe I am misunderstanding the purpose of your question.
If you can clarify (maybe what regex flavor you are using, and why not word boundaries), I'll be delighted to take a crack at it. :)

Post by MicroBoy »

Yep, I do not want to use \bsax\b, because then users can write, for example, saxx and the system will not block it. I want to block the submission if the content contains any word with sax in it, like sax, saxx, newsaxx, saxest, etc., but not, for example, essax. I hope I was clear.
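
The trade-off described here can be checked with a small sketch. It assumes PHP's preg functions (the flavor used later in this thread) and takes its sample words from the posts; the variable names are only illustrative.

```php
<?php
// Sample words taken from the posts above.
$words = ['sax', 'saxx', 'newsaxx', 'saxest', 'essax'];

$results = [];
foreach ($words as $word) {
    $results[$word] = [
        // With word boundaries: only the exact word "sax" is caught.
        'bounded' => preg_match('/\bsax\b/i', $word),
        // Without boundaries: every occurrence is caught, including "essax".
        'plain'   => preg_match('/sax/i', $word),
    ];
}

print_r($results);
```

With boundaries only "sax" is reported, so "saxx" and "newsaxx" slip through; without them every word is reported, including the harmless "essax".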

Post by ragax »

Ah, I see... So you want to explicitly allow Essax.
(Do you have a list of words you want to explicitly allow?)

If I understood, this expression should do the trick. It contains a negative lookbehind that checks for the letters "es" before "sax".

Code: Select all

(?i)(?<!es)sax
So "saxual" and "saxx" will be matched, but not "essax".

In case you are not familiar with it, (?i) is an inline flag that turns on the case insensitive mode. Of course in a preg application of the regex you could also turn it on for the whole expression by appending "i" to the right of the closing delimiter, as in:

Code: Select all

$regex=',(?<!es)sax,i';
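
As a quick check, here is a minimal sketch that runs the expression above against a few sample words with preg_match. The sample list and the $flagged variable are assumptions made for illustration.

```php
<?php
$regex = ',(?<!es)sax,i'; // comma delimiters, "i" modifier after the closing one

$samples = ['sax', 'saxual', 'saxx', 'newsaxx', 'Essax', 'ESSAX'];

$flagged = [];
foreach ($samples as $sample) {
    // preg_match returns 1 when the pattern is found somewhere in the string.
    if (preg_match($regex, $sample) === 1) {
        $flagged[] = $sample;
    }
}

print_r($flagged); // sax, saxual, saxx, newsaxx
```

"Essax" and "ESSAX" are skipped because the lookbehind is also matched case-insensitively.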
Are we getting close?
I am quite curious to see how this expression develops. In particular what tickles me is the question of the whitelist. Because there is no easy rule to distinguish "saxy" from "saxophone".
Looking forward to your next message.

Post by MicroBoy »

I have a list of whitelisted words, like Essax, benedick, etc., so it is a fixed set of specific words.

Thanks a lot for your time.

Post by ragax »

It's a pleasure. Fun to see what you're using regex for.

Post by MicroBoy »

Yes, but do you know how to use it with a whitelist of exact words?

This is what I use now, but I need something that finds bad words while excluding whitelisted words, and that lets my client write all the whitelisted words in a variable $whitewords:

Code: Select all

preg_match_all("/(" . $badWords . ")/i", $content, $matches_a);

Post by ragax »

Hi MicroBoy,

For me, at this stage it's no longer a regex question, but a coding question. But I might be wrong.

Let's say your client gives you "Essax" and "Middlesax" as words for the white list. And a hundred others. Now to apply the regex I gave you for Essax, you're going to need another regex to help you parse the white list and write the appropriate lookaheads and lookbehinds into an expression.
And you will need one expression for "sax", one for "duck", and so on.
So you have an array of bad words. For each bad word, an array of white words. For instance:

Code: Select all

$bad[0][0]='sax';
$bad[0][1]='essax';
$bad[0][2]='middlesax';
$bad[1][0]='duck';
$bad[1][1]='donaldduck';
At some stage (hopefully during construction, not at run time), you prepare a list of expressions to dump in your code. For that, you write a function that iterates over $bad and builds you a series of regular expressions to use in your final code.
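
One possible shape for such a function is sketched below. It is only an illustration under stated assumptions: the name buildBadWordPatterns is hypothetical, and instead of a bare lookbehind it wraps each whitelisted word in a negative lookahead with a nested lookbehind, so that whitelisted words with letters after the bad word are handled too. Note that any longer word containing a whitelisted word (for example "essaxx") is let through as well.

```php
<?php
// Hypothetical helper: builds one case-insensitive pattern per bad word.
// Each inner array holds the bad word first, then its whitelisted words,
// following the $bad layout shown above.
function buildBadWordPatterns(array $bad): array
{
    $patterns = [];
    foreach ($bad as $group) {
        $word = strtolower(array_shift($group));
        $guards = '';
        foreach ($group as $white) {
            $white = strtolower($white);
            $pos = strpos($white, $word);
            if ($pos === false) {
                continue; // this whitelisted word does not contain the bad word
            }
            $prefix = substr($white, 0, $pos); // letters before the bad word
            $rest = substr($white, $pos);      // bad word plus trailing letters
            // Negative lookahead: refuse a match that is part of $white.
            $guards .= '(?!'
                . ($prefix === '' ? '' : '(?<=' . preg_quote($prefix, '/') . ')')
                . preg_quote($rest, '/') . ')';
        }
        $patterns[$word] = '/' . $guards . preg_quote($word, '/') . '/i';
    }
    return $patterns;
}

$bad = [
    ['sax', 'essax', 'middlesax'],
    ['duck', 'donaldduck'],
];
$patterns = buildBadWordPatterns($bad);

// $patterns['sax'] is '/(?!(?<=es)sax)(?!(?<=middle)sax)sax/i'
var_dump(preg_match($patterns['sax'], 'Greetings from Essax')); // int(0)
var_dump(preg_match($patterns['sax'], 'newsaxx'));              // int(1)
```

The resulting patterns could be generated once and stored, so that nothing has to be rebuilt at run time.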

That's one way to do it. But I haven't done it before, and I don't know how this kind of feature is implemented in real life. That's why I was suggesting that it might be a coding question at this stage. A few other ideas:

- for inspiration, you could look at SMF (the Simple Machines Forum), which this very forum looks to be built on. It is open source and has a bad word filter that you can probably dig up.
- you may find a script already made, maybe on hostscripts.
- or even a PEAR library with the perfect function, to which you can pass your array.
- search for "bad word filter": Google is your friend.

Wishing you a beautiful day,

-A

Post by MicroBoy »

Just wanted to let you and others with the same issue know that I found a solution. Using preg_replace, I first replaced every whitelisted word found in the content with #$$$#, and then checked the content for bad words. By that point the content was free of whitelisted words, so they were not flagged as bad words, because they had already been replaced with #$$$#. I hope I was clear.
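
The two steps described above might look roughly like the sketch below. The function name, the sample word lists, and the whole-word matching (\b) for whitelisted words are all assumptions, not the actual code; only preg_replace and the #$$$# placeholder come from the description.

```php
<?php
// Hypothetical helper; the names and word lists are illustrative only.
function containsBadWord(string $content, array $badWords, array $whiteWords): bool
{
    if ($badWords === []) {
        return false; // nothing to look for
    }
    // Escape each word so characters like "." or "/" are taken literally.
    $quote = function ($word) {
        return preg_quote($word, '/');
    };

    // Step 1: mask whitelisted words with the placeholder.
    // Whole-word matching (\b) is an assumption made for this sketch.
    if ($whiteWords !== []) {
        $whitePattern = '/\b(?:' . implode('|', array_map($quote, $whiteWords)) . ')\b/i';
        // "$" is literal here because it is not followed by a digit or "{".
        $content = preg_replace($whitePattern, '#$$$#', $content);
    }

    // Step 2: look for bad words in what is left, without word boundaries.
    $badPattern = '/(?:' . implode('|', array_map($quote, $badWords)) . ')/i';
    return preg_match($badPattern, $content) === 1;
}

$bad = ['sax', 'duck'];
$white = ['essax', 'middlesax'];

var_dump(containsBadWord('A trip to Essax', $bad, $white));  // bool(false)
var_dump(containsBadWord('newsaxx in Essax', $bad, $white)); // bool(true)
```

Because the whitelist is applied first, the bad-word pattern itself stays simple and the client only has to maintain two plain word lists.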

Best wishes.

Post by ragax »

Sweet. So simple. Thanks for sharing.