PHP Developers Network
http://forums.devnetwork.net/

How to exclude some words from badwords filter?
http://forums.devnetwork.net/viewtopic.php?f=38&t=133488

Author:  MicroBoy [ Thu Dec 15, 2011 4:51 pm ]
Post subject:  How to exclude some words from badwords filter?

I have a function that filters bad words. I do not use word boundaries in it, but some harmless words contain a bad word as a substring, for example Essex. I do not want to filter Essex and block the content from being submitted; I want to block every other word that contains se*, excluding Essex.

Author:  ragax [ Thu Dec 15, 2011 5:43 pm ]
Post subject:  Re: How to exclude some words from badwords filter?

You say you don't use word boundaries. Do you mean that you don't want to use:

\bsax\b


which would match sax, but not Essax, saxual etc?

Maybe I am misunderstanding the purpose of your question.
If you can clarify (maybe what regex flavor you are using, and why not word boundaries), I'll be delighted to take a crack at it. :)

Author:  MicroBoy [ Mon Dec 19, 2011 5:04 pm ]
Post subject:  Re: How to exclude some words from badwords filter?

Yep, I do not want to use \bsax\b, because then users can write, for example, saxx and the system will not block it. I want to block the submission if the content contains any word with sax in it, like sax, saxx, newsaxx, saxest, etc., but not, for example, essax. I hope I was clear.
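
To make the objection concrete, here is a minimal sketch of how the word-boundary version behaves on the example words from this post. The variable names are illustrative and not taken from the original filter.

```php
<?php
// Word-boundary pattern from the earlier reply, with the case-insensitive flag.
$pattern = '/\bsax\b/i';

// Example words from the post; 'blocked' means the pattern matched.
foreach (['sax', 'saxx', 'newsaxx', 'essax'] as $word) {
    $verdict = preg_match($pattern, $word) === 1 ? 'blocked' : 'allowed';
    printf("%-8s %s\n", $word, $verdict);
}
```

Only the exact word sax is caught; saxx and newsaxx slip through, which is why a plain word-boundary pattern is not enough here.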

Author:  ragax [ Mon Dec 19, 2011 5:19 pm ]
Post subject:  Re: How to exclude some words from badwords filter?

Ah, I see... So you want to explicitly allow Essax.
(Do you have a list of words you want to explicitly allow?)

If I understood, this expression should do the trick. It contains a negative lookbehind that checks for the letters "es" before "sax".

(?i)(?<!es)sax


So "saxual" and "saxx" will be matched, but not "essax".

In case you are not familiar with it, (?i) is an inline flag that turns on case-insensitive mode. Of course, when you use the regex with PHP's preg functions, you could also turn it on for the whole expression by appending "i" after the closing delimiter, as in:

$regex=',(?<!es)sax,i';
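
As a quick check, the expression above can be run against a few sample words. This is only a sketch; the word list and the $results array are made up for illustration.

```php
<?php
// The expression from above: comma delimiters, case-insensitive flag.
$regex = ',(?<!es)sax,i';

// true means the word would be flagged by the filter.
$results = [];
foreach (['sax', 'saxual', 'saxx', 'Essax', 'Middlesax'] as $word) {
    $results[$word] = preg_match($regex, $word) === 1;
}

var_export($results);
```

Essax is the only word in this list that is let through. Middlesax is still flagged, because the two letters before "sax" are "le" rather than "es", so each allowed word needs its own exception.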


Are we getting close?
I am quite curious to see how this expression develops. In particular what tickles me is the question of the whitelist. Because there is no easy rule to distinguish "saxy" from "saxophone".
Looking forward to your next message.

Author:  MicroBoy [ Tue Dec 20, 2011 10:42 am ]
Post subject:  Re: How to exclude some words from badwords filter?

I have a list of whitelisted words, like Essax, benedick, etc., so they are specific words in a list.

Thanks a lot for your time.

Author:  ragax [ Tue Dec 20, 2011 2:53 pm ]
Post subject:  Re: How to exclude some words from badwords filter?

It's a pleasure. Fun to see what you're using regex for.

Author:  MicroBoy [ Tue Dec 20, 2011 4:15 pm ]
Post subject:  Re: How to exclude some words from badwords filter?

Yes, but do you know how to use it with a white list of exact words?

I use this now, but I need something that finds bad words while excluding the whitelisted words. I need something that lets my client write all the white words in a variable $whitewords:
preg_match_all("/(" . $badWords . ")/i", $content, $matches_a);

Author:  ragax [ Tue Dec 20, 2011 4:54 pm ]
Post subject:  Re: How to exclude some words from badwords filter?

Hi MicroBoy,

For me, at this stage it's no longer a regex question, but a coding question. But I might be wrong.

Let's say your client gives you "Essax" and "Middlesax" as words for the white list. And a hundred others. Now to apply the regex I gave you for Essax, you're going to need another regex to help you parse the white list and write the appropriate lookaheads and lookbehinds into an expression.
And you will need one expression for "sax", one for "duck", and so on.
So you have an array of bad words. For each bad word, an array of white words. For instance:

$bad[0][0]='sax';
$bad[0][1]='essax';
$bad[0][2]='middlesax';
$bad[1][0]='duck';
$bad[1][1]='donaldduck';
 


At some stage (hopefully during construction, not at run time), you prepare a list of expressions to dump in your code. For that, you write a function that iterates over $bad and builds you a series of regular expressions to use in your final code.
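
One possible shape for such a function is sketched below. It is not taken from any existing filter: buildBadWordPattern is a hypothetical helper, and it assumes ASCII words and that each white word should only be allowed as a whole word. For every white word it adds a guard meaning "not (preceded by the white word's prefix and followed by the rest of the white word)".

```php
<?php
// Hypothetical helper: builds one regex for a bad word and the
// whitelisted words that contain it.
function buildBadWordPattern(string $badWord, array $whiteWords): string
{
    $guards = '';
    foreach ($whiteWords as $white) {
        $offset = 0;
        // Cover every position where the bad word occurs in the white word.
        while (($pos = stripos($white, $badWord, $offset)) !== false) {
            $prefix = preg_quote(substr($white, 0, $pos), '/');
            $rest   = preg_quote(substr($white, $pos), '/');
            // Text before the bad word: a word boundary plus the prefix (if any).
            $before = $prefix === '' ? '\b' : '(?<=\b' . $prefix . ')';
            // The guard fails only when the whole white word is present here.
            $guards .= '(?!' . $before . $rest . '\b)';
            $offset = $pos + 1;
        }
    }
    return '/' . $guards . preg_quote($badWord, '/') . '/i';
}

// Same layout as $bad above: index 0 is the bad word, the rest are white words.
$bad = [
    ['sax', 'essax', 'middlesax'],
    ['duck', 'donaldduck'],
];

$patterns = [];
foreach ($bad as $group) {
    $patterns[] = buildBadWordPattern($group[0], array_slice($group, 1));
}

print_r($patterns);
```

The generated patterns could then be pasted into the final code or cached, so that the white list is only parsed once.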

That's one way to do it. But I haven't done it before, and I don't know how this kind of feature is implemented in real life. That's why I was suggesting that it might be a coding question at this stage. A few other ideas:

- for inspiration, you could look at an open-source forum package such as SMF (the Simple Machines Forum) or phpBB, which this very forum runs on. Both have a word censor feature that you can probably dig up.
- you may find a script already made, maybe on hostscripts.
- or even a PEAR library with the perfect function, to which you can pass your array.
- search for "bad word filter"; Google is your friend.

Wishing you a beautiful day,

-A

Author:  MicroBoy [ Wed Dec 21, 2011 1:17 pm ]
Post subject:  Re: How to exclude some words from badwords filter?

Just wanted to let you and others with the same issue know that I found a solution. Using preg_replace, I first replaced every white-list word found in the content with #$$$#, and then checked the content for bad words. So basically the content was free of white-list words, and they were not flagged as bad words because they had already been replaced with #$$$#. I hope I was clear.
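
For anyone who wants to try it, the approach could look roughly like the sketch below. This is a reconstruction, not the exact code from the project: the function name containsBadWord and the array parameters are made up for illustration, and matching white words only as whole words (\b) is an assumption.

```php
<?php
// Sketch of the "mask the white list first" approach described above.
function containsBadWord(string $content, array $badWords, array $whiteWords): bool
{
    if ($badWords === []) {
        return false; // nothing to look for
    }

    // Escape each word so it is treated literally inside the pattern.
    $quote = function (string $word): string {
        return preg_quote($word, '/');
    };

    if ($whiteWords !== []) {
        // Step 1: replace whole white-list words with a placeholder.
        $whitePattern = '/\b(?:' . implode('|', array_map($quote, $whiteWords)) . ')\b/i';
        $content = preg_replace($whitePattern, '#$$$#', $content);
    }

    // Step 2: look for any bad word as a substring of what is left.
    $badPattern = '/(?:' . implode('|', array_map($quote, $badWords)) . ')/i';
    return preg_match($badPattern, $content) === 1;
}

var_dump(containsBadWord('I live in Essax', ['sax', 'duck'], ['essax', 'middlesax']));
var_dump(containsBadWord('Essax and newsaxx', ['sax', 'duck'], ['essax', 'middlesax']));
```

The first call finds nothing once Essax is masked; the second still reports a bad word because newsaxx is not on the white list.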

Best wishes.

Author:  ragax [ Wed Dec 21, 2011 1:56 pm ]
Post subject:  Re: How to exclude some words from badwords filter?

Sweet. So simple. Thanks for sharing.

All times are UTC - 5 hours