PHP Developers Network

A community of PHP developers offering assistance, advice, discussion, and friendship.
 

All times are UTC - 5 hours




PostPosted: Thu Dec 15, 2011 4:51 pm 
Forum Contributor

Joined: Sat Mar 14, 2009 5:16 pm
Posts: 112
I have a function that filters some bad words. I do not use word boundaries in it, but some harmless words contain a bad word as a substring, for example Essex. I do not want to filter Essex and block the content from being submitted; I want to block every other word that contains se*, excluding Essex.


PostPosted: Thu Dec 15, 2011 5:43 pm
Forum Commoner

Joined: Thu Dec 15, 2011 2:40 pm
Posts: 85
Location: Nelson, NZ
You say you don't use word boundaries. Do you mean that you don't want to use:

\bsax\b


which would match sax, but not Essax, saxual etc?

Maybe I am misunderstanding the purpose of your question.
If you can clarify (maybe what regex flavor you are using, and why not word boundaries), I'll be delighted to take a crack at it. :)


PostPosted: Mon Dec 19, 2011 5:04 pm
Forum Contributor
Yes, I do not want to use \bsax\b, because users could then write, for example, saxx, and the system would not block it. I want to block the submission if the content contains any word with sax in it, such as sax, saxx, newsaxx, saxest, etc., but not, for example, essax. I hope that is clear.


PostPosted: Mon Dec 19, 2011 5:19 pm
Forum Commoner
Ah, I see... So you want to explicitly allow Essax.
(Do you have a list of words you want to explicitly allow?)

If I understood, this expression should do the trick. It contains a negative lookbehind that checks for the letters "es" before "sax".

(?i)(?<!es)sax


So "saxual" and "saxx" will be matched, but not "essax".

In case you are not familiar with it, (?i) is an inline flag that turns on the case insensitive mode. Of course in a preg application of the regex you could also turn it on for the whole expression by appending "i" to the right of the closing delimiter, as in:

$regex=',(?<!es)sax,i';
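
As a quick check, here is a minimal sketch that runs the lookbehind against the sample words mentioned in this thread. The word list and the variable names are only illustrative, and the pattern uses / as the delimiter instead of the comma.

```php
<?php
// Illustrative only: $words and $blocked are made-up names,
// the sample words are the ones mentioned in this thread.
$regex = '/(?<!es)sax/i';
$words = ['sax', 'saxx', 'newsaxx', 'saxest', 'Essax'];

$blocked = [];
foreach ($words as $word) {
    // preg_match returns 1 when "sax" is found without "es" right before it
    if (preg_match($regex, $word) === 1) {
        $blocked[] = $word;
    }
}

print_r($blocked); // Essax is the only word left out of the list
```

Note that a word such as Middlesax would still be blocked by this pattern, since the lookbehind only knows about "es".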


Are we getting close?
I am quite curious to see how this expression develops. In particular what tickles me is the question of the whitelist. Because there is no easy rule to distinguish "saxy" from "saxophone".
Looking forward to your next message.


PostPosted: Tue Dec 20, 2011 10:42 am
Forum Contributor
I have a list of whitelisted words, such as Essax, benedick, etc., so they are specific words in a list.

Thanks a lot for your time.


PostPosted: Tue Dec 20, 2011 2:53 pm
Forum Commoner
It's a pleasure. Fun to see what you're using regex for.


PostPosted: Tue Dec 20, 2011 4:15 pm
Forum Contributor
Yes, but do you know how to do this with a whitelist of exact words?

I use the line below for now, but I need something that finds bad words while excluding the whitelisted words. My client should be able to write all the whitelisted words in a variable, $whitewords:

preg_match_all("/(" . $badWords . ")/i", $content, $matches_a);


PostPosted: Tue Dec 20, 2011 4:54 pm
Forum Commoner
Hi MicroBoy,

For me, at this stage it's no longer a regex question, but a coding question. But I might be wrong.

Let's say your client gives you "Essax" and "Middlesax" as words for the white list. And a hundred others. Now to apply the regex I gave you for Essax, you're going to need another regex to help you parse the white list and write the appropriate lookaheads and lookbehinds into an expression.
And you will need one expression for "sax", one for "duck", and so on.
So you have an array of bad words. For each bad word, an array of white words. For instance:

$bad[0][0]='sax';
$bad[0][1]='essax';
$bad[0][2]='middlesax';
$bad[1][0]='duck';
$bad[1][1]='donaldduck';
 


At some stage (hopefully during construction, not at run time), you prepare a list of expressions to dump in your code. For that, you write a function that iterates over $bad and builds you a series of regular expressions to use in your final code.
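
To make the idea concrete, here is a rough sketch of such a builder, under a few assumptions. buildPattern() is a hypothetical helper, not something from a library. Instead of a bare lookbehind it wraps each white word in a negative lookahead that starts with a lookbehind, so a white word with letters after the bad word would be handled the same way. It only looks at the first occurrence of the bad word inside each white word.

```php
<?php
// Same layout as above: index 0 is the bad word, the remaining
// entries are white words that contain it.
$bad = [
    ['sax', 'essax', 'middlesax'],
    ['duck', 'donaldduck'],
];

// Hypothetical helper: builds one case-insensitive pattern per bad word.
function buildPattern(array $entry): string
{
    $badWord = array_shift($entry);
    $guards = '';
    foreach ($entry as $white) {
        $pos = stripos($white, $badWord);
        if ($pos === false) {
            continue; // this white word does not contain the bad word
        }
        $prefix = substr($white, 0, $pos); // letters before the bad word
        $rest = substr($white, $pos);      // bad word plus any trailing letters
        $behind = $prefix === '' ? '' : '(?<=' . preg_quote($prefix, '/') . ')';
        // "do not match here if this spot is exactly the white word"
        $guards .= '(?!' . $behind . preg_quote($rest, '/') . ')';
    }
    return '/' . $guards . preg_quote($badWord, '/') . '/i';
}

$patterns = array_map('buildPattern', $bad);
print_r($patterns);
```

The generated strings could then be pasted into the final code, or the function could be run once at startup if the lists are short.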

That's one way to do it. But I haven't done it before, and I don't know how this kind of feature is implemented in real life. That's why I was suggesting that it might be a coding question at this stage. A few other ideas:

- for inspiration, you could look at SMF (the Simple Machines Forum), which this very forum looks to be built on. It is open source and has a bad word filter that you can probably dig up.
- you may find a script already made, maybe on hostscripts.
- or even a PEAR library with the perfect function, to which you can pass your array.
- bad word filter, google is your friend.

Wishing you a beautiful day,

-A


PostPosted: Wed Dec 21, 2011 1:17 pm
Forum Contributor
Just wanted to let you, and anyone else with the same issue, know that I found a solution. Using preg_replace, I first replaced every whitelisted word found in the content with #$$$#, and then checked the content for bad words. At that point the content no longer contained any whitelisted words, so they were not flagged as bad words. I hope that is clear.
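
For anyone who wants to try it, the approach could look roughly like the sketch below. This is not the exact code from my site: findBadWords(), $whiteWords and the sample lists are made-up names and values, and only the #$$$# placeholder and the two-step idea are as described above.

```php
<?php
// Sample data, for illustration only.
$whiteWords = ['Essax', 'Middlesax'];
$badWords = 'sax|duck'; // alternation string, as used with preg_match_all earlier

// Hypothetical helper: masks white words first, then looks for bad words.
function findBadWords(string $content, array $whiteWords, string $badWords): array
{
    // Longest white words first, so a short one cannot split a longer one.
    usort($whiteWords, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $whiteWords);

    $masked = $content;
    if ($quoted !== []) {
        // Single quotes keep PHP from interpolating the $ signs.
        $masked = preg_replace('/' . implode('|', $quoted) . '/i', '#$$$#', $content);
    }

    preg_match_all('/(' . $badWords . ')/i', $masked, $matches);
    return $matches[0]; // every bad word still found after masking
}

print_r(findBadWords('A saxx player from Essax', $whiteWords, $badWords));
```

One thing to keep in mind is that anything glued to a white word, such as Essaxx, is masked as well and will pass the check.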

Best wishes.


PostPosted: Wed Dec 21, 2011 1:56 pm
Forum Commoner
Sweet. So simple. Thanks for sharing.

