Finding Banned Words On A Page And Not Within Other Words

Posted: Wed Oct 04, 2017 5:29 pm
by UniqueIdeaMan
Folks!

I am trying to add a banned-words filter to a web proxy.
I want to find banned words anywhere in a loaded page (meta tags, content), but only as whole words, not as parts of other words.

And so, if I am looking for the word "cock", then the word "cockerel" should not trigger the filter.

I just tested the code below and, as expected, it works, but it burns a lot of CPU. One moment the page loads; the next it goes grey and shows signs of taking too long to load. And all this on localhost, so I can imagine what my web host would do!
So we need a better solution. Any ideas?
How about not checking the loaded page for every banned word? Instead, the script could halt as soon as one banned word is found and echo which word was found and where on the page (meta tags, body content, etc.).
Any code suggestions?
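
One way to report where a word was found and stop at the first hit is to parse the page and check the meta tags and the body text separately. The following is only a minimal sketch: the function name find_first_banned() and the 'where'/'word' array keys are made up for illustration, and it assumes PHP 7.4+ with the DOM extension enabled and a non-empty HTML string.

```php
<?php
// Hypothetical helper: returns the first banned word found and where it was
// found, or null when the page is clean.
function find_first_banned(string $html, array $bannedWords): ?array
{
    if ($bannedWords === []) {
        return null;
    }

    // One case-insensitive pattern; \b keeps "cock" from matching "cockerel".
    $quoted  = array_map(fn($w) => preg_quote($w, '/'), $bannedWords);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // 1) Meta tags: check each content attribute and stop at the first hit.
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (preg_match($pattern, $meta->getAttribute('content'), $m) === 1) {
            return ['where' => 'meta tag', 'word' => $m[0]];
        }
    }

    // 2) Body text (this includes the text of any inline scripts).
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null && preg_match($pattern, $body->textContent, $m) === 1) {
        return ['where' => 'body content', 'word' => $m[0]];
    }

    return null;
}
```

Because the function returns on the first match, the rest of the page is never scanned once a banned word turns up.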

Here is what I got so far:

Code: Select all

    <?php
 
    /*
    ERROR HANDLING
    */
 
    // 1). $curl is going to be data type curl resource.
    $curl = curl_init();
 
    // 2). Set cURL options.
    curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true );
 
    // 3). Run cURL (execute http request).
    $result = curl_exec($curl);
    $response = curl_getinfo( $curl );
 
    if( $response['http_code'] == '200' )
        {
            //Set banned words.
            $banned_words = array("Prick","Dick","***");
 
            //Separate each words found on the cURL fetched page.
            $word = explode(" ", $result);
    
           //var_dump($word);
 
            // Use < rather than <= so the loop never reads past the last word.
            $word_count = count($word);
            for($i = 0; $i < $word_count; $i++)
            {
                foreach ($banned_words as $ban)
                {
                    if (strtolower($word[$i]) == strtolower($ban))
                    {
                        echo "word: $word[$i]<br />";
                        echo "Match: $ban<br>";
                    }
                    else
                    {
                        echo "word: $word[$i]<br />";
                        echo "No Match: $ban<br>";
                    }
                }
            }
        }
 
    // 4). Close cURL resource.
    curl_close($curl);
I am told to do it like this:

**Load the page into a string.
Use preg_match with "word boundaries" on the loaded string and loop through your banned words.**
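
A minimal sketch of that advice, assuming the page is already in a string. The function name first_whole_word_match() and the 'word'/'offset' keys are hypothetical; preg_quote() is added so that a banned word containing regex characters cannot break the pattern.

```php
<?php
// Hypothetical helper: returns the first banned word (in list order) that
// appears as a whole word in $text, with its byte offset, or null if none does.
function first_whole_word_match(string $text, array $bannedWords): ?array
{
    foreach ($bannedWords as $ban) {
        // \b = word boundary, i = case-insensitive match.
        $pattern = '/\b' . preg_quote($ban, '/') . '\b/i';

        if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE) === 1) {
            // Stop at the first hit instead of testing the remaining words.
            return ['word' => $ban, 'offset' => $m[0][1]];
        }
    }

    return null;
}
```

With the banned word "cock", the text "The cockerel crowed." gives null, while "The Cock crowed." gives the word "cock" at offset 4. Note that "first" here means first in the banned-words list, not first on the page.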

UPDATE:
I updated my code inserting miknik's codes. It was working fine until I added this line before the cURL:
$banned_words = array("Prick","Dick","***");

Here's the update:

Code: Select all

    <?php
 
    /*
    ERROR HANDLING
    */

    // 1). Set banned words.
    $banned_words = array("Prick","Dick","***");
 
    // 2). $curl is going to be data type curl resource.
    $curl = curl_init();
 
    // 3). Set cURL options.
    curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-
    words-
    you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true );
 
    // 4). Run cURL (execute http request).
    $result = curl_exec($curl);
    $response = curl_getinfo( $curl );
 
    if($response['http_code'] == '200' )
	     {
			  $regex = '/\b';      // The beginning of the regex string syntax
			  $regex .= implode('\b|\b', $banned_words);      // joins all the 
              banned words to the string with correct regex syntax
			  $regex .= '\b/i';    // Adds ending to regex syntax. Final i makes 
              it case insensitive
			  $substitute = '****';
			  $cleanresult = preg_replace($regex, $substitute, $result);
			  echo $cleanresult;
	     }

      curl_close($curl);

      ?>
Why do I now see a completely blank page?

Re: Finding Banned Words On A Page And Not Within Other Word

Posted: Thu Oct 05, 2017 5:01 am
by Celauran
UniqueIdeaMan wrote:I updated my code inserting miknik's codes. It was working fine until I added this line before the cURL:
$banned_words = array("Prick","Dick","***");
Are you suggesting that simply creating an array broke functionality? Dubious. What sorts of errors are you seeing?

Re: Finding Banned Words On A Page And Not Within Other Word

Posted: Thu Oct 05, 2017 7:01 am
by UniqueIdeaMan
Celauran wrote:
UniqueIdeaMan wrote:I updated my code inserting miknik's codes. It was working fine until I added this line before the cURL:
$banned_words = array("Prick","Dick","***");
Are you suggesting that simply creating an array broke functionality? Dubious. What sorts of errors are you seeing?
I get a complete blank page. No error. Error reporting on.
Update:

Code: Select all

<?php

/*
ERROR HANDLING
*/
declare(strict_types=1);
ini_set('display_errors', '1');
ini_set('display_startup_errors', '1');
error_reporting(E_ALL);
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);


// 1). Set banned words.
$banned_words = array("Prick","Dick","<span style='color:blue' title='I&#39;m naughty, are you naughty?'>smurf</span>");

// 2). $curl is going to be data type curl resource.
$curl = curl_init();

// 3). Set cURL options.
curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-
words-
you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true );

// 4). Run cURL (execute http request).
$result = curl_exec($curl);
$response = curl_getinfo( $curl );

if($response['http_code'] == '200' )
     {
          $regex = '/\b'; // The beginning of the regex string syntax
          $regex .= implode('\b|\b', $banned_words); // joins all the banned words to the string with correct regex syntax
          $regex .= '\b/i'; // Adds ending to regex syntax. Final i makes it case insensitive
          $substitute = 'd';
          $cleanresult = preg_replace($regex, $substitute, $result);
          echo $cleanresult;
     }

  curl_close($curl);

  ?>

Re: Finding Banned Words On A Page And Not Within Other Word

Posted: Thu Oct 05, 2017 5:00 pm
by UniqueIdeaMan
Celauran,

Why don't you run my code from your Notepad++ and see the blank page for yourself?
This is very, very strange!

Re: Finding Banned Words On A Page And Not Within Other Word

Posted: Thu Oct 05, 2017 7:49 pm
by Celauran
Your echo statement is inside a conditional. Have you checked the response from cURL? Maybe you're not getting a 200.
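
A small sketch of such a check, so that a failed request prints a message instead of nothing. The helper name describe_fetch_problem() is hypothetical; curl_exec(), curl_getinfo() with CURLINFO_HTTP_CODE, and curl_error() are the standard cURL functions.

```php
<?php
// Hypothetical helper: turns the outcome of curl_exec() into a message,
// or returns null when the page was fetched with HTTP 200.
function describe_fetch_problem($body, int $httpCode, string $curlError): ?string
{
    if ($body === false) {
        // Transport-level failure (bad URL, DNS, SSL, timeout, ...).
        return 'cURL error: ' . $curlError;
    }
    if ($httpCode !== 200) {
        // The request completed, but not with the status the script expects.
        return 'Unexpected HTTP status: ' . $httpCode;
    }
    return null;
}

// Usage with the variables from the posted script:
// $result  = curl_exec($curl);
// $problem = describe_fetch_problem(
//     $result,
//     (int) curl_getinfo($curl, CURLINFO_HTTP_CODE),
//     curl_error($curl)
// );
// if ($problem !== null) { echo $problem; }
```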

Re: Finding Banned Words On A Page And Not Within Other Word

Posted: Fri Oct 06, 2017 6:16 am
by UniqueIdeaMan
I was having a word-wrapping problem in Notepad++. Sorted now.
This edited code is working.

Code: Select all

<?php
/*
ERROR HANDLING
*/
// 1). Set banned words.
$banned_words = array("blow", "nut", "<span style='color:blue' title='I&#39;m naughty, are you naughty?'>smurf</span>");
// 2). $curl is going to be data type curl resource.
$curl = curl_init();
// 3). Set cURL options.
curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true );
// 4). Run cURL (execute http request).
$result = curl_exec($curl);
if (curl_errno($curl)) {
    echo 'Error:' . curl_error($curl);
}
$response = curl_getinfo( $curl );
if($response['http_code'] == '200' )
{
    $regex = '/\b';     
    $regex .= implode('\b|\b', $banned_words);   
    $regex .= '\b/i'; 
    $substitute = '****';
    $cleanresult = preg_replace($regex, $substitute, $result);
    echo $cleanresult;
}
curl_close($curl);
?>
Newbies can grab the original code here:
http://phpfiddle.org/main/code/0trx-6fng
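
A hedged note on the regex built with implode() in the code above: the banned words go into the pattern unescaped, so a word containing "/" (such as an HTML closing tag) or a quantifier character like "*" can end the pattern early or make it invalid. In that case preg_replace() returns null, and echoing null prints nothing, which is one possible way to end up with a blank page. The sketch below assumes PHP 7.4+; the function name censor_page() is made up for illustration.

```php
<?php
// Hypothetical helper: replaces whole-word matches of the banned words.
function censor_page(string $html, array $bannedWords, string $substitute = '****'): string
{
    if ($bannedWords === []) {
        return $html;   // nothing to filter
    }

    // Escape regex metacharacters and the "/" delimiter in every word.
    $quoted = array_map(fn($w) => preg_quote($w, '/'), $bannedWords);
    $regex  = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    $clean = preg_replace($regex, $substitute, $html);
    if ($clean === null) {
        // Fail loudly instead of echoing an empty page.
        throw new RuntimeException('preg_replace failed for pattern ' . $regex);
    }

    return $clean;
}
```

For example, censor_page('The cockerel saw a Cock.', ['cock']) leaves "cockerel" alone and replaces only the standalone word.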