strip_tags, but allow PHP tags

A1phanum3ric
Forum Newbie
Posts: 9
Joined: Tue May 30, 2006 6:26 am
Location: Torbay, UK

Post by A1phanum3ric »

I've searched this forum for an answer to the above question to no avail. My question is: how can I use strip_tags to strip all tags except PHP tags (and the code inside the tags)?

I've tried the following:

Code:

<?php
strip_tags($sString, '<?');
strip_tags($sString, '<?php');
strip_tags($sString, '<??>');
strip_tags($sString, '<?php?>');
?>
but strip_tags still strips the PHP tags and code :(

Can anyone point me in the right direction?

Cheers,

Ed.
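
A note on why none of these attempts can work: the PHP manual says that strip_tags always removes PHP tags and HTML comments, and that the allowed-tags argument cannot change this; it only whitelists HTML elements. A minimal sketch of that behaviour, assuming a made-up sample value for $sString:

```php
<?php
// Made-up sample input mixing HTML, a PHP block, and plain text.
$sString = '<b>Hello</b> <?php $i = "geek"; ?> world';

// The second argument only whitelists HTML element names such as <b>.
// Passing '<?php' does not keep PHP blocks; they are removed regardless.
$sAttempt = strip_tags($sString, '<?php');

// Whitelisting <b> keeps the bold tags, but the PHP block is still gone.
$sBoldKept = strip_tags($sString, '<b>');
```

In both results the PHP block and the code inside it are missing, which matches what the original post describes.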
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

strip_tags() probably isn't the right function for the job. It doesn't even do that good a job of cleaning user input for display. What is it for?

Post by A1phanum3ric »

Ah well, I'm kinda making a nice mess of this as I go on... but basically it's for a 'reply-to-blog/reply-to-code' function so clients can reply to my code blogs on my website using both BBCode tags ([url=* ]) and PHP BBCode tags ([php ][/php ]). Independently they both work fine (the BBCode tags and PHP BBCode tags), but when I try to format some text that uses something like:

Hello, my name is [b ]Ed[/b ].

Check out this code:

[php ]<?php $i="geek"; ?>[/php ]

I get problems because I'm stripping all HTML tags, then converting the BBCode to HTML as well as the PHP BBCode... HOWEVER, the PHP part gets stripped out when I use strip_tags, which removes every trace of the code snippet.

Hope you understand what I'm saying!

Ed.

Post by Ambush Commander »

Don't strip tags. htmlentity-ize them.
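
A minimal sketch of that suggestion, assuming htmlspecialchars is an acceptable stand-in for "htmlentity-izing" (htmlentities works the same way but converts more characters). The variable names and the sample reply are made up:

```php
<?php
// Hypothetical reply text, as a visitor might submit it.
$sReply = 'Hello, my name is [b]Ed[/b]. <script>alert(1)</script>';

// Convert &, <, > and both quote characters to HTML entities, so any
// markup in the reply is displayed as text rather than interpreted.
$sSafe = htmlspecialchars($sReply, ENT_QUOTES, 'UTF-8');

echo $sSafe, "\n";
```

The BBCode tags pass through untouched, so they can still be converted to HTML afterwards.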

Post by A1phanum3ric »

I forgot to mention that I'm also using highlight_string on the PHP BBCode part, so htmlentity-izing the tags will break it: highlight_string looks for <?php at the beginning of the string and ?> at the end.
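
One possible way around this, sketched under the assumption that the [php ] contents can be pulled out of the raw reply before anything is escaped: highlight_string escapes the code it is given, so the code chunks can go through highlight_string and only the surrounding text through htmlspecialchars. The sample code string is made up:

```php
<?php
// Hypothetical contents of one PHP BBCode block, taken from the raw
// (not yet escaped) reply text.
$sCode = '<?php $i = "geek"; echo "<b>$i</b>";';

// With true as the second argument, highlight_string returns the
// highlighted HTML as a string instead of printing it.
$sHighlighted = highlight_string($sCode, true);

// The returned markup contains the code with <, > and & already
// converted to entities, wrapped in colouring tags.
echo $sHighlighted, "\n";
```

The exact wrapper markup differs between PHP versions, so it is safer not to depend on it.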

Post by Ambush Commander »

Absolutely delightful.

Well, at this point, I would recommend creating a stack-based parser for the BBCode. The idea is to break the input into chunks of BBTags and Text. Text gets htmlentity-ized, while BBTags get converted into HTML tags.

Or you could try to find a BBCode package out there that does this...
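
The chunk-and-stack idea could look roughly like the sketch below. It is not the poster's code: the function name renderBBCode, the three-tag whitelist, and the choice of preg_split for chunking are all assumptions, and the tags are written without the extra space used in this thread.

```php
<?php
/**
 * Sketch of a chunk-and-stack BBCode renderer. The function name and the
 * tag whitelist are hypothetical; only [b], [i] and [u] are handled.
 */
function renderBBCode(string $sInput): string
{
    $aAllowed = ['b' => 'strong', 'i' => 'em', 'u' => 'u'];

    // Split into text chunks and tag chunks; the capture group keeps
    // the tags themselves in the result.
    $aChunks = preg_split('~(\[/?[a-z]+\])~i', $sInput, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $aStack = [];   // names of the currently open tags
    $sOut   = '';

    foreach ($aChunks as $sChunk) {
        if (preg_match('~^\[(/?)([a-z]+)\]$~i', $sChunk, $aMatch)
            && isset($aAllowed[strtolower($aMatch[2])])) {
            $sName = strtolower($aMatch[2]);
            if ($aMatch[1] === '') {
                // Opening tag: emit HTML and remember it on the stack.
                $aStack[] = $sName;
                $sOut .= '<' . $aAllowed[$sName] . '>';
                continue;
            }
            if (end($aStack) === $sName) {
                // Closing tag that matches the innermost open tag.
                array_pop($aStack);
                $sOut .= '</' . $aAllowed[$sName] . '>';
                continue;
            }
            // Mismatched closing tag: fall through and show it as text.
        }
        // Text, or an unknown or mismatched tag: escape it.
        $sOut .= htmlspecialchars($sChunk, ENT_QUOTES, 'UTF-8');
    }

    // Close anything the author left open, innermost first.
    while ($aStack) {
        $sOut .= '</' . $aAllowed[array_pop($aStack)] . '>';
    }
    return $sOut;
}

echo renderBBCode('Hello, my name is [b]Ed[/b]. <i>not html</i>'), "\n";
```

Because only whitelisted tags ever become HTML and everything else is escaped, no strip_tags call is needed at all.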

Post by A1phanum3ric »

Haha, I knew that'd throw a spanner in the works. Yeah, I'm currently working on breaking up the [php ][/php ] parts from the other text, sorting them out separately, then putting it all back together before stuffing it in a nice MySQL DB. Any tips on extracting chunks of strings based on the beginning and end of the string you wanna extract?

I was gonna do the explode("[php ]", $sString) way, but that gets too messy... Then I thought about actually searching for the beginning and end of the string using substr and stripos etc...

Any ideas would be cool, otherwise worry not I'll sort it!

Ed.
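
The substr/stripos approach mentioned above might be sketched like this. The helper name extractBetween and its signature are made up for illustration, and the markers are written without the extra space used in this thread:

```php
<?php
/**
 * Return the contents of every $sOpen ... $sClose pair in $sString.
 * Hypothetical helper; the name and signature are made up for this sketch.
 */
function extractBetween(string $sString, string $sOpen, string $sClose): array
{
    // Empty markers would match everywhere, so refuse them.
    if ($sOpen === '' || $sClose === '') {
        return [];
    }

    $aFound  = [];
    $iOffset = 0;

    // stripos returns false when there is no further opening marker.
    while (($iStart = stripos($sString, $sOpen, $iOffset)) !== false) {
        $iStart += strlen($sOpen);              // first char after the marker
        $iEnd = stripos($sString, $sClose, $iStart);
        if ($iEnd === false) {
            break;                              // unclosed block: stop here
        }
        $aFound[] = substr($sString, $iStart, $iEnd - $iStart);
        $iOffset  = $iEnd + strlen($sClose);    // continue after the closer
    }
    return $aFound;
}

$sReply = 'Check out this code: [php]<?php $i="geek"; ?>[/php] Neat, eh?';
$aCode  = extractBetween($sReply, '[php]', '[/php]');
```

Here $aCode holds the raw code between the markers, which could then be passed to highlight_string while the rest of the reply is escaped.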

Post by Ambush Commander »

Some code to get you thinking. It's for regular HTML, but it could be adapted for BBCode too.

Code:

<?php

/*
Forgivingly lexes SGML-style documents, i.e. HTML, XML, XHTML, you name it.

TODO:
 * Reread the XML spec and make sure I got everything right
 * Add support for CDATA sections
 * Have comments output with the leading and trailing --s
 * Optimize and benchmark
 * Check MF_Text behavior: shouldn't the info in there be raw (entities parsed?)

*/

class HTML_Lexer
{
    
    function nextQuote($string, $offset = 0) {
        $quotes = array('"', "'");
        return $this->next($string, $quotes, $offset);
    }
    
    function nextWhiteSpace($string, $offset = 0) {
        $spaces = array(chr(0x20), chr(0x9), chr(0xD), chr(0xA));
        return $this->next($string, $spaces, $offset);
    }
    
    function next($haystack, $needles, $offset = 0) {
        if (is_string($needles)) {
            $string_needles = $needles;
            $needles = array();
            $size = strlen($string_needles);
            for ($i = 0; $i < $size; $i++) {
                $needles[] = $string_needles[$i];
            }
        }
        $positions = array();
        foreach ($needles as $needle) {
            $position = strpos($haystack, $needle, $offset);
            if ($position !== false) {
                $positions[] = $position;
            }
        }
        return empty($positions) ? false : min($positions);
    }
    
    function tokenizeHTML($string) {
        
        // some quick checking (if empty, return empty)
        $string = (string) $string;
        if ($string == '') return array();
        
        $cursor = 0; // our location in the text
        $inside_tag = false; // whether or not we're parsing the inside of a tag
        $array = array(); // result array
        
        // infinite loop protection
        // has to be pretty big, since html docs can be big
        // we allow two hundred thousand tags... more than enough?
        $loops = 0;
        
        while(true) {
            
            // infinite loop protection
            if (++$loops > 200000) return array();
            
            $position_next_lt = strpos($string, '<', $cursor);
            $position_next_gt = strpos($string, '>', $cursor);
            
            // triggers on "<b>asdf</b>" but not "asdf <b></b>"
            if ($position_next_lt === $cursor) {
                $inside_tag = true;
                $cursor++;
            }
            
            if (!$inside_tag && $position_next_lt !== false) {
                // We are not inside tag and there still is another tag to parse
                $array[] = new MF_Text(html_entity_decode(substr($string, $cursor, $position_next_lt - $cursor)));
                $cursor  = $position_next_lt + 1;
                $inside_tag = true;
                continue;
            } elseif (!$inside_tag) {
                // We are not inside tag but there are no more tags
                // If we're already at the end, break
                if ($cursor === strlen($string)) break;
                // Create Text of rest of string
                $array[] = new MF_Text(html_entity_decode(substr($string, $cursor)));
                break;
            } elseif ($inside_tag && $position_next_gt !== false) {
                // We are in tag and it is well formed
                // Grab the internals of the tag
                $segment = substr($string, $cursor, $position_next_gt - $cursor);
                
                // Check if it's a comment
                if (substr($segment,0,3) == '!--' && substr($segment,strlen($segment)-2,2) == '--') {
                    $array[] = new MF_Comment(substr($segment,3,strlen($segment)-5));
                    $inside_tag = false;
                    $cursor = $position_next_gt + 1;
                    continue;
                }
                
                // Check if it's an end tag
                $is_end_tag = (strpos($segment,'/') === 0);
                if ($is_end_tag) {
                    $type = substr($segment, 1);
                    $array[] = new MF_EndTag($type);
                    $inside_tag = false;
                    $cursor = $position_next_gt + 1;
                    continue;
                }
                
                // Check if it is explicitly self closing, if so, remove
                // trailing slash. Remember, we could have a tag like <br>, so
                // any later token processing scripts must convert improperly
                // classified EmptyTags from StartTags.
                // Only the last character counts: an earlier slash (e.g. in
                // an href value) is not a self-close marker.
                $is_self_closing = (substr($segment, -1) === '/');
                if ($is_self_closing) {
                    $segment = substr($segment, 0, strlen($segment) - 1);
                }
                
                // Check if there are any attributes
                $position_first_space = $this->nextWhiteSpace($segment);
                if ($position_first_space === false) {
                    if ($is_self_closing) {
                        $array[] = new MF_EmptyTag($segment);
                    } else {
                        $array[] = new MF_StartTag($segment, array());
                    }
                    $inside_tag = false;
                    $cursor = $position_next_gt + 1;
                    continue;
                }
                
                // Grab out all the data
                $type = substr($segment, 0, $position_first_space);
                $attribute_string = trim(substr($segment, $position_first_space));
                $attributes = $this->tokenizeAttributeString($attribute_string);
                if ($is_self_closing) {
                    $array[] = new MF_EmptyTag($type, $attributes);
                } else {
                    $array[] = new MF_StartTag($type, $attributes);
                }
                $cursor = $position_next_gt + 1;
                $inside_tag = false;
                continue;
            } else {
                $array[] = new MF_Text('<' . html_entity_decode(substr($string, $cursor)));
                break;
            }
            break;
        }
        return $array;
    }
    
    function tokenizeAttributeString($string) {
        $string = (string) $string;
        if ($string == '') return array();
        $array = array();
        $cursor = 0;
        $in_value = false;
        $i = 0;
        $size = strlen($string);
        
        // if we have unquoted attributes, the parser expects a terminating
        // space, so let's guarantee that there's always a terminating space.
        $string .= ' ';
        
        // infinite loop protection
        $loops = 0;
        
        while(true) {
            
            // infinite loop protection
            // if we've looped 1000 times, abort. Nothing good can come of this 
            if (++$loops > 1000) return array();
            
            if ($cursor >= $size) {
                break;
            }
            $position_next_space = $this->nextWhiteSpace($string, $cursor);
            //scroll to the last whitespace before text
            while ($position_next_space === $cursor) {
                $cursor++;
                $position_next_space = $this->nextWhiteSpace($string, $cursor);
            }
            $position_next_equal = strpos($string, '=', $cursor);
            if ($position_next_equal !== false &&
                 ($position_next_equal < $position_next_space ||
                  $position_next_space === false)) {
                //attr="asdf"
                // grab the key
                $key = trim(substr($string, $cursor, $position_next_equal - $cursor));
                
                // set cursor right after the equal sign
                $cursor = $position_next_equal + 1;
                
                // consume all spaces after the equal sign
                $position_next_space = $this->nextWhiteSpace($string, $cursor);
                while ($position_next_space === $cursor) {
                    $cursor++;
                    $position_next_space = $this->nextWhiteSpace($string, $cursor);
                }
                
                // if we've hit the end, assign the key an empty value and abort
                if ($cursor >= $size) {
                    $array[$key] = '';
                    break;
                }
                
                // find the next quote
                $position_next_quote = $this->nextQuote($string, $cursor);
                
                // if the quote is not where the cursor is, we're dealing
                // with an unquoted attribute
                if ($position_next_quote !== $cursor) {
                    if ($key) {
                        $array[$key] = trim(substr($string, $cursor,
                          $position_next_space - $cursor));
                    }
                    $cursor = $position_next_space + 1;
                    continue;
                }
                
                // otherwise, regular attribute
                $quote = $string[$position_next_quote];
                $position_end_quote = strpos($string, $quote, $position_next_quote + 1);
                
                // check if the ending quote is missing
                if ($position_end_quote === false) {
                    // it is, assign it to the end of the string
                    $position_end_quote = $size;
                }
                
                $value = substr($string, $position_next_quote + 1,
                  $position_end_quote - $position_next_quote - 1);
                if ($key) {
                    $array[$key] = html_entity_decode($value);
                }
                $cursor = $position_end_quote + 1;
            } else {
                //boolattr
                if ($position_next_space === false) {
                    $position_next_space = $size;
                }
                $key = substr($string, $cursor, $position_next_space - $cursor);
                if ($key) {
                    $array[$key] = $key;
                }
                $cursor = $position_next_space + 1;
            }
        }
        return $array;
    }
    
}

// uses the PEAR class XML_HTMLSax3 to parse XML
//   only shares the tokenizeHTML() function
class HTML_Lexer_Sax extends HTML_Lexer
{
    
    var $tokens = array();
    
    function tokenizeHTML($html) {
        $this->tokens = array();
        $parser = new XML_HTMLSax3();
        $parser->set_object($this);
        $parser->set_element_handler('openHandler','closeHandler');
        $parser->set_data_handler('dataHandler');
        $parser->set_escape_handler('escapeHandler');
        $parser->set_option('XML_OPTION_ENTITIES_PARSED', 1);
        $parser->parse($html);
        return $this->tokens;
    }
    
    function openHandler(&$parser, $name, $attrs, $closed) {
        if ($closed) {
            $this->tokens[] = new MF_EmptyTag($name, $attrs);
        } else {
            $this->tokens[] = new MF_StartTag($name, $attrs);
        }
        return true;
    }
    
    function closeHandler(&$parser, $name) {
        // HTMLSax3 seems to always send empty tags an extra close tag
        // check and ignore if you see it:
        // [TESTME] to make sure it doesn't overreach
        if (is_a($this->tokens[count($this->tokens)-1], 'MF_EmptyTag')) {
            return true;
        }
        $this->tokens[] = new MF_EndTag($name);
        return true;
    }
    
    function dataHandler(&$parser, $data) {
        $this->tokens[] = new MF_Text($data);
        return true;
    }
    
    function escapeHandler(&$parser, $data) {
        if (strpos($data, '-') === 0) {
            $this->tokens[] = new MF_Comment($data);
        }
        return true;
    }
    
}

?>
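
The lexer above builds MF_Text, MF_Comment, MF_StartTag, MF_EndTag and MF_EmptyTag objects, but those classes are not shown in the post. To experiment with it, stand-ins along these lines might do. They are guesses: only the class names and constructor argument order come from the code above, while the property names and the MF_Token and MF_Tag base classes are invented here.

```php
<?php
// Hypothetical stand-ins for the MF_* token classes. The real ones are
// not shown in the post, so the property names and the two base classes
// (MF_Token, MF_Tag) are guesses.
class MF_Token {}

class MF_Text extends MF_Token
{
    public $data;
    public function __construct($data) { $this->data = $data; }
}

class MF_Comment extends MF_Text {}

class MF_Tag extends MF_Token
{
    public $name;
    public function __construct($name) { $this->name = $name; }
}

class MF_EndTag extends MF_Tag {}

class MF_StartTag extends MF_Tag
{
    public $attributes;
    public function __construct($name, $attributes = array())
    {
        parent::__construct($name);
        $this->attributes = $attributes;
    }
}

// The lexer sometimes passes only a name, hence the inherited default.
class MF_EmptyTag extends MF_StartTag {}
```

With these loaded before HTML_Lexer, tokenizeHTML() should have every class it instantiates available.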