
Extremely complicated regex/coding problem.

Posted: Mon Jul 16, 2007 9:23 pm
by lococobra
First an example:
Code: Select all

$code = 'Some html here <? echo\'PHP code always starts with <? or <?php, and ends with ?>\'; ?> more html';
Here's the problem: let's say this is the code for a web page and I'm trying to determine which parts are PHP and which parts are HTML. If I just split the string at every occurrence of <? or <?php or ?>, the tags that sit inside the quoted PHP string get treated as real tags and the block is cut apart in the wrong places.
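
To make the failure concrete, here is a minimal sketch of that naive split run on the example string. The pattern is only an illustration of "split at every tag" and is not code from the original post.

```php
<?php
// Naive approach: treat every opening or closing tag as a boundary,
// even when it sits inside a quoted PHP string.
$code = 'Some html here <? echo\'PHP code always starts with <? or <?php, and ends with ?>\'; ?> more html';

// Keep the delimiters so the cut points are visible in the result.
$pieces = preg_split('/(<\?php|<\?|\?>)/', $code, -1, PREG_SPLIT_DELIM_CAPTURE);

// The single PHP block comes back in many fragments instead of one piece.
print_r($pieces);
```

The string holds one PHP block, yet the split reports five tag boundaries, three of which are just text inside the echoed string.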

I highly doubt this could be done in a single regular expression, but multiple ones maybe. First step seems to be to detect where strings are in $code and ignore those areas, but then again, what if html contains something like...

Code: Select all

<form method="POST" action="<?php echo$_SERVER['PHP_SELF']?>">
As you can see, if all string areas are ignored, some valid php code may also be ignored.

I've gotten as far as developing the following code. However, even with my best attempt, I'm unable to produce the desired results.

Code: Select all

function findPHP($input){
	$pieces = preg_split('/(<\?.+?\?>)/s', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
	$revised_pieces = array();
	for($i=0;$i<count($pieces);$i++){
		$piece = $pieces[$i];
		$quotes = 0;
		preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);
		$revised_pieces[$i] = $piece;
		if (strpos($piece, '<?') === FALSE)
			continue;
		if ($quotes % 2) {
			list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
			$revised_pieces[$i] .= $before;
			$revised_pieces[$i+1] = $after;
			$i++;
		}
	}
	foreach($revised_pieces as $piece)
		if(strlen($piece)!=0)
			$output[] = $piece;
	return $output;
}

print_r(findPHP('html"<?php?>"html<?php"?><?"php?>html ?> end'));
?>
Output is:

Code: Select all

Array
(
    [0] => html"
    [1] => <?php?>
    [2] => "html
    [3] => <?php"?>
    [4] => <?"php?>html ?>
    [5] =>  end
)
Should be:

Code: Select all

Array
(
    [0] => html"
    [1] => <?php?>
    [2] => "html
    [3] => <?php"?><?"php?>
    [4] => html ?> end
)
The following string should contain every problem case that could be encountered:

Code: Select all

html"<?php?>"html<?php"?><?"php?>html ?> end
If you can build a function that turns that into the array shown under "Should be", I would consider you a God of PHP.

Posted: Mon Jul 16, 2007 10:02 pm
by Benjamin
This should give you a good head start. ;)

page.php

Code: Select all

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
  <title><?php echo $page_title; ?></title>
  <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
  <meta http-equiv="imagetoolbar" content="no" />
  <base href="<?php echo $base_url; ?>" />
  <link rel="stylesheet" type="text/css" href="css/style.css" media="screen" />
</head>
<body>
  <div id="container">
    <div class="gutter">
      <?php echo $something; ?>
      <?php echo $something_else; ?>
    </div>
  </div>
</body>
</html>
godofphp.php

Code: Select all

<?php
function godofphp($data)
{
    $return_data = array();

    preg_match_all('#<\?php.*\?>#im', $data, $php_pieces);
    $html_pieces = explode('<<<<php_code>>>', preg_replace('#<\?php.*\?>#im', '<<<<php_code>>>', $data));

    $reassembled_page           = array();
    $return_data['php_pieces']  = $php_pieces[0];
    $return_data['html_pieces'] = $html_pieces;
    
    $max_pieces = max(count($php_pieces[0]), count($html_pieces));
    
    if (preg_match('#^<\?php.*#im', $data))
    {
        for ($i = 0; $i < $max_pieces; $i++)
        {
            $reassembled_page[] = (count($php_pieces[0]) > 0) ? array_shift($php_pieces[0]) : '';
            $reassembled_page[] = (count($html_pieces) > 0) ? array_shift($html_pieces) : '';
        }
    } else {
        for ($i = 0; $i < $max_pieces; $i++)
        {
            $reassembled_page[] = (count($html_pieces) > 0) ? array_shift($html_pieces) : '';
            $reassembled_page[] = (count($php_pieces[0]) > 0) ? array_shift($php_pieces[0]) : '';
        }        
    }
    
    $return_data['reassembled_page'] = implode('', $reassembled_page);
    
    return $return_data;
}

echo print_r(godofphp(file_get_contents('page.php')), true);
Output:

Code: Select all

Array
(
    [php_pieces] => Array
        (
            [0] => <?php echo $page_title; ?>
            [1] => <?php echo $base_url; ?>
            [2] => <?php echo $something; ?>
            [3] => <?php echo $something_else; ?>
        )

    [html_pieces] => Array
        (
            [0] => <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
  <title>
            [1] => </title>

  <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
  <meta http-equiv="imagetoolbar" content="no" />
  <base href="
            [2] => " />
  <link rel="stylesheet" type="text/css" href="css/style.css" media="screen" />
</head>
<body>
  <div id="container">
    <div class="gutter">
      
            [3] => 
      
            [4] => 
    </div>

  </div>
</body>
</html>

        )

    [reassembled_page] => <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
  <title><?php echo $page_title; ?></title>
  <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
  <meta http-equiv="imagetoolbar" content="no" />

  <base href="<?php echo $base_url; ?>" />
  <link rel="stylesheet" type="text/css" href="css/style.css" media="screen" />
</head>
<body>
  <div id="container">
    <div class="gutter">
      <?php echo $something; ?>
      <?php echo $something_else; ?>
    </div>
  </div>

</body>
</html>

)

Posted: Mon Jul 16, 2007 10:11 pm
by lococobra
Unfortunately, no :(

Using your function on my test string yields the following results:

Code: Select all

Array
(
    [php_pieces] => Array
        (
            [0] => <?php?>"html<?php"?><?"php?>html ?>
        )

    [html_pieces] => Array
        (
            [0] => html"
            [1] =>  end
        )

    [reassembled_page] => html"<?php?>"html<?php"?><?"php?>html ?> end
)
Which really... misses the majority of the trouble areas.

Posted: Mon Jul 16, 2007 10:19 pm
by Benjamin
Tweaking the regex a little bit yielded the following results:

Code: Select all

Array
(
    [php_pieces] => Array
        (
            [0] => <?php?>
            [1] => <?php"?>
            [2] => <?"php?>
        )

    [html_pieces] => Array
        (
            [0] => html"
            [1] => "html
            [2] => 
            [3] => html ?> end
        )

    [reassembled_page] => html"<?php?>"html<?php"?><?"php?>html ?> end
)
You're going to have to write a lexical parser if that function is not sufficient.

http://en.wikipedia.org/wiki/Lexical_analysis
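
The tweaked pattern itself was not posted. As a guess only, a lazy quantifier with an optional "php" after the opening tag reproduces the output shown above; the pattern below is an assumption, not necessarily the change Benjamin made.

```php
<?php
// Assumed tweak: non-greedy ".*?" and an optional "php" suffix on the opening tag.
$data    = 'html"<?php?>"html<?php"?><?"php?>html ?> end';
$pattern = '#<\?(?:php)?.*?\?>#is';

// Collect the PHP pieces, then split on the same pattern to get the HTML pieces.
preg_match_all($pattern, $data, $matches);
$php_pieces  = $matches[0];
$html_pieces = preg_split($pattern, $data);

print_r($php_pieces);
print_r($html_pieces);
```

A pattern like this still knows nothing about quoted strings, which is why the second and third PHP pieces come out separately.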

Posted: Mon Jul 16, 2007 10:40 pm
by lococobra
Actually, I'm pretty sure it's possible with just regex.

The key is really a combination of a couple things...
• The first <? tag that's found is always valid
• All strings within php sections must be thrown out before continuing analysis.

So the process goes something like this:
1. Find first <? tag
2. Find whichever comes first, " or ' or ?>
3. If the result from step 2 was a quote, skip ahead to the matching closing quote so that nothing inside the string is processed, then go back to step 2.
4. If the result from step 2 was a closing php tag, consider this a section to be parsed; the next section is considered html.
5. Repeat from step 1 until there are no more php start tags to be found.
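
The steps above can be sketched roughly as follows. This is only one possible reading of them: splitPhpHtml() is a made-up name, a regex is used just to find the next quote or closing tag, and comments, heredocs and unterminated strings are deliberately ignored.

```php
<?php
// Hypothetical helper: splits $input into alternating HTML and PHP pieces.
function splitPhpHtml($input)
{
    $out = array();
    $pos = 0;
    $len = strlen($input);

    while ($pos < $len) {
        // Step 1: in HTML mode, the next opening tag is always valid.
        $start = strpos($input, '<?', $pos);
        if ($start === false) {
            $out[] = substr($input, $pos);      // no more PHP, the rest is HTML
            break;
        }
        if ($start > $pos) {
            $out[] = substr($input, $pos, $start - $pos);
        }

        // Step 2: look for whichever comes first, a quote or a closing tag.
        $i   = $start + 2;
        $end = $len;                            // an unclosed block runs to the end
        while ($i < $len && preg_match('/["\']|\?>/', $input, $m, PREG_OFFSET_CAPTURE, $i)) {
            list($token, $at) = $m[0];
            if ($token === '?>') {
                $end = $at + 2;                 // Step 4: real closing tag
                break;
            }
            // Step 3: skip the quoted string, honouring backslash escapes.
            $i = $at + 1;
            while ($i < $len && $input[$i] !== $token) {
                $i += ($input[$i] === '\\') ? 2 : 1;
            }
            $i++;                               // move past the closing quote
        }

        $out[] = substr($input, $start, $end - $start);
        $pos   = $end;                          // Step 5: back to HTML mode
    }
    return $out;
}

print_r(splitPhpHtml('html"<?php?>"html<?php"?><?"php?>html ?> end'));
```

On the test string this keeps the two tags that sit inside the quoted string within a single PHP piece, and leaves the stray closing tag near the end as HTML.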

I started writing something that did that, but I can't seem to pull it all together.

EDIT:

Even with the tweaks, the php part was still not correct. Should be:

Code: Select all

[php_pieces] => Array
        (
            [0] => <?php?>
            [1] => <?php"?><?"php?>
        )

Posted: Mon Jul 16, 2007 11:13 pm
by feyd
lococobra, you have a private message that needs to be read.