Parsing a template, need help on design

Posted: Wed Mar 22, 2006 1:54 am
by fastfingertips
Hello, I'm trying to create a template engine with some basic commands to use in my current application, but I'm not sure how to start parsing it. My template file may look like this:

Code: Select all

<html>
	<head>
		<title><<:title:>></title>
	</head>
	<body>
		<<:foreach (:user: as $key=>$value):>>
		<tr>
			<td><<:user:username:>></td>
			<td><<:user:password:>></td>
			<td><<:user:email:>></td>
		</tr>
		<<:endfor:>>
	</body>
</html>
As you may notice, I have simple values but also loop commands (like foreach). I do not know where to start parsing: should I start from the values and substitute them into the template, or start from the template tags and, whenever I encounter a special tag, look in the values for a match?

I'm also thinking of creating a command factory (so that I can implement more commands later, like for, if, while, etc.).
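
A command factory along those lines could be sketched roughly as below. This is only an illustration under assumptions: the names TemplateCommand, ForeachCommand and CommandFactory are hypothetical and do not come from any existing library, and the sketch assumes the template values are available in an array called $data.

```php
<?php
// Hypothetical interface: every template command compiles its arguments to PHP source.
interface TemplateCommand
{
    public function compile($arguments);
}

// Hypothetical command for tags such as "foreach (:user: as $key=>$value)".
class ForeachCommand implements TemplateCommand
{
    public function compile($arguments)
    {
        // Strip the surrounding parentheses, then pick out the three names.
        $arguments = trim($arguments, ' ()');
        if (!preg_match('/^:(\w+):\s+as\s+\$(\w+)\s*=>\s*\$(\w+)$/', $arguments, $m)) {
            throw new InvalidArgumentException('Bad foreach arguments: ' . $arguments);
        }
        // Assumption: template values live in an array named $data.
        return '<?php foreach ($data[\'' . $m[1] . '\'] as $' . $m[2] . ' => $' . $m[3] . '): ?>';
    }
}

// Maps a command name found in a tag to the class that handles it.
class CommandFactory
{
    private $classes = array();

    public function register($name, $className)
    {
        $this->classes[$name] = $className;
    }

    public function create($name)
    {
        if (!isset($this->classes[$name])) {
            throw new InvalidArgumentException('Unknown command: ' . $name);
        }
        $className = $this->classes[$name];
        return new $className();
    }
}

$factory = new CommandFactory();
$factory->register('foreach', 'ForeachCommand');
$php = $factory->create('foreach')->compile('(:user: as $key=>$value)');
echo $php, "\n";
```

Adding an if or while command would then mean writing one more class and registering it, without touching the parser itself.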

Posted: Wed Mar 22, 2006 2:05 am
by feyd
Although regex could be used, I'd recommend a string parser that transforms the template into code that can be run through eval() (be very careful here).
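
As a rough illustration of that idea (not anyone's actual implementation), the sketch below scans for the <<: and :>> delimiters with strpos(), rewrites each tag into PHP, and evaluates the result. The function names compileTemplate and renderTemplate are hypothetical. The sketch also assumes that a tag like user:username inside a loop over user refers to the current loop value, which the thread does not spell out. Tag contents are restricted to word characters so that nothing from the template can inject arbitrary code into eval().

```php
<?php
// Hypothetical helper: turns <<: ... :>> tags into plain PHP source.
function compileTemplate($template)
{
    $out = '';
    $pos = 0;
    $loops = array(); // looped collection name => name of its value variable

    while (($open = strpos($template, '<<:', $pos)) !== false) {
        $close = strpos($template, ':>>', $open + 3);
        if ($close === false) {
            throw new RuntimeException('Unclosed tag at offset ' . $open);
        }
        $out .= substr($template, $pos, $open - $pos); // literal text before the tag
        $tag = trim(substr($template, $open + 3, $close - $open - 3));
        $pos = $close + 3;

        if (preg_match('/^foreach\s*\(\s*:(\w+):\s+as\s+\$(\w+)\s*=>\s*\$(\w+)\s*\)$/', $tag, $m)) {
            $loops[$m[1]] = $m[3];
            $out .= '<?php foreach ($data[\'' . $m[1] . '\'] as $' . $m[2] . ' => $' . $m[3] . '): ?>';
        } elseif ($tag === 'endfor') {
            array_pop($loops);
            $out .= '<?php endforeach; ?>';
        } elseif (preg_match('/^(\w+)(?::(\w+))?$/', $tag, $m)) {
            if (isset($m[2]) && isset($loops[$m[1]])) {
                // Assumption: inside a loop, "user:username" means the current row.
                $expr = '$' . $loops[$m[1]] . "['" . $m[2] . "']";
            } elseif (isset($m[2])) {
                $expr = "\$data['" . $m[1] . "']['" . $m[2] . "']";
            } else {
                $expr = "\$data['" . $m[1] . "']";
            }
            $out .= '<?php echo htmlspecialchars(' . $expr . '); ?>';
        } else {
            throw new RuntimeException('Unknown tag: ' . $tag);
        }
    }
    return $out . substr($template, $pos); // trailing literal text
}

// Hypothetical helper: evaluates the compiled template and returns the output.
function renderTemplate($template, array $data)
{
    ob_start();
    eval('?>' . compileTemplate($template)); // $data is visible to the evaluated code
    return ob_get_clean();
}

$tpl  = '<b><<:title:>></b><<:foreach (:user: as $key=>$value):>>[<<:user:username:>>]<<:endfor:>>';
$data = array('title' => 'Welcome', 'user' => array(array('username' => 'Puiu'), array('username' => 'Mind')));
echo renderTemplate($tpl, $data), "\n"; // <b>Welcome</b>[Puiu][Mind]
```

The compiled PHP could also be cached to a file and included, which would avoid eval() altogether.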

Posted: Wed Mar 22, 2006 2:26 am
by fastfingertips
Since my commands will be limited, I will use regex (as you already helped me with that parsing problem) and I will run eval on the built string at the end.

At the moment I'm thinking, for example, of giving the parser an array that looks like this:

Code: Select all

$arrDetails = array(
	'title'=>'Welcome ',
	'user'=> array( 0 => array('username'=>'Puiu','password'=>'0007','email'=>'puiu@localhost.com'),
				1 => array('username'=>'Mind','password'=>'0002','email'=>'mind@localhost.com'),
				2 => array('username'=>'Lenuta','password'=>'0001','email'=>'lenuta@localhost.com'))
	);
How would you advise me to translate a command (for example, the foreach command from the template above)?

PS. I'm asking a lot of questions because I'm building a large application and I don't have the time for an in-depth analysis or for much testing. I cannot afford to spend too much time trying out new approaches, and I think your experience will help me a lot.

Posted: Wed Mar 22, 2006 2:27 am
by Chris Corbyn
I've never understood why people create templates that have such a "programming" feel to them. You've got logic in your template, which sort of defeats the point IMO. On top of that, you could have simply used PHP (a template language itself) to do that loop; this way there's just more to learn and remember.

When I create templates I do define those blocks that you've indicated need to be looped over, but I deal with all the logic in a controller ;)

Posted: Wed Mar 22, 2006 2:27 am
by fastfingertips
As you may notice, I'm using PHP's ability to write these loop instructions like this:

Code: Select all

foreach ($arrData as $key => $value) :
// do something
endforeach;
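
For illustration, here is how that alternative syntax might look when applied to data shaped like $arrDetails from earlier in the thread. The table markup and the $rows variable are only assumptions made for the sake of the example.

```php
<?php
// Sample data in the shape of $arrDetails from earlier in the thread.
$arrDetails = array(
    'user' => array(
        array('username' => 'Puiu', 'email' => 'puiu@localhost.com'),
        array('username' => 'Mind', 'email' => 'mind@localhost.com'),
    ),
);

// Collect the generated rows in a buffer so they can be reused or inspected.
ob_start();
foreach ($arrDetails['user'] as $key => $value) :
    echo '<tr><td>' . htmlspecialchars($value['username']) . '</td>'
       . '<td>' . htmlspecialchars($value['email']) . '</td></tr>' . "\n";
endforeach;
$rows = ob_get_clean();

echo $rows;
```

Note that a foreach opened with a colon has to be closed with endforeach; endfor only closes a for loop.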

Posted: Wed Mar 22, 2006 2:29 am
by feyd
If you don't have time to create a parser, why not use Smarty?

Posted: Wed Mar 22, 2006 2:29 am
by fastfingertips
Simply because if, for example, the designer needs to change the row style, he must be able to do it in the current page and not by opening another template file. That's why these basic commands are needed.

Posted: Wed Mar 22, 2006 2:32 am
by fastfingertips
Smarty comes with too many features and I do not need all of them. Also, because of what I have now, I cannot add Smarty to my project (sometimes we are limited by what we have :( )

Posted: Wed Mar 22, 2006 2:35 am
by Chris Corbyn
Well, if it's any help, here's the start of a generic parser I'm working on. It really needs some speed increases, but it's a base. It would be a lot faster if it weren't so generic. You use configuration files (that array) to define your token definitions. I'll be extending it to make a generic lexical analyzer too.

Code: Select all

class lexer
{
	private $source;
	private $tokenDefinitions = array();
	private $tokenTypes = array();
	private $tokenLength = 0;
	private $tokens = array();
	private $inertTokens = array();

	function __construct($source)
	{
		$this->source = $source;
	}

	public function addDefinitions($arr)
	{
		$this->tokenDefinitions = array_values($arr);
		$this->tokenTypes = array_keys($arr);
		foreach ($this->tokenTypes as $k => $t) $this->defineOnce($t, $k);
		$this->tokenLength = count($this->tokenDefinitions);
	}

	public function getTokenName($val)
	{
		if (isset($this->tokenTypes[$val])) return $this->tokenTypes[$val];
	}

	private function defineOnce($d, $val)
	{
		if (!defined($d)) define($d, $val);
	}

	public function tokenize($str=false, $pos=0)
	{
		if ($str === false) $str = $this->source; //Start
		if (!is_string($str) || $str === '') return false; //End of string (empty() would also stop on the string "0")
		
		$i = 0;
		
		foreach($this->tokenDefinitions as $type => $def)
		{
			$i++;
			
			if ($def[1] > 0)
			{
				preg_match($def[0], $str, $matches, PREG_OFFSET_CAPTURE);
				$tok = isset($matches[0]) ? $matches[0] : array(); //$matches is empty when the regex does not match
			}
			else
			{
				$strpos = strpos($str, $def[0]);
				if ($strpos !== false) $tok = array(
					substr($str, $strpos, strlen($def[0])),
					$strpos
				);
				else $tok = array();
			}
			
			//No tokens found in string (or at least not at the start)
			if ($i == $this->tokenLength && (!isset($tok[1]) || $tok[1] != 0))
			{
				$last_tok = $this->getLastToken();
				
				die('<strong>Fatal:</strong> Undefined token at offset '.$pos.' in source (Near <em>\' '.$last_tok.' \'</em>)<br />');
			}
			elseif(isset($tok[1]) && $tok[1] == 0) //Token found at start
			{
				$len = strlen($tok[0]);
				$substr = substr($str, $len);
				$this->tokens[] = array('token' => $tok[0], 'type' => $type, 'offset' => $pos);
				if (strlen($substr) > 0)
				{
					//Move along the string and go all over again
					$this->tokenize($substr, $pos+$len);
				}
				break;
			}
		}
	}

	public function setInertTokens($arr)
	{
		$this->inertTokens = $arr;
	}

	private function getLastToken()
	{
		$tmp = array();
		foreach ($this->tokens as $arr)
		{
			if (!in_array($arr['type'], $this->inertTokens))
			{
				$tmp[] = $arr;
			}
		}
		$tmp2 = array_pop($tmp);
		return $tmp2['token'];
	}

	public function getTokens()
	{
		return $this->tokens;
	}
	
	public function dump()
	{
		echo '<pre>'.print_r($this, 1).'</pre>';
	}
}

Code: Select all

//Types should be listed in order of precedence
// For example, look for strings before variables since a variable inside a string is not valid
$tokenTypes = array(
	
	'TK_ESCAPE_CHARACTER'		=> array('\\\\', 0),	
	'TK_DOUBLE_STRING'		=> array('@(?<!\\\\)".*?(?<!\\\\)"@s', 1),
	'TK_LITERAL_STRING'		=> array("@(?<!\\\\)'.*?(?<!\\\\)'@s", 1),
	'TK_COMMENT'			=> array('@(?<!\\\\)/\\*(.*?)\\*/|//.*?$|#.*?$@sm', 1),
	'TK_VARIABLE'			=> array('@\\$[a-z_]\w*@i', 1),
	'TK_CLASS'			=> array('@\bclass\b@i', 1),
	'TK_FUNCTION'			=> array('@\b(?:c)?function\b@i', 1),
	'TK_INTERFACE'			=> array('@\binterface\b@i', 1),
	'TK_ECHO'			=> array('@\becho\b@i', 1),
	'TK_PRINT'			=> array('@\bprint\b@i', 1),
	'TK_EXIT'			=> array('@\bexit\b@i', 1),
	'TK_DIE'				=> array('@\bdie\b@i', 1),
	'TK_OPEN_TAG_WITH_ECHO'		=> array('@<\?=@i', 1),
	'TK_OPEN_TAG'			=> array('@<\?(?:php)?@i', 1),
	'TK_CLOSE_TAG'			=> array('?>', 0),
	'TK_ARRAY_CAST'			=> array('@\([ \t]*array[ \t]*\)@i', 1),
	'TK_DOUBLE_CAST'		=> array('@\(\s*(?:double|float|real)\s*\)@i', 1),
	'TK_AND_EQUAL'			=> array('&=', 0),
	'TK_OBJECT_OPERATOR'		=> array('->', 0),
	'TK_DOUBLE_ARROW'		=> array('=>', 0),
	'TK_APPEND_OPERATOR'		=> array('.=', 0),
	'TK_NOT_IDENTICAL'		=> array('!==', 0), //Must come before != or !== is split into != and =
	'TK_NOT_EQUAL'			=> array('!=', 0),
	'TK_BOOLEAN_AND'			=> array('&&', 0),
	'TK_BOOLEAN_OR'			=> array('||', 0),
	'TK_INC'				=> array('++', 0),
	'TK_DEC'				=> array('--', 0),
	'TK_IS_IDENTICAL'		=> array('===', 0),
	'TK_IS_EQUAL'			=> array('==', 0),
	'TK_LESS_THAN_OR_EQUAL'		=> array('<=', 0),
	'TK_GREATER_THAN_OR_EQUAL'	=> array('>=', 0),
	'TK_BITWISE_LEFT_SHIFT'		=> array('<<', 0),
	'TK_BITWISE_RIGHT_SHIFT'		=> array('>>', 0),
	'TK_EQUALS'			=> array('=', 0),
	'TK_RIGHT_PAREN'			=> array(')', 0),
	'TK_LEFT_PAREN'			=> array('(', 0),
	'TK_COMMA'			=> array(',', 0),
	'TK_CONCAT_OPERATOR'		=> array('.', 0),
	'TK_GREATER_THAN'		=> array('>', 0),
	'TK_LESS_THAN'			=> array('<', 0),
	'TK_REFERENCE_OPERATOR'		=> array('&', 0),
	'TK_LEFT_BRACKET'		=> array('[', 0),
	'TK_RIGHT_BRACKET'		=> array(']', 0),
	'TK_COLON'			=> array(':', 0),
	'TK_SEMICOLON'			=> array(';', 0),
	'TK_NEGATION_OPERATOR'		=> array('!', 0),
	'TK_RIGHT_BRACE'			=> array('}', 0),
	'TK_LEFT_BRACE'			=> array('{', 0),
	'TK_PLUS'			=> array('+', 0),
	'TK_MINUS'			=> array('-', 0),
	'TK_HEX_NUMERAL'			=> array('@0x[a-f0-9]+@i', 1),
	'TK_DECIMAL_OR_FLOAT'		=> array('@\d+\.\d+@', 1),
	'TK_OCT_NUMERAL'			=> array('@0\d+@', 1),
	'TK_INTEGER_NUMERAL'		=> array('@\d+@', 1),
	'TK_IF'				=> array('@\bif\b@i', 1),
	'TK_ELSE'			=> array('@\belse\b@i', 1),
	'TK_ELSEIF'			=> array('@\belseif\b@i', 1),
	'TK_ARRAY'			=> array('@\barray\b@i', 1),
	'TK_AS'				=> array('@\bas\b@i', 1),
	'TK_PUBLIC'			=> array('@\bpublic\b@i', 1),
	'TK_PRIVATE'			=> array('@\bprivate\b@i', 1),
	'TK_PROTECTED'			=> array('@\bprotected\b@i', 1),
	'TK_VAR'				=> array('@\bvar\b@i', 1),
	'TK_STATIC'			=> array('@\bstatic\b@i', 1),
	'TK_EXTENDS'			=> array('@\bextends\b@i', 1),
	'TK_IMPLEMENTS'			=> array('@\bimplements\b@i', 1),
	'TK_CASE'			=> array('@\bcase\b@i', 1),
	'TK_WHITESPACE'			=> array('@\s+@', 1),
	'TK_UNQUOTED_STRING'		=> array('@\w+@', 1), //Class names, function names, constants (The lexer will deal with this)
	'TK_UNKNOWN'			=> array('@\W@', 1)
	
);

$lex = new lexer(file_get_contents('index.php'));
$lex->addDefinitions($tokenTypes);
$lex->setInertTokens(array(TK_WHITESPACE));
$lex->tokenize();

$tok = $lex->getTokens();
foreach ($tok as $k => $arr)
{
    if ($arr['type'] == TK_COMMENT || $arr['type'] == TK_WHITESPACE)
    {
        unset($tok[$k]);
    }
}

echo '<table cellpadding=3 border=1>
<tr>
    <td><b>Token Type</b></td>
    <td><b>Token</b></td>
    <td><b>Offset</b></td>
</tr>
';

foreach ($tok as $arr)
{
    echo '<tr>
    <td>'.$lex->getTokenName($arr['type']).'</td>
    <td><div style="overflow: hidden; width: 500px;">'.nl2br(htmlentities($arr['token'])).'</div></td>
    <td>'.$arr['offset'].'</td>
</tr>
';
}

echo '</table>';
In the array, the second element of each definition is a flag: "1" means the first element is a regex pattern, and "0" means it's a static string.
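
To make the two kinds of definition concrete, here is a small sketch that applies the same array(pattern, flag) shape to the <<: ... :>> template syntax discussed earlier. The token names and the matchAtStart() helper are hypothetical. Unlike the lexer above, which searches the whole string and then checks that the match sits at offset 0, this sketch anchors the match at the start directly (PCRE's A modifier for regexes, strncmp() for static strings).

```php
<?php
// Hypothetical definitions for the template delimiters, same shape as $tokenTypes.
$templateTokens = array(
    'TPL_OPEN'  => array('<<:', 0),   // 0: static string
    'TPL_CLOSE' => array(':>>', 0),   // 0: static string
    'TPL_NAME'  => array('@\w+@', 1), // 1: regex pattern
);

// Returns the text matched at the very start of $str, or null if nothing matches.
function matchAtStart(array $def, $str)
{
    if ($def[1] > 0) {
        // Regex definition: appending the A modifier anchors the match at offset 0.
        return preg_match($def[0] . 'A', $str, $m) ? $m[0] : null;
    }
    // Static definition: a plain prefix comparison is enough.
    return strncmp($str, $def[0], strlen($def[0])) === 0 ? $def[0] : null;
}

var_dump(matchAtStart($templateTokens['TPL_OPEN'], '<<:title:>>')); // string(3) "<<:"
var_dump(matchAtStart($templateTokens['TPL_NAME'], 'title:>>'));    // string(5) "title"
var_dump(matchAtStart($templateTokens['TPL_NAME'], ':>>'));         // NULL
```

Anchoring at the start avoids scanning the rest of the source for every definition, which may help with the speed concern mentioned above.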

EDIT | Updated code as per my updated version. Now approx 15% faster @ 6900 bytes, 70 token definitions. (Main change == using constants/numbers rather than strings)