Recursive HTML tabber

Posted: Sun Jun 17, 2007 12:25 pm
by superdezign
I wrote a function that traverses an HTML structure and tabs (indents) it. I figured it would be simple practice with recursion, but I've run into a problem. It tabs the HTML fine, but when a parent tag contains elements without closing tags (img, meta, br), or text alongside elements that do have closing tags, only the elements with both an opening and a closing tag remain in the output. I believe I could fix it by using str_replace instead of plain assignment, but the problem is determining what to pass to str_replace.
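To make the symptom concrete, here is a minimal reproduction. It is only a sketch: the pattern is a condensed, single-line stand-in for the pair-matching regex in FormatHTML(), and the sample input is made up for illustration.

```php
<?php
// Condensed stand-in for the pair-matching pattern: <tag ...>content</tag>
$pattern = '|(<((?!/)[^>\s]+).*?>)(.*?)(</\2>)|si';

// Hypothetical mixed content: text, an unclosed element (img), and a paired element (b)
$inner = 'Hello<img src="a.png">world <b>bold</b> bye';

preg_match_all($pattern, $inner, $matches);

// Only the paired element is captured; the text and the img tag appear
// in no match, so rebuilding the string from $matches alone drops them.
print_r($matches[0]); // contains a single entry: <b>bold</b>
```

Because the output is rebuilt purely from the matches, anything the pattern skips over (plain text, img, meta, br) never makes it back into the string.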

Here is the class. All of the action is in FormatHTML().

Code:

class CHTMLTabber
{
	const		__TAB__		= '    ';
	
	/**
	*	Shows tabbed html
	*	@param		str
	*	@param		return (boolean)
	*	@param		level
	*	@return		string or void
	*/
	public function TabHTML($str, $return = false, $level = 0)
	{
		// Format once, passing the starting level through
		$str		= self::FormatHTML($str, $level);
		
		if(!$return)
		{
			// $str is already formatted; only escape it for display
			echo '<pre>' . htmlspecialchars($str) . '</pre>';
			return;
		}
		
		return $str;
	}
	
	/**
	*	Add tabs to the HTML str
	*	@param		str
	*	@param		level
	*	@return		string
	*/
	protected function FormatHTML($str, $level = 0)
	{
		$str				= self::ClearNewlines($str);
		
		preg_match_all('|
			
			(
				<					# open HTML tag
				(
					(?!/)			# does not start with a slash
					[^>\s]+			# all contents up to a space or end of tag
				)
				.*?					# all contents to end of tag
				>					# close HTML tag
			)
			(
				.*?					# all contents to ending tag
			)
			(
				</					# open ending HTML tag
				(
					\2				# same as content of pattern 2 (HTML tag name)
				)
				>					# close ending HTML tag
			)
			
			|xsi', $str, $matches);
		
		/**
		*	REGEX EXPLANATION:
		*	
		*	$matches[0]			= entire HTML tag and content
		*		i.e. <html>content</html>
		*
		*	$matches[1]			= starting HTML tag
		*		i.e. <html>
		*
		*	$matches[2]			= HTML tag name
		*		i.e. html
		*
		*	$matches[3]			= contents
		*		i.e. content
		*
		*	$matches[4]			= closing HTML tag
		*		i.e. </html>
		*
		*	$matches[5]			= closing HTML tag name (currently unused)
		*		i.e. html
		*/
		
		// Create tabs for this level
		for($i = 0, $tabs = ''; $i < $level; $i++)
		{
			$tabs			.= self::__TAB__;
		}
		
		// Format HTML
		if(!empty($matches[3]))
		{
			$str			= '';
			
			foreach($matches[3] as $id => $content)
			{
				$content	= trim(self::FormatHTML($content, $level + 1));
				
				// Don't add extra newlines or tabs
				if($id >= 1)
				{
					$str	.= "\n" . $tabs;
				}
				
				$str		.= $matches[1][$id] . "\n";
				
				// Don't output anything for empty contents
				if(!empty($content))
				{
					$str	.= $tabs . self::__TAB__ . $content . "\n";
				}
				
				$str		.= $tabs . $matches[4][$id];
			}
		}
		
		return $str;
	}
	
	/**
	*	Removes newlines from a string
	*	@param		str
	*	@return		string
	*/
	protected function ClearNewlines($str)
	{
		return preg_replace('#(\r\n|\n)#s', '', $str);
	}
};
It's possible that I'm going in the wrong direction with this, so I'm open to any and all suggestions. The regex is described in a comment, so you don't have to decipher it yourself.
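One possible direction, sketched here under stated assumptions rather than as a drop-in fix: instead of matching open/close pairs, split the string into tags and text and track the depth while walking the tokens. The function name tabHtmlByTokens and the list of unclosed elements are hypothetical and not part of the class above. The sketch also assumes that no ">" appears inside attribute values or comments and that whitespace around text is not significant.

```php
<?php
// Hypothetical helper, not part of CHTMLTabber.
function tabHtmlByTokens($html, $tab = '    ')
{
    // Elements that never take a closing tag (partial list, an assumption)
    $void = array('img', 'meta', 'br', 'hr', 'input', 'link');

    // Split into tags and the text between them, keeping the tags
    $tokens = preg_split('/(<[^>]+>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $lines = array();
    $level = 0;

    foreach ($tokens as $token) {
        $token = trim($token);
        if ($token === '') {
            continue; // whitespace between tags
        }

        if (preg_match('#^</#', $token)) {
            // Closing tag: step back out before printing
            $level = max(0, $level - 1);
            $lines[] = str_repeat($tab, $level) . $token;
        } elseif (preg_match('#^<([a-z0-9]+)#i', $token, $m)) {
            // Opening or unclosed tag: print at the current depth
            $lines[] = str_repeat($tab, $level) . $token;
            $selfClosed = substr($token, -2) === '/>';
            if (!$selfClosed && !in_array(strtolower($m[1]), $void)) {
                $level++; // only paired tags open a new level
            }
        } else {
            // Text, comment, or doctype: print at the current depth
            $lines[] = str_repeat($tab, $level) . $token;
        }
    }

    return implode("\n", $lines);
}

echo tabHtmlByTokens('<div>Hello<img src="a.png"><b>bold</b> bye</div>');
```

With this approach, text and unclosed elements are ordinary tokens, so nothing is dropped when they share a parent with paired tags. It does not validate nesting, and content where whitespace matters (pre, textarea, script) would need special handling.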

Posted: Sun Jun 17, 2007 7:47 pm
by feyd
Have you looked at HTMLPurifier?