Parsing HTML - how to avoid tags?

feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

roughly, yes...

Potentially, the fastest to write (for me anyway) would be a preg_split() that breaks all the tags out (saving them in the resulting array), then analyzing each element in the array to determine its status (tag or not): ignore tags, process non-tags. However, I'm pretty sure a string parser would execute faster, if speed is more critical.
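
The preg_split() approach described above could be sketched roughly as follows. This is only an illustration, not code from the thread: the function name split_replace() and its parameters are hypothetical, and it does a plain str_replace() on the text chunks rather than anything smarter. It relies on PREG_SPLIT_DELIM_CAPTURE to keep the tags in the result array.

```php
<?php
// Hypothetical helper (not from the thread): split on tags, replace only in text chunks.
function split_replace(string $html, string $search, string $replacement): string
{
    // The capturing group plus PREG_SPLIT_DELIM_CAPTURE keeps the tags in the array.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        // Regex failure: return the input unchanged rather than guessing.
        return $html;
    }

    foreach ($parts as $i => $part) {
        // Text and captured tags alternate, so tags always sit at odd indexes.
        if ($i % 2 === 1) {
            continue; // ignore tags
        }
        $parts[$i] = str_replace($search, $replacement, $part); // process non-tags
    }

    return implode('', $parts);
}

echo split_replace('<a title="PHP">PHP</a> and more PHP', 'PHP', '<b>PHP</b>');
// prints: <a title="PHP"><b>PHP</b></a> and more <b>PHP</b>
```

Checking the array index instead of the first character of each element avoids misreading a text chunk that happens to start with a stray '<'.
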
redmonkey
Forum Regular
Posts: 836
Joined: Thu Dec 18, 2003 3:58 pm

Post by redmonkey »

My quick and dirty generic solution would be...

Code:

<?php
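// Replace $search with $replace in text only; anything inside <...> is left untouched.
function tagsafe_replace($search, $replace, $subject, $casesensitive = false)
{
	// Wrap the subject so leading and trailing text also sits between '>' and '<'.
	$subject = '>' . $subject . '<';
	// Escape regex metacharacters, including the '/' delimiter.
	$search = preg_quote($search, '/');
	
	// Add the 'i' modifier unless a case-sensitive match was requested.
	$cs = !$casesensitive ? 'i' : '';
	
	// Collect every text run (between '>' and '<') that contains the search term.
	preg_match_all('/>[^<]*(' . $search . ')[^<]*</' . $cs, $subject, $matches, PREG_PATTERN_ORDER);
	
	foreach($matches[0] as $match)
	{
		// Replace inside the text run only, then swap the run back into the subject.
		$tmp     = preg_replace('/(' . $search . ')/' . $cs, $replace, $match);
		$subject = str_replace($match, $tmp, $subject);
	}

	// Strip the '>' and '<' sentinels added above.
	return substr($subject, 1, -1);
}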

// example.....

$link['PHP']       = 'PHP: Hypertext Preprocessor';
$link['HTML']      = 'HTML: Hypertext Markup Language';
$link['FAQ']       = 'Frequently Asked Questions';
$link['describer'] = 'This script - used to automatically explain terms on web page... as this one ;)';

$source = '<script type="text/javascript" src="overlib.js"><!-- overLIB (c) Erik Bosrup --></script>
<div id="overDiv" style="position:absolute; visibility:hidden; z-index:1000;"></div>
This is normal HTML page with PHP extension made by Avram<br><br>
Here you can see our FAQ:<br>
Q: What is "describer"?<br>
A: A tool which is used to automatically describe words in text.<br>

<br>
End of FAQ';

$mover = '<a onmouseover="return overlib(\'';
$mout  = '\');" onmouseout="return nd();">';

foreach ($link as $word => $hover)
{
	$source = tagsafe_replace($word, $mover . $hover . $mout . '<b>$1</b></a>', $source, true);
}

echo $source;

?>

Depending on the size and complexity of the content you are parsing, a more specific function may be better or more efficient.

The regex used is not bulletproof either; I know there are certain circumstances where the search string is not picked up.

The resulting output from above is...

Code:

<script type="text/javascript" src="overlib.js"><!-- overLIB (c) Erik Bosrup --></script>
<div id="overDiv" style="position:absolute; visibility:hidden; z-index:1000;"></div>
This is normal <a onmouseover="return overlib('HTML: Hypertext Markup Language');" onmouseout="return nd();"><b>HTML</b></a> page with <a onmouseover="return overlib('PHP: Hypertext Preprocessor');" onmouseout="return nd();"><b>PHP</b></a> extension made by Avram<br><br>
Here you can see our <a onmouseover="return overlib('Frequently Asked Questions');" onmouseout="return nd();"><b>FAQ</b></a>:<br>
Q: What is "<a onmouseover="return overlib('This script - used to automatically explain terms on web page... as this one ;)');" onmouseout="return nd();"><b>describer</b></a>"?<br>
A: A tool which is used to automatically describe words in text.<br>

<br>
End of <a onmouseover="return overlib('Frequently Asked Questions');" onmouseout="return nd();"><b>FAQ</b></a>
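
On the caveat that the regex is not bulletproof: one case that can be constructed (it may or may not be the one meant above) is text containing a literal, unencoded '<'. The pattern requires the search term to sit between a '>' and the next '<' with no other '<' in between, so a stray '<' before the term hides it. The sketch below uses the same pattern shape with 'PHP' as an assumed search term and wraps each sample in the '>' and '<' sentinels the way tagsafe_replace() does.

```php
<?php
// Same pattern shape as in tagsafe_replace(), with 'PHP' as the (assumed) search term.
$pattern = '/>[^<]*(PHP)[^<]*</';

// Sample text wrapped in the '>' ... '<' sentinels, as the function does internally.
$picked_up = '>' . 'I like PHP<br>' . '<';
$missed    = '>' . 'if 1 < 2 then use PHP' . '<';

var_dump(preg_match($pattern, $picked_up)); // int(1): the term is found
var_dump(preg_match($pattern, $missed));    // int(0): the stray '<' blocks the match
```

If the source writes such characters as &lt; the text run contains no '<' and the term is found again.
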
Avram
Forum Newbie
Posts: 7
Joined: Tue Oct 11, 2005 5:45 pm
Location: Mladenovac, SCG

Post by Avram »

It seems that this is the solution ;)

Thank you very much!