Parsing HTML - how to avoid tags?
Okay, here is the problem:
Before sending any output to the page, I call ob_start(); once the page has been generated, I take the buffered output and "parse" it, because I need to "explain" some terms (take-a-look). So far I have been "parsing" the buffer with a simple str_replace. The trouble is that this also rewrites links and other tags within the page. My question is: how can I avoid "touching" HTML tags?
E.g. I need to explain the term "xhtml". Every occurrence of "xhtml" gets "parsed" (replaced) with e.g. "<a title='eXtensible HyperText Markup Language'>xhtml</a>"... but if I have "<a href='http://super-xhtml-site.com'>link</a>", my current "parsing" method produces "<a href='http://super-<a title='eXtensible HyperText Markup Language'>xhtml</a>-site.com'>link</a>", and you can see what will happen...
How do I avoid that? (Don't tell me to watch which addresses I place in links, because eventually all users may be able to post text and links.)
Thanks in advance!
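To make the breakage concrete, here is a minimal reproduction of the setup described in the question. The page content is just the example link from the question plus one plain-text mention of the term; the variable names are made up for illustration.

```php
<?php
// Capture the page output in a buffer, as described in the question.
ob_start();
echo "<a href='http://super-xhtml-site.com'>link</a> explains xhtml";
$buffer = ob_get_clean();

// Naive replacement: str_replace has no idea where tags start or end,
// so it also rewrites the "xhtml" inside the href attribute.
$naive = str_replace(
    'xhtml',
    "<a title='eXtensible HyperText Markup Language'>xhtml</a>",
    $buffer
);

echo $naive, "\n";
```

Both occurrences get wrapped, including the one inside the href, which is exactly the broken markup shown in the question.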
this belongs in regex.
This should make sure "xhtml" isn't inside a tag already.
Code:
$txt = preg_replace('/xhtml(?>![^<]+?>)/i','<yourinfotag>xhtml</yourinfotag>',$txt);
I think that's right. Untested.
Doesn't work. I tested it with this (I want to test before implementing it in my app, just to make sure):
Code:
<?php
ob_start();
?>
<a href='blahblah'>test</a> - blah
<?php
$txt = ob_get_contents();
ob_end_clean();
$txt = preg_replace('/blah(?>![^<]+?>)/i','<b>blah</b>',$txt);
echo $txt;
?>
I use the Regex Coach whenever I get stuck with regular expressions. It's a really handy tool that lets you see exactly what your pattern is doing, if anything. Regexes can be a real pain if you don't use them 24/7. They're pretty damn powerful, nevertheless.
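A likely reason the pattern in the test script above never matches: in PCRE, `(?>` opens an atomic group, not a lookahead, so `(?>![^<]+?>)` demands a literal `!` right after the term. The negative lookahead that was probably intended is written `(?!...)`. Below is a minimal sketch of that idea. The helper name `wrap_outside_tags()` and its parameters are hypothetical and not from this thread. It assumes reasonably well-formed markup and no stray `>` characters in the text or inside attribute values.

```php
<?php
// Hypothetical helper (not from the thread): wrap every occurrence of $term
// that is NOT inside an HTML tag with $open ... $close.
function wrap_outside_tags(string $html, string $term, string $open, string $close): string
{
    // (?![^<>]*>) is a negative lookahead: reject the match if the next
    // angle bracket after it is '>', i.e. we are still inside a tag.
    $pattern = '/' . preg_quote($term, '/') . '(?![^<>]*>)/i';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($open, $close): string {
            // $m[0] is the matched text, so the original casing is kept.
            return $open . $m[0] . $close;
        },
        $html
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

// Same sample as the test script in the thread.
$txt = "<a href='blahblah'>test</a> - blah";
echo wrap_outside_tags($txt, 'blah', '<b>', '</b>'), "\n";
```

With that sample, only the trailing "blah" is wrapped and the href is left alone. The same check also skips the word "html" inside an `<html>` tag, since the next angle bracket after it is `>`.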
feyd: Please explain how I can do that without regex.
Nathaniel: That's the last resort; if I can't solve the problem some other way, I'll use it, because for now I want it to look like this (using overlib). But what if I want to explain "html"? What will happen with the <html> tag?
edit: But even if I use acronym... when the term appears in the target URL of an <a> tag, the URL will get mangled, so this is not the right way to do it.
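On the question of doing this without regex: one possible approach is to walk the buffer with plain string functions, copy everything between `<` and `>` verbatim, and run `str_ireplace` only on the text in between. The sketch below assumes that approach. The function name `replace_outside_tags()` is hypothetical. It does not handle a `>` inside attribute values, HTML comments, or script blocks, and like the original str_replace it swaps in fixed replacement text, so the original casing of the term is lost.

```php
<?php
// Hypothetical helper (not from the thread): case-insensitively replace
// $search with $replace, but only in text that sits outside <...> tags.
function replace_outside_tags(string $html, string $search, string $replace): string
{
    $out = '';
    $pos = 0;
    $len = strlen($html);

    while ($pos < $len) {
        $open = strpos($html, '<', $pos);
        if ($open === false) {
            // No more tags: everything left is plain text.
            $out .= str_ireplace($search, $replace, substr($html, $pos));
            break;
        }

        // Text before the next tag gets the replacement.
        $out .= str_ireplace($search, $replace, substr($html, $pos, $open - $pos));

        $close = strpos($html, '>', $open);
        if ($close === false) {
            // Unterminated tag: copy the remainder untouched.
            $out .= substr($html, $open);
            break;
        }

        // Copy the tag itself verbatim, attributes and all.
        $out .= substr($html, $open, $close - $open + 1);
        $pos = $close + 1;
    }

    return $out;
}

$page = "<a href='http://super-xhtml-site.com'>link</a> is about XHTML";
echo replace_outside_tags(
    $page,
    'xhtml',
    "<a title='eXtensible HyperText Markup Language'>xhtml</a>"
), "\n";
```

Because tags are copied as-is, explaining "html" this way leaves an `<html>` tag untouched, and a term inside an href is never rewritten.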
Code:
$txt = str_ireplace(' XHTML ', ' <a title="eXtensible HyperText Markup Language">xhtml</a> ', $txt);
Code:
<?php
$string = ' XHTML. osaudnfosjdnfon XHTML: asdasdad http://www.XHTML-boo.com asdasd xhtmlasdasdasd XhTmL';
echo preg_replace('/\s+(XHTML[\s\.:]?)/i', '<a title="eXtensible HyperText Markup Language">$1</a>', $string);
?>
Output:
<a title="eXtensible HyperText Markup Language">XHTML.</a> osaudnfosjdnfon<a title="eXtensible HyperText Markup Language">XHTML:</a> asdasdad http://www.XHTML-boo.com asdasd<a title="eXtensible HyperText Markup Language">xhtml</a>asdasdasd<a title="eXtensible HyperText Markup Language">XhTmL</a>
EDIT: My regex isn't spot on, but that gives a starting point I guess.
Last edited by Jenk on Thu Oct 13, 2005 7:58 am, edited 1 time in total.
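As a possible refinement of that starting point: the `\s+` pattern swallows the whitespace before each term, pulls trailing punctuation into the link, and still matches the start of "xhtmlasdasdasd". A sketch that swaps in lookbehind/lookahead assertions instead is below. The character classes are my own guess at what counts as "part of a word or URL", so adjust them to taste. On its own this pattern does nothing about terms inside tags.

```php
<?php
// Same sample string as in the post above.
$string = ' XHTML. osaudnfosjdnfon XHTML: asdasdad http://www.XHTML-boo.com asdasd xhtmlasdasdasd XhTmL';

// (?<![\w.\/-])  the term must not be preceded by a word char, '.', '/' or '-'
// (?![\w-])      and must not be followed by a word char or '-'
// Assertions consume nothing, so spaces and punctuation stay outside the link.
$pattern = '/(?<![\w.\/-])XHTML(?![\w-])/i';

$result = preg_replace(
    $pattern,
    '<a title="eXtensible HyperText Markup Language">$0</a>',
    $string
);

echo $result, "\n";
```

On the sample string this links the three standalone terms, keeps the "." and ":" outside the links, and leaves both the URL and "xhtmlasdasdasd" alone.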