
Parsing HTML - how to avoid tags?

Posted: Tue Oct 11, 2005 5:55 pm
by Avram
Okay, here is the problem:

Before any output on the page, I call ob_start(); ... when the page loads, I take the output from the buffer and "parse" it, 'cause I need to "explain" some terms (take-a-look). So far I have been "parsing" the buffer with a simple str_replace, but this messes up links and other tags within the page. My question is: how do I avoid "touching" HTML tags?

E.g. I need to explain the term "xhtml". Every occurrence of "xhtml" gets "parsed" (replaced) with e.g. "<a title='eXtensible HyperText Markup Language'>xhtml</a>"... but if I have "<a href='http://super-xhtml-site.com'>link</a>", then my current "parsing" method will make: "<a href='http://super-<a title='eXtensible HyperText Markup Language'>xhtml</a>-site.com'>link</a>" <- and you see what will happen...
How do I avoid that? (Don't tell me to watch which addresses I place in links, 'cause maybe all users will be able to post some text (and links)).

Thanks in advance!

Posted: Tue Oct 11, 2005 6:16 pm
by Skara
this belongs in regex.

This should make sure "xhtml" isn't inside a tag already.

Code:

$txt = preg_replace('/xhtml(?>![^<]+?>)/i','<yourinfotag>xhtml</yourinfotag>',$txt);
I think that's right. Untested.

Posted: Wed Oct 12, 2005 11:33 am
by Avram
Tested with this (I want to test before implementing it in my app, just to make sure):

Code:

<?php
ob_start();
?>

<a href='blahblah'>test</a> - blah

<?php

$txt = ob_get_contents();
ob_end_clean();

$txt = preg_replace('/blah(?>![^<]+?>)/i','<b>blah</b>',$txt);

echo $txt;

?>
doesn't work :(
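
A note on why the pattern matches nothing: in PCRE, `(?>...)` opens an atomic group, so `(?>![^<]+?>)` demands a literal `!` right after the term. Negative lookahead is written `(?!...)`. Below is a minimal sketch of the lookahead version, assuming the markup never has a bare `>` in plain text or inside an attribute value (the `$result` variable is only for illustration):

```php
<?php
$txt = "<a href='blahblah'>test</a> - blah";

// (?![^<>]*>) rejects a match when the next angle bracket after it is '>',
// which means the match sits inside a tag.
$result = preg_replace('/blah(?![^<>]*>)/i', '<b>$0</b>', $txt);

echo $result;
// prints: <a href='blahblah'>test</a> - <b>blah</b>
```

This still replaces the term inside link text and inside script or style blocks, so it is a partial fix at best.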

Posted: Wed Oct 12, 2005 12:02 pm
by foobar
I use the Regex Coach whenever I get stuck with regular expressions. It's a really handy tool and lets you see exactly what your pattern is doing, if anything. Regexes can be a real pain if you don't use them 24/7. They're pretty damn powerful, nevertheless.

Posted: Wed Oct 12, 2005 12:33 pm
by Avram
Okay, I'll try this, but meanwhile, if SOMEONE does know how to solve my problem, PLEASE post the solution here!

Posted: Wed Oct 12, 2005 3:44 pm
by Avram
Oh, I just cannot reach the solution. It's so confusing... Help me please. Anyone!?

Posted: Wed Oct 12, 2005 4:36 pm
by feyd
regex is not a really viable solution for this... especially when fairly simple string parsing would do this without a hell of a lot of pain, depending on the level of intelligence it needs to have.

Posted: Wed Oct 12, 2005 4:50 pm
by Nathaniel
I'd replace XHTML with <acronym title="eXtensible HyperText Markup Language">XHTML</acronym>... then you could have your <a> tags around your result and still validate. If I remember the order of the tags correctly.

Posted: Wed Oct 12, 2005 5:15 pm
by Avram
feyd: Please explain how I can do that w/out RegEx?
Nathaniel: That's the last resort. If I can't solve the problem some other way, I'll use it, because for now I want it to look like this (to use overlib). But what if I want to explain "html"? What will happen to the <html> tag? :(

edit: but even if I use acronym... when the term is in the target URL of the <a> tag... the URL will get messed up... so this is not the right way to do it :(

Posted: Wed Oct 12, 2005 8:36 pm
by Jenk

Code:

$txt = str_ireplace(' XHTML ', ' <a title="eXtensible HyperText Markup Language">xhtml</a> ', $txt);
That would get all instances of XHTML with spaces on either side.

Posted: Wed Oct 12, 2005 8:55 pm
by Nathaniel
Lol @ Jenk. Why do we programmers always save the simplest method for last? ;)

Posted: Thu Oct 13, 2005 4:25 am
by Avram
okay... but what if there is "XHTML:" or "XHTML." (last word in a sentence)... I think I'm going to add a replace for each case :))

btw. I don't have str_ireplace (php5 only) so I used eregi_replace....

Posted: Thu Oct 13, 2005 7:19 am
by Jenk

Code:

<?php

$string = ' XHTML. osaudnfosjdnfon XHTML: asdasdad http://www.XHTML-boo.com asdasd xhtmlasdasdasd XhTmL';

echo preg_replace('/\s+(XHTML[\s\.:]?)/i', '<a title="eXtensible HyperText Markup Language">$1</a>', $string);

?>
Outputs:

Code:

<a title="eXtensible HyperText Markup Language">XHTML.</a> osaudnfosjdnfon<a title="eXtensible HyperText Markup Language">XHTML:</a> asdasdad http://www.XHTML-boo.com asdasd<a title="eXtensible HyperText Markup Language">xhtml</a>asdasdasd<a title="eXtensible HyperText Markup Language">XhTmL</a>
:)

EDIT: My regex isn't spot on, but that gives a starting point I guess.
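
One way to tighten this up, as a sketch only: lookarounds test the neighbouring characters without consuming them, so the leading whitespace survives, trailing punctuation stays outside the link, and "xhtmlasdasdasd" is left alone. The punctuation set `[.:,;!?]` and the `$pattern` and `$result` names are assumptions chosen for illustration.

```php
<?php
$string = ' XHTML. osaudnfosjdnfon XHTML: asdasdad http://www.XHTML-boo.com asdasd xhtmlasdasdasd XhTmL';

// (?<!\S) : preceded by whitespace or the start of the string
// (?=...) : followed by optional punctuation, then whitespace or the end
$pattern = '/(?<!\S)xhtml(?=[.:,;!?]?(?:\s|$))/i';

// $0 is the whole match, so the original casing is kept.
$result = preg_replace($pattern, '<a title="eXtensible HyperText Markup Language">$0</a>', $string);

echo $result;
```

This pattern knows nothing about tags, so a term surrounded by spaces inside an attribute value (e.g. title="about xhtml now") would still be replaced.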

Posted: Thu Oct 13, 2005 7:49 am
by feyd
sadly, Jenk's idea, although clever, isn't bulletproof.. unfortunately, I don't have the luxury of enough time to write up a string parser to do this correctly...

Posted: Thu Oct 13, 2005 8:00 am
by Jenk
Am I correct in guessing a tokeniser is needed? :)
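
A tokeniser in the loosest sense may be enough. The sketch below is an illustration, not code from anyone in this thread: it walks the buffer with strpos(), copies everything between `<` and `>` verbatim, and hands only the text in between to a callback. The function names `replace_outside_tags` and `explain_xhtml` are hypothetical, and the sketch assumes no `>` inside attribute values and ignores comments, script and style blocks.

```php
<?php
// Hypothetical helper: applies $callback to the text between tags only.
function replace_outside_tags($html, $callback)
{
    $out = '';
    $pos = 0;
    $len = strlen($html);

    while ($pos < $len) {
        $open = strpos($html, '<', $pos);
        if ($open === false) {
            // No more tags: the rest is plain text.
            $out .= call_user_func($callback, substr($html, $pos));
            break;
        }
        if ($open > $pos) {
            // Plain text in front of the next tag.
            $out .= call_user_func($callback, substr($html, $pos, $open - $pos));
        }
        $close = strpos($html, '>', $open);
        if ($close === false) {
            // Unterminated tag: copy the remainder untouched.
            $out .= substr($html, $open);
            break;
        }
        // Copy the tag itself verbatim, attributes included.
        $out .= substr($html, $open, $close - $open + 1);
        $pos = $close + 1;
    }

    return $out;
}

// Hypothetical callback: wraps the whole word "xhtml" in an explaining link.
function explain_xhtml($text)
{
    return preg_replace('/\bxhtml\b/i', '<a title="eXtensible HyperText Markup Language">$0</a>', $text);
}

$html = "<a href='http://super-xhtml-site.com'>link</a> about xhtml";
echo replace_outside_tags($html, 'explain_xhtml');
// prints: <a href='http://super-xhtml-site.com'>link</a> about <a title="eXtensible HyperText Markup Language">xhtml</a>
```

Because tags are copied untouched, explaining "html" would leave the <html> tag itself alone. Text that already sits inside a link is still passed to the callback, so nested links remain possible unless the walker also tracks open <a> elements.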