Parsing HTML - how to avoid tags?

Avram
Forum Newbie
Posts: 7
Joined: Tue Oct 11, 2005 5:45 pm
Location: Mladenovac, SCG

Post by Avram »

Okay, here is the problem:

Before any output on the page, I call ob_start(). Once the page has been generated, I take the buffered output and "parse" it, because I need to "explain" some terms (take a look). So far I have been "parsing" the buffer with a simple str_replace, but that messes up links and other tags within the page. My question is: how do I avoid touching the HTML tags?

For example, I need to explain the term "xhtml". Every occurrence of "xhtml" gets replaced with e.g. "<a title='eXtensible HyperText Markup Language'>xhtml</a>". But if I have "<a href='http://super-xhtml-site.com'>link</a>", my current method produces "<a href='http://super-<a title='eXtensible HyperText Markup Language'>xhtml</a>-site.com'>link</a>", and you can see what happens...
How do I avoid that? (Don't tell me to watch which addresses I put in links, because eventually all users may be able to post text and links.)

Thanks in advance!
Skara
Forum Regular
Posts: 703
Joined: Sat Mar 12, 2005 7:13 pm
Location: US

Post by Skara »

This belongs in the regex forum.

This should make sure "xhtml" isn't inside a tag already.

Code:

$txt = preg_replace('/xhtml(?>![^<]+?>)/i','<yourinfotag>xhtml</yourinfotag>',$txt);
I think that's right. Untested.
Avram
Forum Newbie
Posts: 7
Joined: Tue Oct 11, 2005 5:45 pm
Location: Mladenovac, SCG

Post by Avram »

Tested with this (I want to try it before putting it into my app, just to make sure):

Code:

<?php
ob_start();
?>

<a href='blahblah'>test</a> - blah

<?php

$txt = ob_get_contents();
ob_end_clean();

$txt = preg_replace('/blah(?>![^<]+?>)/i','<b>blah</b>',$txt);

echo $txt;

?>
doesn't work :(
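A likely reason the pattern above matches nothing: in PCRE, "(?>" opens an atomic group, so "(?>![^<]+?>)" requires a literal "!" right after the term. The negative lookahead syntax is "(?!". Below is a minimal sketch of the same test with that one change, assuming the page contains no literal ">" in its text and none inside attribute values. The "$0" in the replacement stands for the matched text.

```php
<?php
// Sample markup: the term appears inside an attribute and in plain text.
$txt = "<a href='blahblah'>test</a> - blah";

// (?![^<>]*>) is a negative lookahead: the match is rejected when a '>'
// follows before any '<', which suggests we are inside a tag.
$out = preg_replace('/blah(?![^<>]*>)/i', '<b>$0</b>', $txt);

echo $out;
```

With this input, only the last "blah" should end up wrapped in <b>; the two inside the href are left alone. Plain text such as "blah 3 > 2" would be skipped as well, so this is a heuristic rather than a real parser.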
foobar
Forum Regular
Posts: 613
Joined: Wed Sep 28, 2005 10:08 am

Post by foobar »

I use the Regex Coach whenever I get stuck with regular expressions. It's a really handy tool that lets you see exactly what your pattern is doing, if anything. Regexes can be a real pain if you don't use them 24/7. They're pretty damn powerful, nevertheless.
Avram
Forum Newbie
Posts: 7
Joined: Tue Oct 11, 2005 5:45 pm
Location: Mladenovac, SCG

Post by Avram »

Okay, I'll try this, but meanwhile, if SOMEONE does know how to solve my problem, PLEASE post the solution here!
Avram
Forum Newbie
Posts: 7
Joined: Tue Oct 11, 2005 5:45 pm
Location: Mladenovac, SCG

Post by Avram »

Oh, I just can't work out the solution. It's so confusing... Help me please. Anyone!?
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Regex is not really a viable solution for this, especially when fairly simple string parsing would do it without a hell of a lot of pain, depending on how intelligent it needs to be.
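One way to read "fairly simple string parsing" is a loop that walks the buffer with strpos, copies everything between "<" and ">" verbatim, and applies the replacement only to the text in between. This is a sketch of that idea, not code from anyone in this thread. The function name replace_outside_tags is made up for illustration, and str_ireplace assumes PHP 5 or later.

```php
<?php
// Hypothetical helper: replaces $search with $replace only outside tags.
function replace_outside_tags($html, $search, $replace)
{
    $out = '';
    $pos = 0;
    $len = strlen($html);

    while ($pos < $len) {
        $open = strpos($html, '<', $pos);
        if ($open === false) {
            // No more tags: the rest is plain text.
            $out .= str_ireplace($search, $replace, substr($html, $pos));
            break;
        }
        // Text before the tag gets the replacement.
        $out .= str_ireplace($search, $replace, substr($html, $pos, $open - $pos));

        $close = strpos($html, '>', $open);
        if ($close === false) {
            // Unterminated tag: copy the rest untouched.
            $out .= substr($html, $open);
            break;
        }
        // Copy the tag itself verbatim.
        $out .= substr($html, $open, $close - $open + 1);
        $pos = $close + 1;
    }
    return $out;
}

$html = "<a href='http://super-xhtml-site.com'>link</a> about xhtml";
$result = replace_outside_tags(
    $html,
    'xhtml',
    "<a title='eXtensible HyperText Markup Language'>xhtml</a>"
);
echo $result;
```

The URL inside the href stays intact while the trailing "xhtml" gets wrapped. The loop does not understand comments, script blocks, or a ">" inside a quoted attribute value, so it only goes as far as the "level of intelligence" caveat allows.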
Nathaniel
Forum Contributor
Posts: 396
Joined: Wed Aug 31, 2005 5:58 pm
Location: Arkansas, USA

Post by Nathaniel »

I'd replace XHTML with <acronym title="eXtensible HyperText Markup Language">XHTML</acronym>... then you could have your <a> tags around your result and still validate. If I remember the order of the tags correctly.
Avram
Forum Newbie
Posts: 7
Joined: Tue Oct 11, 2005 5:45 pm
Location: Mladenovac, SCG

Post by Avram »

feyd: Please explain how I can do that without regex.
Nathaniel: That's my last resort. If I can't solve the problem any other way, I'll use it, because for now I want it to look like this (to use overlib). But what if I want to explain "html"? What will happen to the <html> tag? :(

Edit: but even if I use acronym, when the term is in the target URL of the <a> tag, the URL will be messed up, so this is not the right way to do it :(
Jenk
DevNet Master
Posts: 3587
Joined: Mon Sep 19, 2005 6:24 am
Location: London

Post by Jenk »

Code:

$txt = str_ireplace(' XHTML ', ' <a title="eXtensible HyperText Markup Language">xhtml</a> ', $txt);
That would get all instances of XHTML with a space on either side.
Nathaniel
Forum Contributor
Posts: 396
Joined: Wed Aug 31, 2005 5:58 pm
Location: Arkansas, USA

Post by Nathaniel »

Lol @ Jenk. Why do we programmers always save the simplest method for last? ;)
Avram
Forum Newbie
Posts: 7
Joined: Tue Oct 11, 2005 5:45 pm
Location: Mladenovac, SCG

Post by Avram »

Okay... but what if there is "XHTML:" or "XHTML." (last word in a sentence)? I think I'm going to add a replace for each case :))

Btw, I don't have str_ireplace (PHP 5 only), so I used eregi_replace....
Jenk
DevNet Master
Posts: 3587
Joined: Mon Sep 19, 2005 6:24 am
Location: London

Post by Jenk »

Code:

<?php

$string = ' XHTML. osaudnfosjdnfon XHTML: asdasdad http://www.XHTML-boo.com asdasd xhtmlasdasdasd XhTmL';

echo preg_replace('/\s+(XHTML[\s\.:]?)/i', '<a title="eXtensible HyperText Markup Language">$1</a>', $string);

?>
Outputs:

Code:

<a title="eXtensible HyperText Markup Language">XHTML.</a> osaudnfosjdnfon<a title="eXtensible HyperText Markup Language">XHTML:</a> asdasdad http://www.XHTML-boo.com asdasd<a title="eXtensible HyperText Markup Language">xhtml</a>asdasdasd<a title="eXtensible HyperText Markup Language">XhTmL</a>
:)

EDIT: My regex isn't spot on, but that gives a starting point I guess.
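The output above shows where the pattern is off: the leading whitespace is swallowed, the punctuation ends up inside the link, and "xhtmlasdasdasd" still matches. A possible tightening, offered as a sketch on the same test string, uses lookarounds so that nothing but the term itself is consumed. The character classes are a guess at what counts as "part of a word or URL" and would need tuning for real content. It does not deal with tags at all.

```php
<?php
$string = ' XHTML. osaudnfosjdnfon XHTML: asdasdad http://www.XHTML-boo.com asdasd xhtmlasdasdasd XhTmL';

// (?<![\w./-]) : not preceded by a word character, dot, slash, or hyphen
//                (rejects "www.XHTML-boo.com").
// (?![\w-])    : not followed by a word character or hyphen
//                (rejects "xhtmlasdasdasd").
// Lookarounds consume nothing, so spaces and punctuation stay outside the link.
$pattern = '~(?<![\w./-])xhtml(?![\w-])~i';

$result = preg_replace(
    $pattern,
    '<a title="eXtensible HyperText Markup Language">$0</a>',
    $string
);

echo $result;
```

On this string the three standalone occurrences should be linked with their original casing, while the URL and "xhtmlasdasdasd" are left untouched.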
Last edited by Jenk on Thu Oct 13, 2005 7:58 am, edited 1 time in total.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Sadly, Jenk's idea, although clever, isn't bulletproof. Unfortunately, I don't have the time to write up a string parser that does this correctly...
Jenk
DevNet Master
Posts: 3587
Joined: Mon Sep 19, 2005 6:24 am
Location: London

Post by Jenk »

Am I correct in guessing a tokeniser is needed? :)
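A tokeniser along those lines might look like the sketch below. It splits the buffer into alternating text and tag tokens with preg_split, leaves tag tokens alone, and skips text that sits inside <a>, <script>, or <style>. Everything here is an assumption for illustration: the function name annotate_terms, the $glossary array, and the choice of elements to skip. Closures require PHP 5.3 or later, so this would not run on the PHP 4 setup mentioned earlier.

```php
<?php
// Hypothetical helper; $glossary maps term => explanation.
function annotate_terms($html, array $glossary)
{
    if (!$glossary) {
        return $html;
    }
    $lookup = array_change_key_case($glossary, CASE_LOWER);
    $quoted = array_map(function ($term) {
        return preg_quote($term, '~');
    }, array_keys($lookup));
    // One pattern for all terms, so inserted titles are never re-scanned.
    $pattern = '~\b(' . implode('|', $quoted) . ')\b~i';

    // Even indexes are text tokens, odd indexes are tag tokens.
    $tokens = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $skipDepth = 0; // > 0 while inside <a>, <script>, or <style>
    $out = '';

    foreach ($tokens as $i => $token) {
        if ($i % 2 === 1) {
            // Tag token: track skipped elements, copy verbatim.
            if (preg_match('~^<(/?)(a|script|style)\b~i', $token, $m)) {
                $skipDepth = max(0, $skipDepth + ($m[1] === '/' ? -1 : 1));
            }
            $out .= $token;
            continue;
        }
        if ($skipDepth === 0 && $token !== '') {
            $token = preg_replace_callback($pattern, function ($m) use ($lookup) {
                $title = htmlspecialchars($lookup[strtolower($m[1])]);
                return '<a title="' . $title . '">' . $m[1] . '</a>';
            }, $token);
        }
        $out .= $token;
    }
    return $out;
}

$html = '<html><body><a href="http://super-xhtml-site.com">xhtml link</a> '
      . 'Plain XHTML and html.</body></html>';
$result = annotate_terms($html, array(
    'xhtml' => 'eXtensible HyperText Markup Language',
    'html'  => 'HyperText Markup Language',
));
echo $result;
```

Here the <html> tag, the href, and the existing link text are left alone, and only the two terms in plain text get a title. Comments, CDATA, and a ">" inside an attribute value would still trip it up; a real HTML parser would be the next step if that matters.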