PHP Developers Network

A community of PHP developers offering assistance, advice, discussion, and friendship.
 

All times are UTC - 5 hours




Posted: Fri Aug 27, 2004 9:07 am
Forum Contributor

Joined: Wed Jun 23, 2004 12:56 pm
Posts: 126
Location: Prague, Czech Republic
I need to replace a string within an HTML page that looks like this:

onload="if (this.width<50) {this.src='/images/someimage.gif'; this.width='120'; this.height='90'}"


...with an empty string. Generally, I want to get rid of everything, or simply the values between the quotes. However, sometimes there can also be a space before or after the "=" sign, i.e. onload = "something".

Any simple and hopefully fast regex for this?

Thanks a lot!

Tomas


Last edited by tomfra on Fri Aug 27, 2004 1:15 pm, edited 1 time in total.

Posted: Fri Aug 27, 2004 10:20 am
Neighborhood Spidermoddy

Joined: Mon Mar 29, 2004 4:24 pm
Posts: 31559
Location: Bothell, Washington, USA
untested
preg_replace('#onload\s*=\s*([\'"]?).*?\\1\s+#i','',$text);
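A quick way to sanity-check the pattern above is to run it on a couple of sample tags. The sketch below is only an illustration: both `<body>` strings are made up. The second one shows a limitation worth knowing about: the trailing `\s+` needs whitespace after the attribute, so an `onload` that sits directly before the closing `>` is left alone.

```php
<?php
// The pattern from the post, unchanged: the optional quote is captured
// in group 1 and reused through the \1 backreference.
$pattern = '#onload\s*=\s*([\'"]?).*?\\1\s+#i';

// Hypothetical sample inputs (not taken from the original page)
$withSpaces = '<body onload = "init()" class="x">';
$lastAttr   = '<body onload="init()">';

$a = preg_replace($pattern, '', $withSpaces);
$b = preg_replace($pattern, '', $lastAttr);

echo $a, "\n"; // <body class="x">
echo $b, "\n"; // <body onload="init()"> (unchanged: no whitespace follows the value)
```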


Last edited by feyd on Wed Aug 24, 2005 7:44 am, edited 1 time in total.

Posted: Fri Aug 27, 2004 12:00 pm
Forum Contributor

Joined: Wed Jun 23, 2004 12:56 pm
Posts: 126
Location: Prague, Czech Republic
Thanks Feyd!

I tried this code:

$page_part = stripslashes(preg_replace("/onload\s*=\s*\"[^>]*\"/i", '', $page_part));


and it seems to be working. Is there anything wrong with this code? You know, it was kind of...luck that I figured it out :)
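One thing to watch for in that pattern: `[^>]*` is greedy, so it runs up to the last double quote before the tag's closing `>` and can swallow any attributes that follow `onload`. Here is a minimal sketch of the difference, assuming a made-up `<img>` tag and a tighter pattern that stops at the closing quote of the attribute value. It is an illustration, not a tested drop-in replacement.

```php
<?php
// Hypothetical sample tag with an attribute after onload
$html = '<img src="a.gif" onload="if (this.width<50) {this.width=\'120\'}" alt="x">';

// Pattern from the post: [^>]* runs to the last double quote before ">"
$greedy = preg_replace('/onload\s*=\s*"[^>]*"/i', '', $html);

// Tighter sketch: match one double-quoted or one single-quoted value only
$tight = preg_replace('/\s+onload\s*=\s*(?:"[^"]*"|\'[^\']*\')/i', '', $html);

echo $greedy, "\n"; // <img src="a.gif" >
echo $tight, "\n";  // <img src="a.gif" alt="x">
```

The stripslashes() call is probably only needed if the input was escaped earlier (for example by magic quotes); on unescaped input it would also remove legitimate backslashes.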

The reason I need this regex is because of a bug / strange behaviour in strip_tags. It works great but in the above example JavaScript code it has problems with this part:

(this.width<50)


Or actually with the "<" sign in it. For some reason strip_tags stops converting the HTML content into plain text at that point. When you try to strip tags on this example:

<img src="image.gif" onload="if (this.width<50) {this.src='image2.gif'; this.width='120'; this.height='90'}">

<p>This is some text</p>


It will not output anything. If you get rid of the "<" in the JavaScript code - e.g. change it to (this.width=50), everything will work as expected.

Is there a better fix than getting rid of the JS code completely via preg_replace? If not, I will simply use that, but I suspect there are more situations where something like this could happen.

Thanks!

Tomas


Posted: Fri Aug 27, 2004 12:20 pm
Neighborhood Spidermoddy

Joined: Mon Mar 29, 2004 4:24 pm
Posts: 31559
Location: Bothell, Washington, USA
you could create your own strip_tags.. :D (untested)
<?php
$stripped = preg_replace('#</?.*?>#','',$text);
?>


Posted: Fri Aug 27, 2004 12:50 pm
Forum Contributor

Joined: Wed Jun 23, 2004 12:56 pm
Posts: 126
Location: Prague, Czech Republic
This seems to be working too. I only made a little modification:

$stripped = preg_replace('#</?.*?\>#','',$text);


I don't know whether this will alter the functionality in some situations, because I know next to nothing about regex. I added the backslash because my PSPad couldn't highlight the code properly: it thought the PHP code ended at the ?> tag.

One has to wonder why the strip_tags function even exists in PHP when it's buggy and can be replaced with a one-liner. I guess it may be because strip_tags is faster?
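For what it's worth, in PCRE a backslash in front of a non-alphanumeric character just makes that character literal, so `\>` and `>` should behave the same. A small sketch to check this on the earlier `this.width<50` example (the sample string is shortened for illustration):

```php
<?php
// Sample input based on the earlier post, shortened for illustration
$text = '<img src="image.gif" onload="if (this.width<50) {this.width=\'120\'}"><p>This is some text</p>';

$plain   = preg_replace('#</?.*?>#', '', $text);
$escaped = preg_replace('#</?.*?\>#', '', $text);

var_dump($plain === $escaped); // bool(true)
echo $plain, "\n";             // This is some text
```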

Tomas


Posted: Fri Aug 27, 2004 12:57 pm
Neighborhood Spidermoddy

Joined: Mon Mar 29, 2004 4:24 pm
Posts: 31559
Location: Bothell, Washington, USA
strip_tags is there for people who know nothing about the regular expression functions. It is also useful because it is faster (it is compiled code), and it will leave specific tags in place if you pass them to it.
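strip_tags() takes an optional second argument listing the tags to keep, which is what leaving tags alone refers to here. A minimal illustration, with a made-up input string:

```php
<?php
// Hypothetical input, just to show the effect of the second argument
$html = '<p>Hello <b>world</b></p>';

$all  = strip_tags($html);        // every tag removed
$keep = strip_tags($html, '<b>'); // <b> is kept, everything else removed

echo $all, "\n";  // Hello world
echo $keep, "\n"; // Hello <b>world</b>
```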


Posted: Wed Sep 01, 2004 4:09 pm
Forum Contributor

Joined: Wed Jun 23, 2004 12:56 pm
Posts: 126
Location: Prague, Czech Republic
Ok, one very similar problem... sorry ;)

This code has a problem when using the regex above:

<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45"></A></TD>


It filters out everything up to the ">50" part but treats this part as plain text:

50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">


It's obviously because of the unexpected ">" sign. Is there a way to filter out all the code, including the part after ">50"?

Thanks!

Tomas


Posted: Wed Sep 01, 2004 4:27 pm
Neighborhood Spidermoddy

Joined: Mon Mar 29, 2004 4:24 pm
Posts: 31559
Location: Bothell, Washington, USA
what have you tried so far to fix it on your own?


Posted: Wed Sep 01, 2004 5:11 pm
Forum Contributor

Joined: Wed Jun 23, 2004 12:56 pm
Posts: 126
Location: Prague, Czech Republic
I've tried playing with the regex a little, but when it comes to regex I am a total newbie, so no luck there. I can't think of any "clean" solution. I think I will have to use one regex to get rid of the JavaScript code and then run the other regex. But that doesn't sound great either; I'd like to keep as little regex code as possible so that it doesn't slow everything down too much.

Tomas


Posted: Wed Sep 01, 2004 5:19 pm
Neighborhood Spidermoddy

Joined: Mon Mar 29, 2004 4:24 pm
Posts: 31559
Location: Bothell, Washington, USA
regex is generally very fast, unless written very very poorly.

<?php
$test = '<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">testestets>blah</A></TD>';

echo htmlentities(preg_replace('#<.*?(\s+[\w\W]+?(\s*=\s*([\'"]?).*?\\3))*?>#s','',$test),ENT_QUOTES);
?>
outputs
testestets>blah


I'd suggest building a list of valid tags so the matching is more accurate and strips only actual tags. However, that adds a bit of maintenance.
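A rough sketch of what such a tag list could look like. The helper name `strip_known_tags` and the sample strings are made up for illustration, and the pattern assumes attribute values are either properly quoted or contain no quotes or `>`.

```php
<?php
// Hypothetical helper: strips only the tags whose names are listed.
function strip_known_tags(string $html, array $tagNames): string
{
    // Escape each name so it is safe inside the pattern
    $names = implode('|', array_map(function ($name) {
        return preg_quote($name, '#');
    }, $tagNames));

    // After the tag name, consume quoted values whole (they may contain
    // < or >), or any single character that is not a quote or ">"
    $pattern = '#</?(?:' . $names . ')\b(?:"[^"]*"|\'[^\']*\'|[^\'">])*>#i';

    return preg_replace($pattern, '', $html);
}

$test = '<TD WIDTH="14%"><A HREF="http://something.xxx">'
      . '<IMG SRC="image.gif" ONLOAD="if (this.width>50) this.border=1" WIDTH="45">'
      . 'testestets>blah</A></TD>';

echo strip_known_tags($test, ['td', 'a', 'img']), "\n";      // testestets>blah
echo strip_known_tags('1 < 2 and <b>bold</b>', ['b']), "\n"; // 1 < 2 and bold
```

Because only listed names match, a stray "<" in ordinary text is left alone, at the cost of keeping the list up to date.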


Last edited by feyd on Wed Aug 24, 2005 7:48 am, edited 2 times in total.

Posted: Wed Sep 01, 2004 6:19 pm
Forum Contributor

Joined: Wed Jun 23, 2004 12:56 pm
Posts: 126
Location: Prague, Czech Republic
Actually, the regex without htmlentities seems to be working very well.

i.e.:

preg_replace('#<.*?(\s+[\w\W]+?(\s*=\s*([\'"]?).*?\\3))*?\>#s','',$test)


Using the above example it will return this string:

testestets>blah

Which is completely correct because the ">" sign is a part of the text, but it gets rid of the ">" sign in the javascript. I am testing everything now and so far haven't found a problem with it.
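The only difference htmlentities() makes here is that it encodes the remaining ">" as `&gt;`, which a browser then displays as ">" again. A quick sketch using the regex and test string from this thread; the trim() call is an addition to drop the line break left between the removed tags:

```php
<?php
$test = '<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">testestets>blah</A></TD>';

$stripped = preg_replace('#<.*?(\s+[\w\W]+?(\s*=\s*([\'"]?).*?\\3))*?>#s', '', $test);

// trim() removes the leftover line break from between the stripped tags
$plain   = trim($stripped);
$encoded = htmlentities($plain, ENT_QUOTES);

echo $plain, "\n";   // testestets>blah
echo $encoded, "\n"; // testestets&gt;blah
```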

I'll really have to learn all this regex stuff...

Thanks again!

Tomas

