[SOLVED] Simple REGEX needed...

Postby tomfra » Fri Aug 27, 2004 9:07 am

I need to replace a string within html page that looks like:

Syntax: [ Download ] [ Hide ]
onload="if (this.width<50) {this.src='/images/someimage.gif'; this.width='120'; this.height='90'}"


...with an empty string. Generally, I want to get rid of the whole attribute, or at least the value between the quotes. However, sometimes there can also be a space before / after the "=" sign - i.e. onload = "something".

Any simple and hopefully fast regex for this?

Thanks a lot!

Tomas
Last edited by tomfra on Fri Aug 27, 2004 1:15 pm, edited 1 time in total.
tomfra
Forum Contributor
 
Posts: 126
Joined: Wed Jun 23, 2004 12:56 pm
Location: Prague, Czech Republic

Postby feyd » Fri Aug 27, 2004 10:20 am

untested
Syntax: [ Download ] [ Hide ]
preg_replace('#onload\s*=\s*([\'"]?).*?\\1\s+#i','',$text);
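A possible variation on the pattern above, as a sketch only: the value is matched as a double-quoted string, a single-quoted string, or a bare token, so quotes of the other kind inside the value (as in the original example) do not end the match early, and no whitespace is required after the attribute. The helper name remove_onload and the sample inputs are made up for illustration; the pattern would also hit a matching onload=... sequence in plain text outside a tag.

```php
<?php
// Hypothetical helper (not from the thread): removes onload attributes.
function remove_onload(string $html): string
{
    // \s+            whitespace before the attribute (removed with it)
    // onload\s*=\s*  attribute name, optional spaces around "="
    // value          "..." or '...' or an unquoted token
    $pattern = '#\s+onload\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)#i';
    // preg_replace returns null on a regex error; fall back to the input.
    return preg_replace($pattern, '', $html) ?? $html;
}

$html = '<img src="a.gif" onload="if (this.width<50) {this.src=\'/images/someimage.gif\'}">';
echo remove_onload($html), "\n";
echo remove_onload('<body onload = "init()" class="x">'), "\n";
```

The leading \s+ is part of the match so that no stray space is left behind in the tag.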
Last edited by feyd on Wed Aug 24, 2005 7:44 am, edited 1 time in total.
feyd
Neighborhood Spidermoddy
 
Posts: 31559
Joined: Mon Mar 29, 2004 4:24 pm
Location: Bothell, Washington, USA

Postby tomfra » Fri Aug 27, 2004 12:00 pm

Thanks Feyd!

I tried this code:

Syntax: [ Download ] [ Hide ]
$page_part = stripslashes(preg_replace("/onload\s*=\s*\"[^>]*\"/i", '', $page_part));


and it seems to be working. Is there anything wrong with this code? You know, it was kind of...luck that I figured it out :)
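One thing to watch for with this pattern: [^>]* is greedy and also matches quotes, so if other attributes follow onload inside the same tag they are removed as well. A small sketch with a made-up input shows the difference. The narrower [^"]* variant is only one possible fix and assumes the value never contains a double quote. stripslashes() is unrelated to the regex; it only removes backslashes from the result.

```php
<?php
// Made-up input: another attribute follows onload inside the same tag.
$html = '<img src="a.gif" onload="init()" alt="Preview">';

// Pattern from the post: [^>]* runs up to the last quote before ">".
$greedy = preg_replace("/onload\s*=\s*\"[^>]*\"/i", '', $html);

// Narrower variant: the value ends at the first closing double quote.
$narrow = preg_replace('/onload\s*=\s*"[^"]*"/i', '', $html);

echo $greedy, "\n"; // the alt attribute is gone as well
echo $narrow, "\n"; // only the onload attribute is gone
```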

The reason I need this regex is because of a bug / strange behaviour in strip_tags. It works great but in the above example JavaScript code it has problems with this part:

Syntax: [ Download ] [ Hide ]
(this.width<50)


Or actually with the "<" sign in it. For some reason strip_tags stops converting the HTML content into plain text at that point. When you try to strip tags on this example:

Syntax: [ Download ] [ Hide ]
<img src="image.gif" onload="if (this.width<50) {this.src='image2.gif'; this.width='120'; this.height='90'}">

<p>This is some text</p>


It will not output anything. If you get rid of the "<" in the JavaScript code - e.g. change it to (this.width=50), everything will work as expected.

Is there a better fix than getting rid of the JS code completely via preg_replace? If not, I will simply use that, but I suspect there are more situations where something like this could happen.

Thanks!

Tomas

Postby feyd » Fri Aug 27, 2004 12:20 pm

you could create your own strip_tags.. :D (untested)
Syntax: [ Download ] [ Hide ]
<?php
$stripped = preg_replace('#</?.*?>#','',$text);
?>

Postby tomfra » Fri Aug 27, 2004 12:50 pm

This seems to be working too. I only made a little modification:

Syntax: [ Download ] [ Hide ]
$stripped = preg_replace('#</?.*?\>#','',$text);


I don't know whether this will alter the functionality in some situations, though, because I know next to nothing about regex. I added the backslash because my PsPad couldn't highlight the code properly: it thought the PHP code ended at the ?> sequence.
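As far as PCRE is concerned, a backslash in front of a non-alphanumeric character simply makes that character literal, so \> should match exactly what > matches. A quick check on a made-up string:

```php
<?php
// Made-up sample input.
$text = '<p>Some <b>bold</b> text</p>';

$plain   = preg_replace('#</?.*?>#', '', $text);  // original pattern
$escaped = preg_replace('#</?.*?\>#', '', $text); // with the extra backslash

var_dump($plain === $escaped); // true when both patterns strip the same tags
echo $plain, "\n";
```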

One has to wonder why the strip_tags function even exists in PHP when it's buggy and can be replaced with a one-liner. I guess it may be because strip_tags is faster?

Tomas

Postby feyd » Fri Aug 27, 2004 12:57 pm

strip_tags is there for people who know nothing about the regular expression functions. It is also useful because it is faster (it is compiled code), and it will leave selected tags alone if you pass them as the second argument.
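For reference, a short sketch of the two ways strip_tags can be called; the sample markup is made up:

```php
<?php
// Made-up sample markup.
$html = '<p>Hello <b>world</b> <a href="#">link</a></p>';

echo strip_tags($html), "\n";           // every tag removed
echo strip_tags($html, '<p><a>'), "\n"; // <p> and <a> are kept, <b> is stripped
```

Attributes on the tags that are kept are left untouched, so an onload handler on an allowed tag would survive.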

Postby tomfra » Wed Sep 01, 2004 4:09 pm

Ok, one very similar problem... sorry ;)

This code has a problem when using the regex above:

Syntax: [ Download ] [ Hide ]
<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45"></A></TD>


It filters out everything up to the ">50" part but thinks that this part is plain text:

Syntax: [ Download ] [ Hide ]
50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">


It's obviously because of the unexpected ">" sign. Is there a way to filter out all the code, including the part after ">50"?

Thanks!

Tomas

Postby feyd » Wed Sep 01, 2004 4:27 pm

what have you tried so far to fix it on your own?

Postby tomfra » Wed Sep 01, 2004 5:11 pm

I've tried playing with the regex a little but when it comes to regex I am a total newbie so no luck there. I can't think of any "clean" solution. I think I will have to use one regex to get rid of the javascript code and then use the other regex. But that doesn't sound great either, I'd like to keep as little regex code as possible so that it doesn't slow down everything too much.

Tomas

Postby feyd » Wed Sep 01, 2004 5:19 pm

regex is generally very fast, unless written very very poorly.

Syntax: [ Download ] [ Hide ]
<?php
$test = '<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">testestets>blah</A></TD>';

echo htmlentities(preg_replace('#<.*?(\s+[\w\W]+?(\s*=\s*([\'"]?).*?\\3))*?>#s','',$test),ENT_QUOTES);
?>
outputs
Syntax: [ Download ] [ Hide ]
testestets>blah


I'd suggest building a list of valid tags so that the pattern only strips actual tags. However, that adds a bit to maintenance.
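A rough sketch of what such a list-based variant could look like. It is not the pattern used above: the helper name strip_listed_tags and the short tag list are made up for illustration, and the arrow function assumes PHP 7.4 or later. Quoted attribute values are consumed whole, so a ">" inside quotes does not end the tag.

```php
<?php
// Hypothetical helper: strips only tags whose names appear in $tagNames.
function strip_listed_tags(string $html, array $tagNames): string
{
    $names = implode('|', array_map(fn($n) => preg_quote($n, '#'), $tagNames));
    // After the tag name: quoted values are skipped as a unit, anything
    // else is taken one character at a time until the closing bracket.
    $pattern = '#</?(?:' . $names . ')\b(?:"[^"]*"|\'[^\']*\'|[^>"\'])*>#i';
    // preg_replace returns null on a regex error; fall back to the input.
    return preg_replace($pattern, '', $html) ?? $html;
}

$test = '<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">testestets>blah</A></TD>';

echo trim(strip_listed_tags($test, ['td', 'a', 'img'])), "\n";

// Text that merely contains "<" and ">" is left alone.
echo strip_listed_tags('if (x<5 && y>3) <notatag>', ['td', 'a', 'img']), "\n";
```

The trade-off is the one mentioned above: every tag that should be stripped has to be added to the list.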
Last edited by feyd on Wed Aug 24, 2005 7:48 am, edited 2 times in total.

Postby tomfra » Wed Sep 01, 2004 6:19 pm

Actually, the regex without htmlentities seems to be working very well.

i.e.:

Syntax: [ Download ] [ Hide ]
preg_replace('#<.*?(\s+[\w\W]+?(\s*=\s*([\'"]?).*?\\3))*?\>#s','',$test)


Using the above example it will return this string:

testestets>blah

Which is completely correct because the ">" sign is a part of the text, but it gets rid of the ">" sign in the javascript. I am testing everything now and so far haven't found a problem with it.

I'll really have to learn all this regex stuff...

Thanks again!

Tomas