[SOLVED] Simple REGEX needed...

tomfra
Forum Contributor
Posts: 126
Joined: Wed Jun 23, 2004 12:56 pm
Location: Prague, Czech Republic

Post by tomfra »

I need to replace a string within an HTML page that looks like:

Code: Select all

onload="if (this.width<50) {this.src='/images/someimage.gif'; this.width='120'; this.height='90'}"
...with an empty string. Generally, I want to get rid of everything, or simply the value between the quotes. However, sometimes there can also be a space before / after the "=" sign, i.e. onload = "something".

Any simple and hopefully fast regex for this?

Thanks a lot!

Tomas
Last edited by tomfra on Fri Aug 27, 2004 1:15 pm, edited 1 time in total.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

untested

Code: Select all

preg_replace('#onload\s*=\s*([\'"]?).*?\\1\s+#i','',$text);
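For illustration, here is a minimal sketch of a variant of this pattern. It assumes the attribute value is always quoted, and it consumes the whitespace in front of the attribute rather than behind it, so the attribute is also removed when it is the last one before the closing ">". The sample inputs are hypothetical.

```php
<?php
// Hypothetical sample inputs covering the quoting and spacing variants.
$samples = array(
    '<img src="a.gif" onload="if (this.width<50) {this.src=\'/images/someimage.gif\'}" alt="x">',
    '<img src="a.gif" onload = \'this.border=1\'>',
);

// \s+      whitespace in front of the attribute (removed together with it)
// \s*=\s*  optional spaces around "="
// (["'])   the opening quote, double or single
// .*?\1    everything up to the next occurrence of that same quote
$pattern = '#\s+onload\s*=\s*(["\']).*?\1#is';

$results = array();
foreach ($samples as $html) {
    $results[] = preg_replace($pattern, '', $html);
}

print_r($results);
```

Because the closing quote is matched with a backreference, single quotes inside a double-quoted value (as in the first sample) do not end the match early.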
Last edited by feyd on Wed Aug 24, 2005 7:44 am, edited 1 time in total.

Post by tomfra »

Thanks Feyd!

I tried this code:

Code: Select all

$page_part = stripslashes(preg_replace("/onload\s*=\s*\"[^>]*\"/i", '', $page_part));
and it seems to be working. Is there anything wrong with this code? You know, it was kind of...luck that I figured it out :)

The reason I need this regex is because of a bug / strange behaviour in strip_tags. It works great but in the above example JavaScript code it has problems with this part:

Code: Select all

(this.width<50)
Or actually with the "<" sign in it. For some reason strip_tags stops converting the html content into plain text at that point. When you try to strip tags on this example:

Code: Select all

<img src="image.gif" onload="if (this.width<50) {this.src='image2.gif'; this.width='120'; this.height='90'}">
<p>This is some text</p>
It will not output anything. If you get rid of the "<" in the JavaScript code - e.g. change it to (this.width=50), everything will work as expected.
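A minimal sketch of the workaround described here: remove the onload attribute first, then call strip_tags. The helper name strip_tags_without_onload is hypothetical, and the pattern assumes the attribute value is always quoted.

```php
<?php
// Hypothetical helper name; not part of PHP.
function strip_tags_without_onload($html)
{
    // Remove quoted onload attributes first so that "<" or ">" inside the
    // JavaScript can no longer confuse the tag stripping that follows.
    $html = preg_replace('#\s+onload\s*=\s*(["\']).*?\1#is', '', $html);

    return strip_tags($html);
}

$page = '<img src="image.gif" onload="if (this.width<50) {this.src=\'image2.gif\'; this.width=\'120\'; this.height=\'90\'}">
<p>This is some text</p>';

$plain = trim(strip_tags_without_onload($page));
echo $plain, "\n"; // prints "This is some text"
```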

Is there a better fix than getting rid of the JS code completely via preg_replace? If not, I will simply use that, but I suspect there are more situations where something like this could happen.

Thanks!

Tomas

Post by feyd »

you could create your own strip_tags.. :D (untested)

Code: Select all

<?php

$stripped = preg_replace('#</?.*?>#','',$text);

?>

Post by tomfra »

This seems to be working too. I only made a little modification:

Code: Select all

$stripped = preg_replace('#</?.*?\>#','',$text);
I don't know whether this will alter the functionality in some situations, though, because I know next to nothing about regex. I added the backslash because my PsPad couldn't highlight the code properly: it thought the PHP code ended at the ? > tag.

One has to wonder why the strip_tags function even exists in PHP when it's buggy and can be replaced with a one-liner. I guess it may be because strip_tags is faster?

Tomas

Post by feyd »

strip_tags is there for people who know nothing about the regular expression functions. It is also useful because it is faster (it is compiled code), and it will leave specific tags alone if you pass them to it as allowed tags.
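As a small illustration of the allowed-tags argument, assuming a hypothetical input string:

```php
<?php
// Hypothetical input for illustration.
$html = '<p>Hello <b>world</b>, see <a href="http://example.com">this</a>.</p>';

// The second argument lists the tags that strip_tags() should keep.
// Here <p> and <a> survive (with their attributes); <b> is removed.
$kept = strip_tags($html, '<p><a>');

echo $kept, "\n";
// prints: <p>Hello world, see <a href="http://example.com">this</a>.</p>
```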

Post by tomfra »

Ok, one very similar problem... sorry ;)

This code has a problem when using the regex above:

Code: Select all

<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45"></A></TD>
It filters out everything up to the ">50" part but thinks that this part is plain text:

Code: Select all

50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">
It's obviously because of the unexpected ">" sign. Is there a way to filter out all the code, including the part after ">50"?

Thanks!

Tomas

Post by feyd »

what have you tried so far to fix it on your own?

Post by tomfra »

I've tried playing with the regex a little, but when it comes to regex I am a total newbie, so no luck there. I can't think of any "clean" solution. I think I will have to use one regex to get rid of the javascript code and then use the other regex. But that doesn't sound great either; I'd like to keep as little regex code as possible so that it doesn't slow everything down too much.

Tomas

Post by feyd »

regex is generally very fast, unless written very very poorly.

Code: Select all

<?php

$test = '<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">testestets>blah</A></TD>';

echo htmlentities(preg_replace('#<.*?(\s+[\w\W]+?(\s*=\s*([\'"]?).*?\\3))*?>#s','',$test),ENT_QUOTES);

?>
outputs

Code: Select all

testestets>blah
I'd suggest building a list of valid tags so that the match strips only actual tags. However, that adds a bit of maintenance.
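One possible shape for such a list-based version, as a rough sketch only: the tag list is hypothetical, and the pattern is a different, simpler one than the regex above. It consumes quoted attribute values whole, so a ">" inside a quoted value does not end the tag.

```php
<?php
// Hypothetical list of tag names to strip; extend it as needed.
$tags = array('td', 'a', 'img', 'p');

// Build an alternation such as "td|a|img|p", escaping each name for the regex.
$names = implode('|', array_map(function ($t) {
    return preg_quote($t, '#');
}, $tags));

// Match an opening or closing tag whose name is in the list. Quoted attribute
// values are consumed as a unit; anything else inside the tag must not be a
// quote or ">".
$pattern = '#</?(?:' . $names . ')\b(?:"[^"]*"|\'[^\']*\'|[^\'">])*>#i';

$test = '<TD WIDTH="14%"><A HREF="http://something.xxx">
<IMG SRC="image.gif" ONLOAD="if (this.width>50) this.border=1" WIDTH="45">testestets>blah</A></TD>';

$result = preg_replace($pattern, '', $test);
echo trim($result), "\n"; // prints "testestets>blah"
```

Text such as "a < b" or a tag whose name is not in the list is left untouched, which is the point of the list; the cost is keeping the list up to date.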
Last edited by feyd on Wed Aug 24, 2005 7:48 am, edited 2 times in total.

Post by tomfra »

Actually, the regex without htmlentities seems to be working very well.

i.e.:

Code: Select all

preg_replace('#<.*?(\s+[\w\W]+?(\s*=\s*([\'"]?).*?\\3))*?\>#s','',$test)
Using the above example it will return this string:

testestets>blah

Which is completely correct because the ">" sign is a part of the text, but it gets rid of the ">" sign in the javascript. I am testing everything now and so far haven't found a problem with it.

I'll really have to learn all this regex stuff...

Thanks again!

Tomas