[SOLVED] Simple REGEX needed...
Posted: Fri Aug 27, 2004 9:07 am
by tomfra
I need to replace a string within an HTML page that looks like:
Code:
onload="if (this.width<50) {this.src='/images/someimage.gif'; this.width='120'; this.height='90'}"
...with an empty string. Generally, I want to get rid of everything or simply the values between quotes. However, sometimes there can also be a space before / after the "=" sign - i.e. onload = "something".
Any simple and hopefully fast regex for this?
Thanks a lot!
Tomas
Posted: Fri Aug 27, 2004 10:20 am
by feyd
untested
Code:
preg_replace('#onload\s*=\s*([\'"]?).*?\\1\s+#i','',$text);
Posted: Fri Aug 27, 2004 12:00 pm
by tomfra
Thanks Feyd!
I tried this code:
Code:
$page_part = stripslashes(preg_replace("/onload\s*=\s*\"[^>]*\"/i", '', $page_part));
and it seems to be working. Is there anything wrong with this code? You know, it was kind of...luck that I figured it out
The reason I need this regex is because of a bug / strange behaviour in strip_tags. It works great but in the above example JavaScript code it has problems with this part:
Or actually with the "<" sign in it. For some reason strip_tags stops converting the html content into plain text at that point. When you try to strip tags on this example:
Code:
<img src="image.gif" onload="if (this.width<50) {this.src='image2.gif'; this.width='120'; this.height='90'}">
<p>This is some text</p>
It will not output anything. If you get rid of the "<" in the JavaScript code - e.g. change it to (this.width=50), everything will work as expected.
Is there a better fix than getting rid of the JS code completely via preg_replace? If not then I will simply use that but there may be more situations when something like this could happen in my opinion.
Thanks!
Tomas
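On the question of whether anything is wrong with the `[^>]*` pattern above, here is a minimal sketch, assuming the attribute value is always quoted. `[^>]*` is greedy, so it runs to the last double quote before the next `>` and can take later attributes with it. When the JavaScript itself contains a `>`, it may not match at all. The helper name `strip_onload` is made up for illustration and is not from the thread.

```php
<?php
// Hypothetical helper (not from the thread): remove a quoted onload="..." or
// onload='...' attribute, including the whitespace in front of it.
function strip_onload(string $html): string
{
    // "[^"]*" and '[^']*' stop at the matching closing quote, so a "<" or ">"
    // inside the JavaScript and any attributes after it are left alone.
    return preg_replace('/\s+onload\s*=\s*(?:"[^"]*"|\'[^\']*\')/i', '', $html);
}

$tag = '<img onload="a()" alt="x">';

// The [^>]* version also swallows the alt attribute:
echo preg_replace('/onload\s*=\s*"[^>]*"/i', '', $tag), "\n"; // <img >

// The quote-aware version keeps it:
echo strip_onload($tag), "\n"; // <img alt="x">
```

This sketch does not handle unquoted values, and it would also remove text that merely looks like an onload attribute outside a tag.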
Posted: Fri Aug 27, 2004 12:20 pm
by feyd
you could create your own strip_tags..

(untested)
Code:
<?php
$stripped = preg_replace('#</?.*?>#','',$text);
?>
Posted: Fri Aug 27, 2004 12:50 pm
by tomfra
This seems to be working too. I only made a little modification:
Code:
$stripped = preg_replace('#</?.*?\>#','',$text);
I don't know whether it will alter the functionality in some situations, though, because I know next to nothing about regex. I added the backslash because my PsPad couldn't highlight the code properly: it thought the PHP code ended at the ? > tag.
One has to wonder why the strip_tags function even exists in PHP when it's buggy and can be replaced with a one-liner. I guess it may be because strip_tags is faster?
Tomas
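A quick check of the backslash question, as a sketch: in PCRE a backslash in front of a non-alphanumeric character just makes that character literal, so `\>` and `>` should behave the same here. The sample string is made up for illustration.

```php
<?php
// Made-up sample input, not from the thread.
$text = '<p>This is <b>some</b> text</p>';

$plain   = preg_replace('#</?.*?>#', '', $text);   // original pattern
$escaped = preg_replace('#</?.*?\>#', '', $text);  // pattern with the escaped ">"

var_dump($plain === $escaped); // bool(true)
echo $plain, "\n";             // This is some text
```

The `/?` is redundant in both versions, because the lazy `.*?` would match the slash anyway.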
Posted: Fri Aug 27, 2004 12:57 pm
by feyd
strip_tags is there for people who know nothing about the regular expression functions. It is also useful because it is faster (because of compiled code), and it will leave tags alone if you pass them to it as allowed tags.
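A short sketch of the allowed-tags behaviour mentioned above, using the optional second argument of strip_tags. The sample markup is made up for illustration.

```php
<?php
// Made-up sample input, not from the thread.
$html = '<p>Hello <b>world</b>, see <a href="page.html">this page</a>.</p>';

// No second argument: every tag is removed.
echo strip_tags($html), "\n";
// Hello world, see this page.

// Second argument: tags listed there are kept, attributes included.
echo strip_tags($html, '<b><a>'), "\n";
// Hello <b>world</b>, see <a href="page.html">this page</a>.
```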
Posted: Wed Sep 01, 2004 4:09 pm
by tomfra
Ok, one very similar problem... sorry
This code has a problem when using the regex above:
Code:
<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45"></A></TD>
It filters out everything up to the ">50" part but thinks that this part is plain text:
Code:
50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">
It's obviously because of the unexpected ">" sign. Is there a way to filter out all the code, including the part after ">50"?
Thanks!
Tomas
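The leftover described above can be reproduced with a short script. The second pattern is one possible way around it, offered as an assumption rather than anything posted in the thread: inside a tag it consumes either a single non-quote character or a complete quoted string, so a ">" inside an attribute value no longer ends the match.

```php
<?php
$html = <<<'HTML'
<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45"></A></TD>
HTML;

// Simple pattern: the IMG match ends at the ">" of "width>50",
// so the rest of the tag is left behind as if it were text.
$leftover = preg_replace('#</?.*?>#', '', $html);
echo trim($leftover), "\n";

// Quote-aware pattern (an assumption, not from the thread).
$clean = preg_replace('#<(?:[^>"\']|"[^"]*"|\'[^\']*\')*>#', '', $html);
var_dump(trim($clean)); // string(0) ""
```

The quote-aware version assumes quotes inside tags are balanced. Malformed markup with a stray quote would defeat it.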
Posted: Wed Sep 01, 2004 4:27 pm
by feyd
what have you tried so far to fix it on your own?
Posted: Wed Sep 01, 2004 5:11 pm
by tomfra
I've tried playing with the regex a little but when it comes to regex I am a total newbie so no luck there. I can't think of any "clean" solution. I think I will have to use one regex to get rid of the javascript code and then use the other regex. But that doesn't sound great either, I'd like to keep as little regex code as possible so that it doesn't slow down everything too much.
Tomas
Posted: Wed Sep 01, 2004 5:19 pm
by feyd
regex is generally very fast, unless written very very poorly.
Code:
<?php
$test = '<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">testestets>blah</A></TD>';
echo htmlentities(preg_replace('#<.*?(\s+[\w\W]+?(\s*=\s*([\'"]?).*?\\3))*?>#s','',$test),ENT_QUOTES);
?>
outputs
I'd suggest building a list of valid tags so the matching is a bit more correct in only stripping actual tags. However, that adds a bit to maintenance.
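One way the list-of-valid-tags idea might look, as a rough sketch. The function name `strip_known_tags` and the quote-aware attribute matching are assumptions for illustration, not something posted in the thread.

```php
<?php
// Hypothetical helper: strip only the tags named in $tags (case-insensitive).
function strip_known_tags(string $html, array $tags): string
{
    // Build an alternation such as "a|b|img" from the tag list.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '#');
    }, $tags));

    // \b stops "a" from matching "<abbr>"; the group after it skips over
    // quoted attribute values so a ">" inside them does not end the tag.
    $pattern = '#</?(?:' . $names . ')\b(?:[^>"\']|"[^"]*"|\'[^\']*\')*>#i';

    return preg_replace($pattern, '', $html);
}

echo strip_known_tags('<b>5 < 7</b> and <a href="#">9 > 8</a>', ['a', 'b']), "\n";
// 5 < 7 and 9 > 8
```

Because only listed names are matched, a bare "<" or ">" in ordinary text is left alone. The cost is the maintenance of the list mentioned above.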
Posted: Wed Sep 01, 2004 6:19 pm
by tomfra
Actually, the regex without htmlentities seems to be working very well.
i.e.:
Code:
preg_replace('#<.*?(\s+[\w\W]+?(\s*=\s*([\'"]?).*?\\3))*?\>#s','',$test)
Using the above example it will return this string:
testestets>blah
Which is completely correct because the ">" sign is a part of the text, but it gets rid of the ">" sign in the javascript. I am testing everything now and so far haven't found a problem with it.
I'll really have to learn all this regex stuff...
Thanks again!
Tomas
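For anyone who wants to re-run the final pattern, here is a small self-contained script using the test string from the thread. The `trim()` call is an addition: the line break between the A and IMG tags is not part of any tag and stays in the result.

```php
<?php
$test = '<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">testestets>blah</A></TD>';

// Pattern as posted; \\3 refers back to the opening quote of each attribute value.
$pattern = '#<.*?(\s+[\w\W]+?(\s*=\s*([\'"]?).*?\\3))*?>#s';
$stripped = preg_replace($pattern, '', $test);

echo trim($stripped), "\n";                           // testestets>blah
echo htmlentities(trim($stripped), ENT_QUOTES), "\n"; // testestets&gt;blah
```

The second line shows why the htmlentities version looks different: it only escapes the ">" for display in a browser.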