Removing nested JavaScript from HTML tags with a Perl regex

cyril11
Forum Newbie
Posts: 4
Joined: Tue Jul 01, 2003 4:59 pm

Post by cyril11 »

Hi,
I am looking for a Perl regular expression that matches on* attributes (e.g. onclick, onmouseover, etc.) in any HTML tag, whatever case they are written in.
Does anyone have one by chance? I tried to write my own, but regexes are really not my cup of tea.
Thanks a lot.
Cyril
m3rajk
DevNet Resident
Posts: 1191
Joined: Mon Jun 02, 2003 3:37 pm

Post by m3rajk »

lol. i recently made one.

how i created mine: i wrote out all the inline event handlers i know, then noticed that:
they all start with "on"
after the "on", the names run between 4 and 9 characters
they are all alphabetic characters
case is irrelevant
the value on the right side of the = must be wrapped in ""

therefore you need on\w{4,9} to start the pattern. but that isn't enough: what if new ones are added, or i missed any? the one thing that's certain is that it starts with on, so start the pattern with (on\w+)

now what's next? optional whitespace, then a required =, then more optional whitespace: (\s*=\s*)

and then there's the value side, whose boundaries are " and ". so you need the first ", then everything that isn't a ", then the closing ": ("[^"]*")

i don't like to just hand over code without the person understanding what's behind it, which is why i laid it out like this. you now have the pattern using the perl shorthand classes, and you should be able to convert it to posix if you want.

and unlike giving you the code straight out, this should show you what you need by section.

also, if you're that lazy, i think it's in one of my previous posts... or maybe all i did was link to it so you missed it
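
Putting the three pieces above together gives a single case-insensitive pattern. The following is only a minimal sketch in Python, whose `re` module accepts the same Perl-style syntax for these constructs. The names `ON_ATTR` and `strip_on_attrs` are made up for illustration, and the leading `\s+` is an added assumption that was not part of the pieces described above. Like those pieces, it only handles double-quoted values.

```python
import re

# on\w+   : attribute name starting with "on"
# \s*=\s* : equals sign with optional whitespace on either side
# "[^"]*" : a double-quoted value
# The leading \s+ (an addition here) requires whitespace before the name,
# so that e.g. zone="x" is not mistaken for an on* attribute.
ON_ATTR = re.compile(r'\s+on\w+\s*=\s*"[^"]*"', re.IGNORECASE)

def strip_on_attrs(html: str) -> str:
    """Remove double-quoted on* attributes from an HTML string."""
    return ON_ATTR.sub("", html)

print(strip_on_attrs('<a href="#" onClick="evil()">link</a>'))
# prints: <a href="#">link</a>
```

The `re.IGNORECASE` flag plays the role of the `/i` modifier in Perl, which covers the "case is irrelevant" point.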
cyril11
Forum Newbie
Posts: 4
Joined: Tue Jul 01, 2003 4:59 pm

Post by cyril11 »

Beware of loose HTML code!
I have just found out that attribute values do not even need to be in quotes (which can be single or double quotes, by the way) to be parsed. In IE5, for instance,

Code:

<p onmouseover=alert('boo')>
is treated as a valid event handler, so this needs to be removed by the regex as well!
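
A quick way to check this, sketched in Python (its `re` syntax matches Perl for these constructs): a pattern that insists on double quotes around the value does not see the unquoted handler at all. The variable names are illustrative only.

```python
import re

# Pattern that only accepts a double-quoted value.
quoted_only = re.compile(r'on\w+\s*=\s*"[^"]*"', re.IGNORECASE)

tag = "<p onmouseover=alert('boo')>"

# The value is not wrapped in double quotes, so nothing matches.
found = quoted_only.search(tag)
print(found)  # prints: None
```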
Here is my latest regex:

Code:

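<?php

// Sample list only; the full list of event attributes is on the W3C web site.
define("DISALLOWED_ATTRIBUTE_LIST", "onblur|onchange|onclick|ondblclick");

// Test $str (the HTML to check, set elsewhere) for disallowed attributes.
// The value may be double-quoted, single-quoted, or not quoted at all.
$bErr = 0;
if (preg_match_all('/(' . DISALLOWED_ATTRIBUTE_LIST . ')\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]*)/i', $str, $aAttriMatch)) {
  $aError =& $aAttriMatch[0]; // full text of every offending attribute
  $bErr = 1;
}

?>

P.S. m3rajk, thank you for your help.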
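
For comparison, here is a rough Python sketch of the same idea that removes the attributes instead of only detecting them. It is not a drop-in replacement for the PHP above: the names `DISALLOWED`, `VALUE`, `ATTR_RE` and `remove_disallowed` are invented for this example, the attribute list is the same short sample, and the leading `\s+` is an added assumption so the space before the attribute is removed with it.

```python
import re

# Sample list only, mirroring the one in the PHP snippet.
DISALLOWED = "onblur|onchange|onclick|ondblclick"

# The three ways a value can be written.
VALUE = "|".join([
    r'"[^"]*"',   # double-quoted
    r"'[^']*'",   # single-quoted
    r"[^\s>]*",   # unquoted: everything up to whitespace or >
])

ATTR_RE = re.compile(
    r"\s+(?:%s)\s*=\s*(?:%s)" % (DISALLOWED, VALUE),
    re.IGNORECASE,
)

def remove_disallowed(html: str) -> str:
    """Strip the listed event attributes, whatever quoting they use."""
    return ATTR_RE.sub("", html)

sample = """<a onclick="a()" href="#">x</a><b onBlur='b()'>y</b><i ondblclick=c()>z</i>"""
print(remove_disallowed(sample))
# prints: <a href="#">x</a><b>y</b><i>z</i>
```

One limitation of the unquoted branch: it stops at the first whitespace, so an unquoted value that itself contains a space would only be partly removed.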