
Compose JavaScript Regular Expression

Posted: Fri Aug 13, 2010 3:44 pm
by JunbinDuan
Hi,

I have an HTML document that contains the following piece:

<THEAD>
<TR style="DISPLAY: none">
....
</TR></THEAD>

I need to remove this piece from the document. I wrote the following JavaScript (editorContent holds the HTML document):

var re = new RegExp("<THEAD>[.\s]*<TR style=\"DISPLAY: none\">[.\s]*</TR></THEAD>", "gi");
editorContent = editorContent.replace(re, ";");

But it does not seem to work. Does anybody know what is wrong with the expression?

Thank you very much,
Junbin

Re: Compose JavaScript Regular Expression

Posted: Fri Aug 13, 2010 3:54 pm
by prometheuzz
Note that "does not seem to work" is a rather vague problem description.

But I think I can make an educated guess at what you mean: inside a character class, the dot loses its special meaning. It no longer matches any character except line breaks; it matches only the literal '.'

So, instead of [.\s]*, try [\S\s]*
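
The difference can be checked with a small sketch. The sample string below is an assumption made up for illustration (the original post elides the real markup with "...."), and regex literals are used here only to keep the demonstration free of string-escaping issues:

```javascript
// Hypothetical sample input; the <TD>x</TD> row content is invented for illustration.
const sample = '<THEAD>\n<TR style="DISPLAY: none">\n<TD>x</TD>\n</TR></THEAD>';

// Inside a character class the dot is a literal '.', so [.\s]* only matches
// runs of dots and whitespace and stops at the first other character.
const dotClass = /<THEAD>[.\s]*<TR style="DISPLAY: none">[.\s]*<\/TR><\/THEAD>/gi;

// [\S\s] means "non-whitespace or whitespace", i.e. any character at all,
// including line breaks.
const anyChar = /<THEAD>[\S\s]*<TR style="DISPLAY: none">[\S\s]*<\/TR><\/THEAD>/gi;

const dotClassMatches = dotClass.test(sample);
const anyCharMatches = anyChar.test(sample);

console.log(dotClassMatches); // false: "<TD>" is neither a dot nor whitespace
console.log(anyCharMatches);  // true
```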

Re: Compose JavaScript Regular Expression

Posted: Fri Aug 13, 2010 4:17 pm
by JunbinDuan
Thanks for the reply. I have updated my script to the following:

var re = new RegExp(";[^;]*ResizeImage.this.;", "gi");
editorContent = editorContent.replace(re, ";");

re = new RegExp("<THEAD>[\S\s]*<TR style=\"DISPLAY: none\">[\S\s]*</TR></THEAD>", "gi");
editorContent = editorContent.replace(re, "");

The first expression works, but the second does not - it did not change the value of editorContent. Any ideas?

Thanks.

Re: Compose JavaScript Regular Expression

Posted: Sat Aug 14, 2010 1:42 am
by prometheuzz
JunbinDuan wrote: "... Any ideas?"
Yes: then there is no string in your input that matches your regex.

Re: Compose JavaScript Regular Expression

Posted: Sat Aug 14, 2010 6:58 am
by JunbinDuan
Of course the input has the string.

thx

Re: Compose JavaScript Regular Expression

Posted: Sat Aug 14, 2010 8:05 am
by prometheuzz
JunbinDuan wrote: "Of course the input has the string."
Well, if that were the case, there would be a match, right?

Again: saying "it doesn't work" tells us absolutely nothing about the problem you're facing! For the people who do have the patience to try to help you, I suggest you post the actual input (or the relevant part of it) that you think should match but doesn't.

Best of luck.

Re: Compose JavaScript Regular Expression

Posted: Sun Aug 15, 2010 10:29 pm
by ridgerunner
One reason your regex is not working is that it is specified as a string. In a JavaScript string literal you must write two backslashes to get one backslash into the regex. A single backslash in front of a character that does not form a recognized string escape sequence is simply discarded, so the [\S\s] in your regex string reaches the regex engine as [Ss]. Your regex should therefore work quite a bit better written like this:

Code:

re = new RegExp("<THEAD>[\\S\\s]*<TR style=\"DISPLAY: none\">[\\S\\s]*</TR></THEAD>", "gi");
editorContent = editorContent.replace(re, "");
Or better yet, use a native Javascript literal regex to specify your pattern like so:

Code:

re = /<THEAD>[\S\s]*<TR style="DISPLAY: none">[\S\s]*<\/TR><\/THEAD>/gi;
editorContent = editorContent.replace(re, "");
With this syntax you don't need to escape the quotes or the backslashes, but you do need to escape the forward slashes, which are used as the delimiters of JavaScript regex literals.
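
A quick way to see what the regex engine actually receives is to inspect the `source` property of each RegExp. This is only a minimal sketch; the variable names are made up for illustration:

```javascript
// In a string literal, "\S" and "\s" are not recognized escape sequences,
// so the backslashes are dropped before RegExp ever sees the pattern.
const fromSingle = new RegExp("[\S\s]*");

// Doubling each backslash puts one real backslash into the pattern.
const fromDouble = new RegExp("[\\S\\s]*");

// A regex literal needs no doubling at all.
const fromLiteral = /[\S\s]*/;

console.log(fromSingle.source);  // [Ss]*
console.log(fromDouble.source);  // [\S\s]*
console.log(fromLiteral.source); // [\S\s]*
```

Logging `re.source` like this is a cheap sanity check whenever a pattern built from a string refuses to match.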

Another important point: your regex will run much faster and match more accurately if you use lazy quantifiers. The greedy [\S\s]* first consumes everything up to the end of the document and must then (slowly) backtrack one character at a time until the rest of the pattern matches. Your regex has two of these, which can also lead to catastrophic backtracking on a malformed HTML document. Worse, if your markup contains more than one table with this kind of THEAD, the regex will erroneously delete everything from the first THEAD start tag to the last THEAD end tag, which is probably not what you want! To fix this potential problem, add the lazy '?' modifier to both * quantifiers. There is also no need to match the closing TR tag:

Code:

re = /<THEAD>[\S\s]*?<TR style="DISPLAY: none">[\S\s]*?<\/THEAD>/ig;
editorContent = editorContent.replace(re, "");
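
To illustrate the multi-table problem, here is a small sketch comparing the greedy and lazy versions. The two-table document is a hypothetical input invented for this example, not the original poster's markup:

```javascript
// Hypothetical document: two tables, each with a hidden header row.
const doc =
  '<TABLE><THEAD>\n<TR style="DISPLAY: none"><TD>a</TD></TR></THEAD>' +
  '<TBODY><TR><TD>keep me</TD></TR></TBODY></TABLE>' +
  '<TABLE><THEAD>\n<TR style="DISPLAY: none"><TD>b</TD></TR></THEAD></TABLE>';

const greedy = /<THEAD>[\S\s]*<TR style="DISPLAY: none">[\S\s]*<\/THEAD>/gi;
const lazy = /<THEAD>[\S\s]*?<TR style="DISPLAY: none">[\S\s]*?<\/THEAD>/gi;

// Greedy: one match running from the first <THEAD> to the last </THEAD>,
// so the TBODY of the first table is deleted as well.
const greedyResult = doc.replace(greedy, "");

// Lazy: two separate matches, one per THEAD; the TBODY survives.
const lazyResult = doc.replace(lazy, "");

console.log(greedyResult); // <TABLE></TABLE>
console.log(lazyResult);   // <TABLE><TBODY><TR><TD>keep me</TD></TR></TBODY></TABLE><TABLE></TABLE>
```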
But this can be improved further. Assuming that the TR element immediately follows the THEAD start tag (with only optional whitespace in between), we can replace the first [\S\s]*? with simply \s*, and this will eliminate any possible catastrophic backtracking. Also, the TR start tag may have attributes other than STYLE, and the style attribute value itself may contain declarations other than just "DISPLAY: none". The following regex is quite a bit more complex, but I think you'll find that it matches better and runs much faster:

Code:

re = /<THEAD\b[^>]*>\s*<TR\b[^>]*?style="[^"]*?DISPLAY:\s*none[^"]*"[^>]*>[^<]*(?:(?!<\/THEAD>)<[^<]*)*<\/THEAD>/ig;
editorContent = editorContent.replace(re, "");
It also allows the THEAD tag to have attributes and uses the efficient "unrolling-the-loop" technique for matching the bulk of the stuff between the tags.
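
As a rough check of that last pattern, the sketch below runs it against a made-up page. The attribute values, class names, and cell contents are assumptions for illustration only. The first THEAD has extra attributes and an extra style declaration and should be removed; the second has no "DISPLAY: none" and should be left alone:

```javascript
// Hypothetical input: one hidden header row and one visible header row.
const page =
  '<TABLE><THEAD class="hdr">\n' +
  '<TR id="r1" style="COLOR: red; DISPLAY: none"><TH>hidden</TH></TR></THEAD></TABLE>' +
  '<TABLE><THEAD><TR style="COLOR: red"><TH>visible</TH></TR></THEAD></TABLE>';

// The pattern from the post above, unchanged.
const re = /<THEAD\b[^>]*>\s*<TR\b[^>]*?style="[^"]*?DISPLAY:\s*none[^"]*"[^>]*>[^<]*(?:(?!<\/THEAD>)<[^<]*)*<\/THEAD>/ig;

const cleaned = page.replace(re, "");

console.log(cleaned);
// <TABLE></TABLE><TABLE><THEAD><TR style="COLOR: red"><TH>visible</TH></TR></THEAD></TABLE>
```

Note that the pattern only recognizes a style attribute written with double quotes, so markup using single quotes or unquoted values would need a different pattern.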

Hope this helps!
:)

Re: Compose JavaScript Regular Expression

Posted: Mon Aug 16, 2010 10:25 am
by JunbinDuan
Hi Ridgerunner,

Your suggestion worked perfectly. Thank you very much for your patience and expertise. You are a star.

Junbin