Compose JavaScript Regular Expression

JunbinDuan
Forum Newbie
Posts: 4
Joined: Fri Aug 13, 2010 3:28 pm

Post by JunbinDuan »

Hi,

I have an HTML document that contains the following piece:

<THEAD>
<TR style="DISPLAY: none">
....
</TR></THEAD>

I need to get rid of this piece from the document. I wrote the following JavaScript (editorContent holds the HTML document):

var re = new RegExp("<THEAD>[.\s]*<TR style=\"DISPLAY: none\">[.\s]*</TR></THEAD>", "gi");
editorContent = editorContent.replace(re, ";");

But it does not seem to work. Does anybody know what is wrong with the expression?

Thank you very much,
Junbin
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Post by prometheuzz »

Note that "does not seem to work" is a rather vague problem description.

But I think I can make an educated guess at what you mean: inside a character class, the dot loses its special meaning. It no longer matches any character (except line breaks); it matches only the literal '.'

So, instead of [.\s]*, try [\S\s]*
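
To illustrate the difference, here is a minimal sketch. The sample string is made up for the example and is not the poster's actual document:

```javascript
// Inside a character class, "." is a literal period, so [.\s] only matches
// periods and whitespace. [\S\s] matches any character, including newlines.
const sample = "<THEAD>\n<TR>x</TR></THEAD>"; // hypothetical input

// Fails: "<", "T", "R", ... are neither "." nor whitespace.
const dotClass = /<THEAD>[.\s]*<\/THEAD>/i;
// Succeeds: every character is either whitespace or non-whitespace.
const anyClass = /<THEAD>[\S\s]*<\/THEAD>/i;

const dotClassMatches = dotClass.test(sample);
const anyClassMatches = anyClass.test(sample);
console.log(dotClassMatches, anyClassMatches); // false true
```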
JunbinDuan
Forum Newbie
Posts: 4
Joined: Fri Aug 13, 2010 3:28 pm

Post by JunbinDuan »

Thanks for the reply. I have updated my scripts to the following:

var re = new RegExp(";[^;]*ResizeImage.this.;", "gi");
editorContent = editorContent.replace(re, ";");

re = new RegExp("<THEAD>[\S\s]*<TR style=\"DISPLAY: none\">[\S\s]*</TR></THEAD>", "gi");
editorContent = editorContent.replace(re, "");

The first expression works, but the second does not - it did not change the value of editorContent. Any ideas?

Thanks.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Post by prometheuzz »

JunbinDuan wrote: "... Any ideas?"
Yes: then there is no string in your input that matches your regex.
JunbinDuan
Forum Newbie
Posts: 4
Joined: Fri Aug 13, 2010 3:28 pm

Post by JunbinDuan »

Of course the input has the string.

thx
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Post by prometheuzz »

JunbinDuan wrote: "Of course the input has the string."
Well, if that were the case, there would be a match, right?

Again: saying "it doesn't work" tells us absolutely nothing about the problem you're facing! For the sake of people who do have the patience to try to help you, I suggest you post the actual input (or the relevant part of it) that you think should match but doesn't.

Best of luck.
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Post by ridgerunner »

One reason your regex is not working is that it is specified in a string. In a JavaScript string literal you must write two backslashes to get one backslash into the regex. A single backslash in front of a character that does not form a recognized string escape sequence is simply discarded (i.e. the [\S\s] in your regex string is seen by the regex engine as [Ss]). Thus, your regex should work quite a bit better written like this:

re = new RegExp("<THEAD>[\\S\\s]*<TR style=\"DISPLAY: none\">[\\S\\s]*</TR></THEAD>", "gi");
editorContent = editorContent.replace(re, "");
Or better yet, use a native JavaScript regex literal to specify your pattern, like so:

re = /<THEAD>[\S\s]*<TR style="DISPLAY: none">[\S\s]*<\/TR><\/THEAD>/gi;
editorContent = editorContent.replace(re, "");
With this syntax you don't need to escape the quotes or the backslashes (but you do need to escape the forward slashes, which are used as delimiters for JavaScript regex literals).
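
One way to see the string-escaping effect for yourself is to inspect the `source` property of each RegExp. This is only a small sketch; the variable names are made up for the example:

```javascript
// The string literal "[\S\s]" loses its backslashes before RegExp ever sees it.
const fromString = new RegExp("[\S\s]");
// Doubling the backslashes keeps one of each in the resulting pattern.
const fromEscaped = new RegExp("[\\S\\s]");
// A regex literal needs no doubling.
const fromLiteral = /[\S\s]/;

console.log(fromString.source);  // [Ss]
console.log(fromEscaped.source); // [\S\s]
console.log(fromLiteral.source); // [\S\s]

// The broken pattern only matches the letters "S" and "s".
console.log(fromString.test("<"), fromEscaped.test("<")); // false true
```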

Another important point: your regex will run much faster and be more accurate if you use lazy quantifiers. The greedy /[\S\s]*/ matches everything all the way to the end of the document and must then (slowly) backtrack one character at a time to get back. (And your regex has two of these, which can also lead to catastrophic backtracking if you have a malformed HTML document.) Worse, if you have more than one table in your HTML markup with these types of THEADs, the regex will erroneously delete everything between the first THEAD and the last THEAD (which is probably not what you want!). To fix this potential problem, add the lazy '?' modifier to both of the * quantifiers, like so (also, there is no need to match the closing TR tag):

re = /<THEAD>[\S\s]*?<TR style="DISPLAY: none">[\S\s]*?<\/THEAD>/ig;
editorContent = editorContent.replace(re, "");
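
As a rough illustration of the greedy versus lazy difference, the sketch below runs both patterns over a made-up two-table document (the `twoTables` string is an assumption, not the poster's real markup):

```javascript
// Hypothetical input: two tables, each with a hidden header row.
const twoTables =
  '<TABLE><THEAD>\n<TR style="DISPLAY: none"><TH>a</TH></TR></THEAD>' +
  '<TBODY><TR><TD>keep 1</TD></TR></TBODY></TABLE>' +
  '<TABLE><THEAD>\n<TR style="DISPLAY: none"><TH>b</TH></TR></THEAD>' +
  '<TBODY><TR><TD>keep 2</TD></TR></TBODY></TABLE>';

const greedy = /<THEAD>[\S\s]*<TR style="DISPLAY: none">[\S\s]*<\/THEAD>/gi;
const lazy = /<THEAD>[\S\s]*?<TR style="DISPLAY: none">[\S\s]*?<\/THEAD>/gi;

// Greedy: one match spanning the first <THEAD> to the last </THEAD>,
// so the first table's body is deleted as well.
const greedyResult = twoTables.replace(greedy, "");
// Lazy: two separate matches, one per THEAD; both bodies survive.
const lazyResult = twoTables.replace(lazy, "");

console.log(greedyResult.includes("keep 1")); // false
console.log(lazyResult.includes("keep 1"));   // true
```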
But this can be improved further. Assuming that the TR element immediately follows the THEAD (with some possible whitespace in between), we can replace the first [\S\s]*? with simply \s*, and this will eliminate any possible catastrophic backtracking. Also, the TR start tag may have attributes other than just STYLE (and the style attribute value itself may have declarations other than just "DISPLAY: none"). The following regex is quite a bit more complex, but I think you'll find that it matches better and runs much faster:

re = /<THEAD\b[^>]*>\s*<TR\b[^>]*?style="[^"]*?DISPLAY:\s*none[^"]*"[^>]*>[^<]*(?:(?!<\/THEAD>)<[^<]*)*<\/THEAD>/ig;
editorContent = editorContent.replace(re, "");
It also allows the THEAD tag to have attributes and uses the efficient "unrolling-the-loop" technique for matching the bulk of the stuff between the tags.
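
For anyone who wants to try it, here is a small sketch that applies the regex above to an invented snippet (the `html` string, with its extra attributes and a second, visible THEAD, is purely an assumption for the demo):

```javascript
const re = /<THEAD\b[^>]*>\s*<TR\b[^>]*?style="[^"]*?DISPLAY:\s*none[^"]*"[^>]*>[^<]*(?:(?!<\/THEAD>)<[^<]*)*<\/THEAD>/ig;

// Hypothetical input: the first THEAD has a hidden row (plus extra attributes
// and an extra style declaration); the second THEAD is visible and must stay.
const html =
  '<TABLE><THEAD class="hdr">\n' +
  '<TR id="r1" style="COLOR: red; DISPLAY: none"><TH>hidden</TH></TR></THEAD>' +
  '<TBODY><TR><TD>body</TD></TR></TBODY></TABLE>' +
  '<TABLE><THEAD><TR><TH>visible</TH></TR></THEAD></TABLE>';

const cleaned = html.replace(re, "");

console.log(cleaned.includes("hidden"));  // false
console.log(cleaned.includes("visible")); // true
```

Note that the pattern only looks at the first TR inside each THEAD, which follows from the stated assumption that the hidden row comes immediately after the THEAD start tag.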

Hope this helps!
:)
JunbinDuan
Forum Newbie
Posts: 4
Joined: Fri Aug 13, 2010 3:28 pm

Post by JunbinDuan »

Hi Ridgerunner,

Your suggestion just worked perfectly. Thank you very much for your patience and expertise. You are the star.

Junbin