optimise regex: ungreedy OR lookahead?

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

theYinYeti
Forum Newbie
Posts: 15
Joined: Thu Oct 26, 2006 3:33 pm
Location: France

Post by theYinYeti »

Hello,

This is my first post. I've written a PHP framework for web site creation. I'm in the process of rewriting the whole framework with PHP5 (instead of 4) and while I'm at it, I want to optimise my regexps.
Among those is the regexp that parses HTML pages to extract the HEAD and the BODY. The part of the regex I want to rewrite is this one:

<head[^>]*>\s*((?U).*)\s*</head>\s*<body[^>]*>\s*(.*)</body>
I recently learned how to use (?...) constructs, and I wondered: which would be more efficient, the pattern above or this one:

<head[^>]*>\s*((?>(?:(?!\s*</head>).)*))\s*</head>\s*<body[^>]*>\s*((?>(?:(?!\</body>).)*))</body>
Please note that I don't mind overly complex regexps; I'm just looking for speed at execution time.

My idea here is to require that each character . is NOT the start (?!...) of the text that follows, e.g. </body>, and to further specify that the whole run of such characters (?:...)* is to be taken as the final match (?>...), i.e. there is no need to try a shorter string. The extra parentheses are for capturing.
As I see it, this is more or less what the ungreedy (?U) in the first regexp does, since PHP somehow has to read ahead to be ungreedy and stop where it should.
My feeling is that the second regexp is more optimized because:
- PHP does not know that there should be only one </head> and one </body>, so being able to skip a possible bad first match is useless,
- PHP is ungreedy in that it should stop if it encounters the whole rest of the regexp (lookahead up to the end) whereas a 7-character lookahead is enough.
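
The comparison described above can be sketched as a small PHP script. This is only a rough harness, not a benchmark result: the sample page, the ~ delimiters, the s modifier (so that . also matches newlines) and the timePattern() helper are all assumptions added for illustration, since the post shows only a fragment of the real regexp.

```php
<?php
// Hypothetical sample page; real input would be a complete HTML file.
$html = "<html><head>\n<title>Demo</title>\n</head>\n"
      . "<body class=\"x\">\n<p>Hello</p>\n</body></html>";

// First pattern from the post: ungreedy head capture, greedy body capture.
// The ~ delimiters and the s modifier are assumptions.
$ungreedy = '~<head[^>]*>\s*((?U).*)\s*</head>\s*<body[^>]*>\s*(.*)</body>~s';

// Second pattern from the post, copied as written: lookahead-guarded
// characters wrapped in atomic groups.
$tempered = '~<head[^>]*>\s*((?>(?:(?!\s*</head>).)*))\s*</head>'
          . '\s*<body[^>]*>\s*((?>(?:(?!\</body>).)*))</body>~s';

// Hypothetical helper: run one pattern repeatedly, return elapsed seconds.
function timePattern($pattern, $subject, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject, $unused);
    }
    return microtime(true) - $start;
}

$okUngreedy = preg_match($ungreedy, $html, $a);
$okTempered = preg_match($tempered, $html, $b);

// On this sample both patterns capture the same head and body text.
var_dump($a[1] === $b[1], $a[2] === $b[2]);

// Timings depend on the PHP/PCRE version and on the input, so none are quoted here.
printf("ungreedy: %.4fs\n", timePattern($ungreedy, $html, 10000));
printf("tempered: %.4fs\n", timePattern($tempered, $html, 10000));
```

One difference to keep in mind when reading the timings: the two patterns are not strictly equivalent. The first captures the body greedily, up to the last </body> in the input, while the second stops at the first </body>. They agree only when the page contains a single closing body tag, as in the sample.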

What do you think?
Besides, do you think I would get any significant gain by replacing each \s* with (?>\s*)?
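
What (?>\s*) changes can be shown with a minimal sketch. The test strings are made up for illustration. The atomic group stops the engine from giving whitespace back once it has been consumed, which only matters when a later token could itself match whitespace. Before a literal such as </head> both forms accept the same input, so any gain would come only from failing faster, and that is not measured here.

```php
<?php
// Plain \s*: the engine backtracks one space so the final \s can match.
$plain = preg_match('~^\s*\s$~', "   ");

// Atomic group: \s* keeps all three spaces, leaving nothing for the final \s.
$atomic = preg_match('~^(?>\s*)\s$~', "   ");

// Possessive quantifier: shorthand for the same atomic behaviour.
$possessive = preg_match('~^\s*+\s$~', "   ");

// Before a token that cannot be whitespace, both forms accept the same input.
$plainTag  = preg_match('~^\s*</head>~', "  \n</head>");
$atomicTag = preg_match('~^(?>\s*)</head>~', "  \n</head>");

var_dump($plain, $atomic, $possessive, $plainTag, $atomicTag);
// prints int(1) int(0) int(0) int(1) int(1)
```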
Thanks,

Yves.
printf
Forum Contributor
Posts: 173
Joined: Wed Jan 12, 2005 5:24 pm

Post by printf »

Where are you using it (preg_match, preg_replace, preg_match_all)? If you just want to extract the head (if there is one) and the body (if there is one), then I would combine the head and body handling and get rid of the extended expression altogether. I don't see why grouping or repeated stand-alone pattern matching is needed. It's overkill and uses more resources than are needed to return the content between one or two pairs of identically formatted tags, whichever of them is present.


printf
theYinYeti
Forum Newbie
Posts: 15
Joined: Thu Oct 26, 2006 3:33 pm
Location: France

Post by theYinYeti »

Thanks for the reply. In short, here's what I have in pseudo-code:

if <?xml ... encoding="..." ?> ... <head> ... </head> ... <body> ... </body> ...
  ...
elseif ... <head> ... </head> ... <body> ... </body> ...
  ...
else
  ...
A couple of "stripos" and "strripos" may indeed do a better job.
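
A minimal sketch of that stripos/strripos idea, assuming a hypothetical helper named extractElement() (not part of the framework discussed here). It takes the text between the first opening tag and the last closing tag, without any regexp.

```php
<?php
// Hypothetical helper: return the trimmed text between the first "<$tag ...>"
// and the last "</$tag>", or null if either is missing.
// Known limitations of this sketch: "<head" also matches "<header", and a ">"
// inside an attribute value would end the opening tag too early.
function extractElement($html, $tag)
{
    $open = stripos($html, '<' . $tag);
    if ($open === false) {
        return null;
    }
    // End of the opening tag, attributes included.
    $openEnd = strpos($html, '>', $open);
    if ($openEnd === false) {
        return null;
    }
    // Last closing tag, searched case-insensitively from the end.
    $close = strripos($html, '</' . $tag . '>');
    if ($close === false || $close < $openEnd) {
        return null;
    }
    return trim(substr($html, $openEnd + 1, $close - $openEnd - 1));
}

// Made-up sample input.
$page = "<HTML><head>\n<title>T</title>\n</head>"
      . "<body id=\"b\"> <p>Hi</p> </body></html>";

$head = extractElement($page, 'head');
$body = extractElement($page, 'body');
var_dump($head, $body);
```

Taking the last closing tag mirrors the greedy (.*) used for the body in the first regexp of this thread.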

I'm still interested in opinions about the initial post's question, as I have other regexps with the same issue, that won't be as disposable as this one.

Yves.
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Post by Mordred »

If you don't have control over the HTML files in question, be aware that you won't match

<  body>
for example.
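
This can be checked directly with a short sketch. The $tolerant pattern is a hypothetical variant, not something proposed in the thread, and the i modifier is an assumption. It only shows how optional whitespace after '<' could be allowed if such input had to be handled.

```php
<?php
// The opening-tag fragment used in the thread.
$strict = '~<body[^>]*>~i';

// Hypothetical tolerant variant: optional whitespace after "<", plus a word
// boundary so that a longer tag name starting with "body" is not accepted.
$tolerant = '~<\s*body\b[^>]*>~i';

$samples = array('<body>', '<body class="x">', '<  body>');
$results = array();
foreach ($samples as $s) {
    $results[$s] = array(preg_match($strict, $s), preg_match($tolerant, $s));
    printf("%-18s strict=%d tolerant=%d\n", $s, $results[$s][0], $results[$s][1]);
}
// Only the last sample differs: strict=0, tolerant=1.
```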
theYinYeti
Forum Newbie
Posts: 15
Joined: Thu Oct 26, 2006 3:33 pm
Location: France

Post by theYinYeti »

Yes, thank you Mordred. I'm aware of that.

As it stands, 99% of pages (if not more) generated by tools like Nvu, Dreamweaver, Bluefish, or even Frontpage, OpenOffice, MSOffice, have no space between the opening '<' and the tag name, nor between '<', '/', and the tag name in a closing tag.

I just want to know about the relative performance of the different regexps as explained in the first post.

Yves.