
optimise regex: ungreedy OR lookahead?

Posted: Thu Oct 26, 2006 4:14 pm
by theYinYeti
Hello,

This is my first post. I've written a PHP framework for web site creation. I'm in the process of rewriting the whole framework with PHP5 (instead of 4) and while I'm at it, I want to optimise my regexps.
Among those is the regexp that parses HTML pages to extract the HEAD and the BODY. The part of the regex I want to rewrite is this one:

Code:

<head[^>]*>\s*((?U).*)\s*</head>\s*<body[^>]*>\s*(.*)</body>
I recently learned how to use (?...) kinds of patterns, and I wondered: which would be more efficient, the above or this one?

Code:

<head[^>]*>\s*((?>(?:(?!\s*</head>).)*))\s*</head>\s*<body[^>]*>\s*((?>(?:(?!\</body>).)*))</body>
Please note that I don't mind overly complex regexps; I'm just looking for speed at execution time.

My idea here is to require that each character matched by . is NOT the start of the text that follows, e.g. </body> (this is the (?!...) lookahead), and to further specify that the run of all such consecutive characters, (?:...)*, is to be taken as the final match via (?>...), i.e. there is no need to try a shorter string. The extra parentheses are for capturing.
As I see it, this is more or less what the ungreedy (?U) in the first regexp does, since PHP somehow has to read ahead to be ungreedy and stop where it should.
My feeling is that the second regexp is more optimized because:
- PHP does not know that there should be only one </head> and one </body>, so allowing it to skip a possible bad first match is useless,
- an ungreedy match stops only once the whole rest of the regexp matches (in effect a lookahead up to the end), whereas a 7-character lookahead is enough.
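
One thing worth checking before timing anything is whether the two regexps capture the same text. Here is a minimal sketch, assuming the "is" modifiers (the post does not show them) and a made-up sample page with a stray </body> inside a comment. The \< of the second regexp is written as a plain < here, which should be equivalent.

```php
<?php
// Hypothetical sample page: note the second "</body>" inside the comment.
$html = '<html><head><title>T</title></head>'
      . '<body><p>one</p></body><!-- </body> --></html>';

// First regexp from the post: ungreedy HEAD, greedy BODY.
// The "is" modifiers are an assumption, not something the post states.
$lazy = '~<head[^>]*>\s*((?U).*)\s*</head>\s*<body[^>]*>\s*(.*)</body>~is';

// Second regexp: atomic groups around a lookahead-guarded dot.
$tempered = '~<head[^>]*>\s*((?>(?:(?!\s*</head>).)*))\s*</head>\s*'
          . '<body[^>]*>\s*((?>(?:(?!</body>).)*))</body>~is';

preg_match($lazy, $html, $a);
preg_match($tempered, $html, $b);

var_dump($a[1] === $b[1]); // bool(true): the HEAD captures agree
var_dump($a[2]);           // greedy BODY runs up to the last </body>
var_dump($b[2]);           // guarded BODY stops at the first </body>
```

On this sample the BODY captures differ, because (.*) in the first regexp is greedy while the lookahead version stops at the first </body>. On pages with a single </body> both should capture the same text.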

What do you think?
Besides, do you think I would see any significant gain by replacing each \s* with (?>\s*)?
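
For anyone who wants to measure rather than guess, a rough timing harness could look like the sketch below. The function name time_pattern, the sample page and the iteration count are all made up for illustration, and the "is" modifiers are an assumption. Numbers from a toy page like this say little about real pages, so treat it as a starting point only.

```php
<?php
// Time one pattern over repeated matches; returns elapsed seconds.
// time_pattern() is a hypothetical helper, not part of PHP.
function time_pattern($pattern, $subject, $iterations = 2000)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        if (preg_match($pattern, $subject) !== 1) {
            throw new RuntimeException("no match: $pattern");
        }
    }
    return microtime(true) - $start;
}

// Hypothetical test page; use real pages for meaningful numbers.
$page = "<html><head>\n<title>T</title>\n</head>\n<body>\n"
      . str_repeat("<p>some text</p>\n", 50)
      . "</body></html>";

$patterns = array(
    // first regexp from the post
    'ungreedy' =>
        '~<head[^>]*>\s*((?U).*)\s*</head>\s*<body[^>]*>\s*(.*)</body>~is',
    // second regexp from the post (\< written as <)
    'lookahead' =>
        '~<head[^>]*>\s*((?>(?:(?!\s*</head>).)*))\s*</head>\s*'
        . '<body[^>]*>\s*((?>(?:(?!</body>).)*))</body>~is',
    // same, with every \s* outside the lookahead made atomic
    'lookahead + (?>\s*)' =>
        '~<head[^>]*>(?>\s*)((?>(?:(?!\s*</head>).)*))(?>\s*)</head>(?>\s*)'
        . '<body[^>]*>(?>\s*)((?>(?:(?!</body>).)*))</body>~is',
);

$timings = array();
foreach ($patterns as $name => $pattern) {
    $timings[$name] = time_pattern($pattern, $page);
    printf("%-22s %.4f s\n", $name, $timings[$name]);
}
```
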
Thanks,

Yves.

Posted: Thu Oct 26, 2006 6:59 pm
by printf
Where are you using it (preg_match, preg_replace, preg_match_all)? If you just want to extract the head (if there is one) and the body (if there is one), then I would combine the head and body matching and get rid of the extended expression altogether. I don't see where grouping or repeated stand-alone pattern matching is needed. It's overkill and uses more resources than needed to return the content between one or two pairs of identically formatted tags, if one or the other is there!


printf

Posted: Fri Oct 27, 2006 5:41 am
by theYinYeti
Thanks for the reply. In short, here's what I have in pseudo-code:

Code:

if <?xml ... encoding="..." ?> ... <head> ... </head> ... <body> ... </body> ...
  ...
elseif ... <head> ... </head> ... <body> ... </body> ...
  ...
else
  ...
A couple of "stripos" and "strripos" may indeed do a better job.
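
A rough sketch of what that could look like, assuming a helper named extract_head_body (a made-up name). It is looser than the regexp: it does not require <body> to follow </head> directly, and like the regexp's <head[^>]*> it would also accept any tag name that merely starts with "head". trim()/ltrim() only approximate what \s* strips.

```php
<?php
// Hypothetical helper: returns array('head' => ..., 'body' => ...) or null.
function extract_head_body($html)
{
    // Find "<head ...>" case-insensitively, then the ">" that closes the tag.
    $headOpen = stripos($html, '<head');
    if ($headOpen === false) {
        return null;
    }
    $headStart = strpos($html, '>', $headOpen);
    if ($headStart === false) {
        return null;
    }
    $headStart++;

    // First "</head>" after the opening tag.
    $headEnd = stripos($html, '</head>', $headStart);
    if ($headEnd === false) {
        return null;
    }

    // First "<body ...>" after "</head>".
    $bodyOpen = stripos($html, '<body', $headEnd);
    if ($bodyOpen === false) {
        return null;
    }
    $bodyStart = strpos($html, '>', $bodyOpen);
    if ($bodyStart === false) {
        return null;
    }
    $bodyStart++;

    // Last "</body>" in the page, like the greedy (.*) of the first regexp.
    $bodyEnd = strripos($html, '</body>');
    if ($bodyEnd === false || $bodyEnd < $bodyStart) {
        return null;
    }

    return array(
        'head' => trim(substr($html, $headStart, $headEnd - $headStart)),
        'body' => ltrim(substr($html, $bodyStart, $bodyEnd - $bodyStart)),
    );
}

$parts = extract_head_body(
    "<html><HEAD>\n<title>T</title>\n</HEAD>\n"
    . "<body class=\"x\">\n<p>hi</p>\n</body></html>"
);
var_dump($parts);
```
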

I'm still interested in opinions about the initial post's question, as I have other regexps with the same issue, that won't be as disposable as this one.

Yves.

Posted: Fri Oct 27, 2006 9:36 am
by Mordred
If you don't have control over the HTML files in question, be aware that you won't match

Code:

<  body>
for example.
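
A quick way to see this is sketched below. The $tolerant pattern is only a hypothetical variant that allows whitespace after "<", not something from the original regexp.

```php
<?php
$strict   = '~<body[^>]*>~i';       // the <body> part of the original regexp
$tolerant = '~<\s*body\b[^>]*>~i';  // hypothetical variant allowing "<  body>"

var_dump(preg_match($strict, '<  body>'));           // int(0)
var_dump(preg_match($tolerant, '<  body>'));         // int(1)
var_dump(preg_match($tolerant, '<body class="x">')); // int(1)
```
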

Posted: Mon Oct 30, 2006 2:20 am
by theYinYeti
Yes, thank you Mordred. I'm aware of that.

As it stands, 99% of pages (if not more) generated by tools like Nvu, Dreamweaver, Bluefish, or even Frontpage, OpenOffice, MSOffice have no space between the opening '<' and the tag name, nor between '<', '/' and the tag name in a closing tag.

I just want to know about the relative performance of the different regexps as explained in the first post.

Yves.