optimise regex: ungreedy OR lookahead?
Posted: Thu Oct 26, 2006 4:14 pm
Hello,
This is my first post. I've written a PHP framework for web site creation. I'm in the process of rewriting the whole framework in PHP5 (instead of PHP4) and, while I'm at it, I want to optimise my regexps.
Among those is the regexp that parses HTML pages to extract the HEAD and the BODY. The part of the regex I want to rewrite is this one:
Code:
<head[^>]*>\s*((?U).*)\s*</head>\s*<body[^>]*>\s*(.*)</body>
I recently learned how to use (?...) kinds of patterns, and I wondered: which would be the more efficient, the above or this:
Code:
<head[^>]*>\s*((?>(?:(?!\s*</head>).)*))\s*</head>\s*<body[^>]*>\s*((?>(?:(?!\</body>).)*))</body>
Please note that I don't mind overly complex regexps; I'm just looking for speed at execution time.
My idea here is to require that each character . is NOT the start (?!...) of the text that follows, e.g. </body>, and to further specify that the whole run of such characters (?:...)* is to be taken as the final match (?>...), i.e. there is no need to retry with a shorter string. The extra parentheses are for capturing.
As I view it, this is more or less what the ungreedy (?U) from the first regexp does, since PHP somehow has to read ahead to be ungreedy and stop where it should.
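To make the comparison concrete, here is a minimal sketch of how the two regexps above could be checked against each other. The sample page, the ~ delimiters and the s and i modifiers are my assumptions for the test (the post only quotes part of the full regex); the variable names are illustrative.

```php
<?php
// Sample page, invented for illustration only.
$html = "<html>\n<head>\n  <title>Demo</title>\n</head>\n"
      . "<body class=\"x\">\n  <p>Hello</p>\n</body>\n</html>";

// Assumption: '~' delimiters plus the s (dot matches newline) and
// i (case-insensitive) modifiers; neither is stated in the post.
$ungreedy  = '~<head[^>]*>\s*((?U).*)\s*</head>\s*<body[^>]*>\s*(.*)</body>~si';
$lookahead = '~<head[^>]*>\s*((?>(?:(?!\s*</head>).)*))\s*</head>'
           . '\s*<body[^>]*>\s*((?>(?:(?!\</body>).)*))</body>~si';

preg_match($ungreedy, $html, $a);
preg_match($lookahead, $html, $b);

// Compare what each variant captured for the head and for the body.
var_dump($a[1] === $b[1]);
var_dump($a[2] === $b[2]);
```

On a page with a single </head> and a single </body>, I would expect both variants to capture the same text. They can differ on the body when </body> appears more than once, since the greedy (.*) runs to the last one while the lookahead version stops at the first.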
My feeling is that the second regexp is more optimized because:
- PHP does not know that there is only one </head> and one </body>, so the ungreedy version's ability to skip past a possible bad first match is never needed,
- an ungreedy match stops only once the whole rest of the regexp matches (in effect a lookahead up to the end), whereas a 7-character lookahead is enough.
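Since this is only a feeling, a rough timing script may be the simplest way to test it. The sketch below is a minimal one under stated assumptions: time_pattern() is a hypothetical helper, the page is synthetic, and the ~ delimiters and s and i modifiers are my guesses. The numbers it prints will depend on the PHP and PCRE versions and on the real pages.

```php
<?php
// Hypothetical helper (not part of the framework): run preg_match()
// $runs times on the same subject and return the elapsed seconds.
function time_pattern($pattern, $subject, $runs)
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        preg_match($pattern, $subject);
    }
    return microtime(true) - $start;
}

// Synthetic page, invented for the test; real pages will behave differently.
$head = str_repeat("<meta name=\"k\" content=\"v\">\n", 50);
$body = str_repeat("<p>Lorem ipsum dolor sit amet</p>\n", 200);
$html = "<html><head>\n$head</head>\n<body>\n$body</body></html>";

// Assumption: '~' delimiters and the s and i modifiers.
$patterns = array(
    'ungreedy'  => '~<head[^>]*>\s*((?U).*)\s*</head>\s*<body[^>]*>\s*(.*)</body>~si',
    'lookahead' => '~<head[^>]*>\s*((?>(?:(?!\s*</head>).)*))\s*</head>'
                 . '\s*<body[^>]*>\s*((?>(?:(?!\</body>).)*))</body>~si',
);

$timings = array();
foreach ($patterns as $name => $pattern) {
    // Only time a pattern that actually matches the test page.
    if (preg_match($pattern, $html) !== 1) {
        echo "$name: no match\n";
        continue;
    }
    $timings[$name] = time_pattern($pattern, $html, 1000);
    printf("%-10s %.4f s\n", $name, $timings[$name]);
}
```

Running it several times, and on pages of different sizes, should give a fairer picture than a single run.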
What do you think?
Also, do you think I would see any significant gain by replacing each \s* with (?>\s*)?
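For reference, this is the substitution I have in mind, as a minimal sketch: every \s* in the first regexp is wrapped in an atomic group with str_replace(), and the result is checked against the original on a sample page. The page, the ~ delimiters and the s and i modifiers are assumptions made for the test.

```php
<?php
// Sample page, invented for illustration only.
$html = "<html>\n<head>\n  <title>Demo</title>\n</head>\n"
      . "<body>\n  <p>Hello</p>\n</body>\n</html>";

// Assumption: '~' delimiters and the s and i modifiers.
$plain = '~<head[^>]*>\s*((?U).*)\s*</head>\s*<body[^>]*>\s*(.*)</body>~si';

// Wrap every \s* in an atomic group, so the matched whitespace is
// never given back on backtracking.
$atomic = str_replace('\s*', '(?>\s*)', $plain);

preg_match($plain, $html, $m1);
preg_match($atomic, $html, $m2);

// Identical arrays mean the same full match and the same captures.
var_dump($m1 === $m2);
```

This only checks that the captures stay the same on one page; whether it is any faster would still have to be measured.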
Thanks,
Yves.