Extract PHP blocks?

Posted: Sat Feb 21, 2009 3:40 pm
by alex.barylski
I am trying to build a regex to extract all the PHP blocks within a source file. So far I have something like:

Code: Select all

 
$matches = array();
preg_match_all('#<\?php(.+)#', $source, $matches, PREG_SET_ORDER);
 
It doesn't work quite as expected: it seems to stop when it reaches the first line break, and it is obviously missing additional tests to be complete.

I would like it to be as robust as possible, taking into account that the last closing PHP tag is not required and that the opening tag can be either <? or <?php (<% is not required).

How do I make the <? required and the trailing 'php' optional without grouping the 'php' in a [] or ()?

Also, do I need to use $ to tell the regex to continue until the end of the source file or is this already greedy?

P.S. Extracting <script type="php"> is not required either. This isn't for security but for source code metrics.

Cheers,
Alex

Re: Extract PHP blocks?

Posted: Sat Feb 21, 2009 5:16 pm
by Weirdan
You need the dot to match across line breaks; that's what the 's' regexp modifier is for (the 'm' modifier only changes where ^ and $ match). For the 'php' part to be optional but not included in the captured matches, you need non-capturing parentheses: (?:something)? makes 'something' optional but does not capture it.
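
A minimal sketch of that advice, assuming a made-up $source string (it is not from the original post). It combines the non-capturing (?:php)? with the 's' modifier, so only the block body ends up in a capture group. The lazy .*? and the 'i' modifier are additions for the example.

```php
<?php
// Hypothetical sample input: two PHP blocks, the second uses the short tag.
$source = "<html>\n<?php echo 1;\necho 2; ?>\n<p>text</p>\n<? echo 3; ?>\n";

// (?:php)? makes "php" optional without creating a capture group.
// The "s" modifier lets "." match newlines; "i" makes "php" case-insensitive.
// .*? is lazy, so each match stops at the nearest closing tag.
preg_match_all('#<\?(?:php)?(.*?)\?>#si', $source, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    // $match[0] is the whole block, $match[1] the only capture: the body.
    echo trim($match[1]), "\n---\n";
}
```

Note that this sketch still requires a closing tag, and it would also pick up things like an <?xml declaration, since the opening tag test is just "<?".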

Re: Extract PHP blocks?

Posted: Sat Feb 21, 2009 5:19 pm
by semlar
PCSpectra wrote: How do I make the <? required and the trailing 'php' optional without grouping the 'php' in a [] or ()
You can't do that. The only reason I can think of for not wanting to group something is to avoid creating a backreference, which you can do like this: <\?(?:php)?

Your regex isn't matching the entire file because the dot character does not match line breaks by default. You can make the dot match newlines with the "s" flag, either inline like this: (?s:.+) or by setting it after the closing delimiter.
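
To make the difference concrete, here is a small sketch comparing the original pattern with both ways of setting the "s" flag. The two-line $source string is an assumption for illustration only.

```php
<?php
// Hypothetical two-line input.
$source = "<?php echo 1;\necho 2;";

// Without "s", "." stops at the first line break.
preg_match('#<\?php(.+)#', $source, $plain);

// Modifier after the closing delimiter: "." now matches newlines too.
preg_match('#<\?php(.+)#s', $source, $flagged);

// Inline form: the same setting, scoped to a single group.
preg_match('#<\?php((?s:.+))#', $source, $inline);

var_dump($plain[1]);    // only the first line
var_dump($flagged[1]);  // both lines
var_dump($inline[1]);   // both lines
```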

I think the pattern I would personally use for this is <\?(?i:php\s)?((?:[^?]|\?(?!>))+)\?>

Match <? literally
Optional "php" (case-insensitive) followed by a whitespace character
???
Profit
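
For anyone who wants to try the pattern above, here is a rough sketch that drops it into preg_match_all(). The $source string is invented for the example; the ternary in it is there to show that a lone "?" inside a block does not end the match.

```php
<?php
// Hypothetical input: one block containing a ternary "?", one short-tag block.
$source = '<?php $a = $b ? 1 : 2; ?>' . "\n" . '<? echo "x"; ?>';

// The pattern from the post, wrapped in "#" delimiters.
// [^?] also matches newlines, so no "s" modifier is needed here.
$pattern = '#<\?(?i:php\s)?((?:[^?]|\?(?!>))+)\?>#';

preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    // $match[1] holds the block body without the tags.
    echo trim($match[1]), "\n";
}
```

Like the pattern itself, this only finds blocks that have a closing tag.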

Re: Extract PHP blocks?

Posted: Sat Feb 21, 2009 5:40 pm
by alex.barylski
Thanks for the speedy replies...

Quick question before I go and use the above regex: does it work in preg_match_all(), or is it a POSIX regex?

Re: Extract PHP blocks?

Posted: Sat Feb 21, 2009 6:41 pm
by Weirdan
It's PCRE (preg_*). I don't think people use POSIX regexps nowadays :)

Re: Extract PHP blocks?

Posted: Sun Feb 22, 2009 3:37 pm
by prometheuzz
PCSpectra wrote: I am trying to build a regex to extract all the PHP blocks within a source file, I have something like:

Code: Select all

 
$matches = array();
preg_match_all('#<\?php(.+)#', $source, $matches, PREG_SET_ORDER);
 
It doesn't work quite as expected...
Be careful with those greedy DOT-PLUS thingies!

Try this:

Code: Select all

preg_match_all('#<\?(?:php)?(?:(?!\?>).)*\?>#si', $source, $matches, PREG_SET_ORDER);
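
The pattern above still needs a closing tag on every block. Since the original question says the last closing tag may be missing, one possible variation (a sketch, not something posted in this thread) is to accept either the closing tag or the end of the input, and to capture the body in a group. The $source string here is made up for the example.

```php
<?php
// Hypothetical input: the final block is never closed.
$source = "<p>a</p>\n<?php echo 1; ?>\n<p>b</p>\n<?php\necho 2;\n";

// Same tempered dot as above, now inside a capture group.
// (?:\?>|\z) accepts the closing tag or the very end of the input.
$pattern = '#<\?(?:php)?((?:(?!\?>).)*)(?:\?>|\z)#si';

preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

foreach ($matches as $i => $match) {
    // $match[1] is the block body; print it with its index.
    printf("block %d: %s\n", $i, trim($match[1]));
}
```

This is still only a regex approximation: a closing tag that appears inside a PHP string literal will end the match early, which may or may not matter for rough source metrics.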