advanced regex

Posted: Tue Apr 30, 2013 3:59 pm
by Tincan
hi there,

first i want to say sorry :) i come from germany and try my best to write good english.

i have the following problem:
i am writing a php script that analyses other php files to check whether they contain class definitions. therefore i need a regex that matches all classes/namespaces.
the regex has to be something like this:
[syntax]#(class[\s]*(?<classname>[^\s\{]*)[\s]*\{(?<classcontent>.*?))#is
(namespace[\s]*(?<namespace>[^\s\{]*)[\s]*\{(?<namespacecontent>.*?))#is[/syntax]
the next problem is that i use namespaces, so the class 'myClass' can exist in several namespaces :(
therefore i changed all my scripts so they look like this:

Code:

<?php
declare(encoding='whatever');
namespace a {
  // some content
  class myClass {
    // something inside
  }
}
namespace b {
  // some other content
  class myClass {
    // something else inside
  }
}
?>
so i tried to create a regex matching this. my result was
[syntax]
#^<\?(?:php)?[\s]*(?:declare[\s]*\([^\'\"]*(\'|\")(?:.*?)(?:\1)\)[\s]*)?(.*?)(?=[\s]*\?>)$#is

for explanation:
^<\?(?:php)?[\s]* -> matches <?php or <?, plus any following whitespace and line breaks
(?:declare[\s]*\([^\'\"]*(\'|\")(?:.*?)(?:\1)\)[\s]*)? -> matches an optional declare(encoding=''), the only statement allowed before a namespace declaration
(.*?) -> captures the rest up to the php closing tag
(?=[\s]*\?>)$ -> expects whitespace and line breaks followed by ?>
[/syntax]
this regex works, but if i try to integrate something for my namespaces or classes i get no results when using preg_match_all.

my script stores all found classes/files in an array that is written to a file so i can easily access it:

Code:

$variable = array(
  0 => array(
    'name' => 'myClass',
    0 => array(
      'namespace' => 'namespace a',
      'file' => '/dir/sub/location_a'
      ),
    1 => array(
      'namespace' => 'namespace b',
      'file' => '/dir/sub/location_b'
      )
    )
  );
if a file can no longer be located from this array the next time (file renamed, moved or something else), i want to browse all files and use the regex above to find all classes in each file, so i get a preg_match_all result like this:

Code:

array(
  0 => '***file_get_contents()***',
  1 => array(
    'namespace' => 'namespace_a',
    'classes' => array(
      0 => 'myClass'
      )
    ),
  2 => array(
    'namespace' => 'namespace_b',
    'classes' => array(
      0 => 'myClass'
      )
    )
  )
so my question is:

is there a way to create a regex that splits my file_get_contents string so i get all classes (no matter if they are commented out or defined inside a function or whatever) together with the namespace they are in?

Re: advanced regex

Posted: Tue Apr 30, 2013 4:47 pm
by requinix
This is a perfect chance to use a tokenizer. Like the one built into PHP.

Parse the file as if it were actual PHP code by reading tokens. When you find a T_CLASS then you know the next string is the class name; when you find a T_NAMESPACE then the next string is the namespace name (and stuff inside is part of the namespace).
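
The token walk described above could be sketched roughly like this. It is only a minimal sketch: `find_classes()` and the shape of its return value are assumptions made for illustration, not code from this thread. It relies on PHP's built-in `token_get_all()` and treats only a `T_STRING` that directly follows `T_CLASS` as a class name.

```php
<?php
// Sketch only: find_classes() is a hypothetical helper name, not from the thread.
// Returns a list of array('namespace' => ..., 'class' => ...) entries.
function find_classes($source)
{
    $tokens = token_get_all($source);
    $count = count($tokens);
    $skip = array(T_WHITESPACE, T_COMMENT, T_DOC_COMMENT);
    $namespace = ''; // '' stands for the global namespace
    $found = array();

    for ($i = 0; $i < $count; $i++) {
        if (!is_array($tokens[$i])) {
            continue; // single characters such as '{' or ';'
        }
        if ($tokens[$i][0] === T_NAMESPACE) {
            // Collect the name parts up to ';' or '{'. For "namespace {"
            // nothing is collected, so the name stays empty.
            // Limitation: the rarely used namespace\foo() operator is not handled.
            $namespace = '';
            for ($i++; $i < $count; $i++) {
                $t = $tokens[$i];
                if ($t === ';' || $t === '{') {
                    break;
                }
                if (is_array($t) && !in_array($t[0], $skip, true)) {
                    $namespace .= $t[1];
                }
            }
        } elseif ($tokens[$i][0] === T_CLASS) {
            // Look at the next significant token. Only a T_STRING is a class
            // name; Foo::class and anonymous classes are followed by something else.
            $j = $i + 1;
            while ($j < $count && is_array($tokens[$j]) && in_array($tokens[$j][0], $skip, true)) {
                $j++;
            }
            if ($j < $count && is_array($tokens[$j]) && $tokens[$j][0] === T_STRING) {
                $found[] = array('namespace' => $namespace, 'class' => $tokens[$j][1]);
            }
        }
    }
    return $found;
}
```

Calling `find_classes(file_get_contents($path))` (with `$path` being whatever file is scanned) should give one entry per class. Commented-out classes are skipped because comments are separate tokens, and a bracketed global namespace (`namespace { ... }`) is reported with an empty namespace name.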

Re: advanced regex

Posted: Thu May 09, 2013 9:04 am
by Tincan
the internal tokenizer is a nice thing, but when i tested it, it took about 6 times longer to analyse my files than my current regex.

the tokenizer analyses all tokens inside a php file, but i only need namespaces and classes, so a regex solution is more efficient than the tokenizer.
and there are some special cases that the tokenizer can not handle as easily as a regex.
for example

Code:

namespace {
  class myClass {
    // some code inside
  }
}
the string that follows the namespace token is NOT the namespace name, it's the class name.
so i also had to write code that filters empty namespace names, so in the end the tokenizer takes more than 8 times longer than my current regex.

Code:

$currentFileContent = file_get_contents(__FILE__);
// strip the php open tag, an optional declare statement and the closing tag
$currentFileContent = preg_replace(
	array('#^\<\?(php)?[\s]*#', '#declare[\s]*\([^\'\"]*(\'|\")(.*?)(\1)\)[\s]*;[\s]*#is', '#([\s]*\?>)$#is'),
	array('', '', ''),
	$currentFileContent
);
// split off the namespace declaration, if one is used
$currentNamespace = preg_split('#[\s]*(\;|\{)#', $currentFileContent, 2);
$currentNamespace[0] = preg_replace('#^namespace#', '', $currentNamespace[0]);
// collect class names, allowing optional final / extends / implements
preg_match_all('#(?:[\s]?(?:final[\s]+)?class[\s]+([^\s|^\{]+)(?:[\s]+extends[\s]+[^\s|^\{]+)?(?:[\s]+implements[\s]+(?:[^\s|^\,]+(?:[\s]*\,[\s]*[^\s|^\,]+)*))?[\s]?\{)#is', $currentNamespace[1], $currentClasses);
this is what i currently use.

so i want to know if this little piece of code catches all classes (with or without namespaces).
i think i should eliminate all comments like the following before searching for namespaces and classes, shouldn't i?

Code:

/**
* comment
*/

// these comments

/*
* and these comments
*/

Re: advanced regex

Posted: Thu May 09, 2013 12:31 pm
by requinix
It might be slower but it actually works without having to throw in crazy hacks for things like comments.

You now have four regular expressions and you're still not done, right? How many more do you think you'll need before you reach the same level of quality that you could get with a tokenizer?
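
On the comment question raised earlier in the thread: if the regex approach is kept, one possible way to remove comments first is to let the tokenizer identify them, instead of adding another regex. The following is a rough sketch under that assumption; `strip_comments()` is a hypothetical helper name and not part of the code posted above.

```php
<?php
// Sketch only: strip_comments() is a hypothetical helper name.
// Rebuilds the source from its tokens and leaves out every comment.
function strip_comments($source)
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            $out .= $token; // single characters such as '{' or ';'
            continue;
        }
        if ($token[0] === T_COMMENT || $token[0] === T_DOC_COMMENT) {
            // Older PHP versions include the trailing line break in a
            // "//" comment token; keep it so line structure is preserved.
            $out .= (substr($token[1], -1) === "\n") ? "\n" : ' ';
            continue;
        }
        $out .= $token[1];
    }
    return $out;
}
```

The class regex would then run on `strip_comments($currentFileContent)` instead of the raw file content. Note that this only deals with comments: text such as `class x {` inside a string literal would still match a regex, which is one more case the token-based approach avoids. PHP also ships `php_strip_whitespace()`, which returns a file's source with comments and whitespace removed, if working from a filename is acceptable.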