regex/string manipulation for tokenizer/parser

Posted: Tue Jan 09, 2007 4:47 pm
by rhaertel80
I'm trying to use regular expressions in a tokenizer/parser I am writing. I basically need to test if a string matches a regular expression, starting at a certain index. Since this is a parser it must be efficient.

It would be tempting to try:

Code:

preg_match($pattern, $buffer, $matches, PREG_OFFSET_CAPTURE, $index)

with a pattern such as "/^[\w]+/". However, the documentation explains that this will not work, because ^ still anchors to the start of the whole string rather than to the offset (and I confirmed this).
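
To illustrate the behaviour described above, here is a minimal sketch. The sample buffer and index are made up for the example. It contrasts ^ with the PCRE \G assertion, which the PHP manual describes as true only at the start point given by the offset argument. \G is not mentioned in the original post; it is shown here only as one possible alternative, not as the poster's method.

```php
<?php
// Made-up sample input; index 6 is the 'b' of "bar42".
$buffer = "foo = bar42";
$index  = 6;

// '^' asserts the start of the whole subject, so it cannot match at offset 6.
$caret = preg_match('/^\w+/', $buffer, $m1, 0, $index);

// '\G' asserts the position where matching started, i.e. the offset itself.
$anchored = preg_match('/\G\w+/', $buffer, $m2, 0, $index);

// Without any anchor, the engine scans forward from the offset (here index 3,
// a space) and finds a later word, which a tokenizer usually does not want.
$unanchored = preg_match('/\w+/', $buffer, $m3, 0, 3);

// With '\G', the same call fails because no word starts exactly at index 3.
$strict = preg_match('/\G\w+/', $buffer, $m4, 0, 3);

var_dump($caret, $anchored, $m2[0], $unanchored, $strict);
```

No substring is created in this variant, since the full buffer is passed on every call and only the offset changes.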

The obvious answer would be to take the substr, but I'm worried about the efficiency of this operation, since it would happen at almost every character position in the string being matched. Without knowing for sure, I'd bet that each substr call copies the relevant contents of the parent string, which is too much overhead for my application.

In C/C++, one might try passing the address of the start of the substring, i.e. (in C syntax):

Code:

preg_match(pattern, &buffer[index], matches)

I'm somewhat new to PHP, but I'm pretty sure this is not possible. Can anyone confirm this?

I also thought of using array_pop to permanently discard characters of the string (in which case the index is no longer necessary). This doesn't appear to be legal in PHP either, since array_pop only works on arrays. Is there a way of accomplishing this while keeping the value a string? Otherwise I can't do regular expression matching.

Are there any other ways of doing this short of writing my own regular expression compiler?

Thanks in advance,
Robbie

Posted: Wed Jan 10, 2007 1:32 am
by Kieran Huggins
Sorry, I'm a little foggy on what you want to do - can you provide a test case?

Posted: Wed Jan 10, 2007 2:32 am
by Chris Corbyn
Do you mean you only want to look at a subset of characters from the string? substr() would be the ideal way to do that, yes.
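
As a rough sketch of the substr() approach in a tokenizer loop: the three-entry token table and the tokenize() helper below are hypothetical and do not come from this thread. Because the remaining input is cut out first, the patterns can use ^ in the usual way.

```php
<?php
// Hypothetical token table: name => pattern anchored with '^'.
$patterns = [
    'WORD'  => '/^\w+/',
    'SPACE' => '/^\s+/',
    'PUNCT' => '/^[^\w\s]/',
];

// Hypothetical helper: splits $buffer into [name, text] pairs.
function tokenize(string $buffer, array $patterns): array
{
    $tokens = [];
    $index  = 0;
    $length = strlen($buffer);
    while ($index < $length) {
        // substr() builds a new string holding the remaining input.
        $rest = substr($buffer, $index);
        foreach ($patterns as $name => $pattern) {
            if (preg_match($pattern, $rest, $m) === 1) {
                $tokens[] = [$name, $m[0]];
                $index += strlen($m[0]);
                continue 2; // resume the while loop at the new index
            }
        }
        throw new RuntimeException("No token matches at offset $index");
    }
    return $tokens;
}

$tokens = tokenize("foo = bar42", $patterns);
print_r($tokens);
```

Each pass through the loop creates a new string with the rest of the input, which is the cost the original question worries about. Whether that matters for a given input size is something to measure rather than assume.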

substr() isn't inefficient, although if this is to happen for every token in your string then I'm not sure how fast it would be. PHP is a copy-on-write* language, so ideally it won't make two copies of the parent string.

Xdebug is a PHP extension which can do various memory benchmarks, among other things. I use it for large string processing when I come to optimize my code. That said, parser/tokenizer or not: optimize later, once it works.

* copy-on-write simply means that this doesn't duplicate values in memory:

Code:

$var1 = "foo";
$var2 = $var1;
$var3 = $var2;
//etc
But this does:

Code:

$var1 = "foo";
$var2 = $var1;
//copy made here
$var3 = $var2 . "bar";
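
One way to observe this without Xdebug is the built-in memory_get_usage(), which is swapped in here for simplicity. This is a minimal sketch, assuming a 1 MiB string built at runtime so that a copy would be large enough to notice. Exact byte counts depend on the PHP version, so none are quoted.

```php
<?php
// Build a large string at runtime (1 MiB) so a copy would be visible.
$var1 = str_repeat('x', 1 << 20);

$base = memory_get_usage();

// Plain assignment: both variables refer to the same string data.
$var2 = $var1;
$afterAssign = memory_get_usage();

// Concatenation produces a new string, so new memory is allocated.
$var3 = $var2 . "bar";
$afterConcat = memory_get_usage();

$assignCost = $afterAssign - $base;
$concatCost = $afterConcat - $afterAssign;

echo "assignment:    $assignCost bytes\n";
echo "concatenation: $concatCost bytes\n";
```

The assignment figure should stay small, while the concatenation figure should be at least the size of the new string.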