regex/string manipulation for tokenizer/parser

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

rhaertel80
Forum Newbie
Posts: 1
Joined: Tue Jan 09, 2007 3:34 pm

Post by rhaertel80 »

I'm trying to use regular expressions in a tokenizer/parser I am writing. I basically need to test if a string matches a regular expression, starting at a certain index. Since this is a parser it must be efficient.

It would be tempting to try:

Code: Select all

preg_match($pattern, $buffer, $matches, PREG_OFFSET_CAPTURE, $index)
with a pattern such as "/^[\w]+/". However, the documentation explains that this will not work, because ^ still anchors to the start of the whole string rather than to the offset (and I confirmed this).
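
For reference, here is a minimal sketch of the behavior described above, along with one possible workaround that is not part of the original post: PCRE's \G anchor asserts the position where matching started, which is the offset when one is passed to preg_match. The sample string and variable names are made up for illustration.

```php
<?php
$buffer = "foo = bar42";
$index  = 6; // byte offset of "bar42" in $buffer

// ^ refers to the start of the whole subject, so nothing matches from offset 6.
$caret = preg_match('/^\w+/', $buffer, $caretMatches, 0, $index);

// \G refers to the starting offset, so the match is anchored at $index.
$anchored = preg_match('/\G\w+/', $buffer, $anchoredMatches, 0, $index);

var_dump($caret, $anchored, $anchoredMatches);
```

Anchoring with \G keeps the pattern from scanning ahead for a later match, which is what a tokenizer wants when it asks "does a token start exactly here?".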

The obvious answer would be to take the substr, but I'm worried about the efficiency of this operation, since it will happen at almost every position in the string being matched. Without knowing for sure, I'd bet that each substr call copies the needed portion of the parent string, which is too much overhead for my application.

In C/C++, one might try passing the address of the start of the substring, i.e. (in C-like pseudocode):

Code: Select all

preg_match(pattern, &buffer[index], matches)
I'm somewhat new to PHP, but I'm pretty sure this is not possible. Can anyone confirm this?

I also thought of using array_pop to permanently discard elements of the string (in which case the index is no longer necessary). This doesn't appear to be legal in PHP either. Is there a way of accomplishing this while keeping the data as a string? Otherwise I can't do regular expression matching.

Are there any other ways of doing this short of writing my own regular expression compiler?

Thanks in advance,
Robbie
Kieran Huggins
DevNet Master
Posts: 3635
Joined: Wed Dec 06, 2006 4:14 pm
Location: Toronto, Canada

Post by Kieran Huggins »

Sorry, I'm a little foggy on what you want to do - can you provide a test case?
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Do you mean you only want to look at a subset of characters from the string? substr() would be the ideal way to do that, yes.

substr() isn't inefficient, although if this is to happen for every token in your string then I'm not sure how fast it would be. PHP is a copy-on-write* language, so passing the parent string around won't duplicate it. Note, though, that substr() itself returns a new string holding the extracted characters.
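
If the per-token substr() does turn out to matter, one alternative is to pass the current position as the offset argument of preg_match and anchor each pattern with \G. The following is only a sketch under that assumption: the token table, the tokenize() function, and the sample input are all hypothetical, not something taken from the original poster's parser.

```php
<?php
// Hypothetical token table: each pattern is anchored with \G so it can only
// match exactly at the offset passed to preg_match.
$tokenPatterns = [
    'space'  => '/\G\s+/',
    'number' => '/\G\d+/',
    'word'   => '/\G[A-Za-z_]\w*/',
    'symbol' => '/\G[^\s\w]/',
];

function tokenize(string $buffer, array $patterns): array
{
    $tokens = [];
    $index  = 0;
    $length = strlen($buffer);

    while ($index < $length) {
        $matched = false;
        foreach ($patterns as $type => $pattern) {
            // The fifth argument starts matching at $index, so no substr() is needed.
            if (preg_match($pattern, $buffer, $m, 0, $index) === 1) {
                if ($type !== 'space') {
                    $tokens[] = [$type, $m[0]]; // whitespace is skipped
                }
                $index += strlen($m[0]);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new RuntimeException("Unexpected input at offset $index");
        }
    }
    return $tokens;
}

$tokens = tokenize('x = 42;', $tokenPatterns);
print_r($tokens);
```

Every pattern here consumes at least one character, which is what keeps the loop from stalling on a zero-length match.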

Xdebug is a PHP extension which can do various memory benchmarks and profiling. I use it for large string processing when I come to optimize my code. That said, parser/tokenizer or not, optimize later, once it works.
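
Before reaching for a profiler, a rough timing with microtime() can already show whether substr() is a bottleneck. This is a crude stand-in for Xdebug, not a replacement for it, and the input string, loop step, and the \G variant are assumptions made for the sake of the comparison. Actual numbers will vary by PHP version and machine.

```php
<?php
// Made-up input: 5000 six-byte tokens ("token" plus a space).
$buffer = str_repeat('token ', 5000);
$length = strlen($buffer);

// Variant A: copy the remainder with substr() and anchor with ^.
$start  = microtime(true);
$countA = 0;
for ($i = 0; $i < $length; $i += 6) {
    $countA += preg_match('/^\w+/', substr($buffer, $i));
}
$timeA = microtime(true) - $start;

// Variant B: match in place using the offset argument and \G.
$start  = microtime(true);
$countB = 0;
for ($i = 0; $i < $length; $i += 6) {
    $countB += preg_match('/\G\w+/', $buffer, $m, 0, $i);
}
$timeB = microtime(true) - $start;

printf("substr + ^ : %d matches in %.4f s\n", $countA, $timeA);
printf("offset + \\G: %d matches in %.4f s\n", $countB, $timeB);
```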

* copy-on-write simply means that this doesn't duplicate values in memory:

Code: Select all

$var1 = "foo";
$var2 = $var1;
$var3 = $var2;
//etc
But this does:

Code: Select all

$var1 = "foo";
$var2 = $var1;
// a new string is allocated on the next line
$var3 = $var2 . "bar";
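
The difference between the two snippets can be observed with memory_get_usage(). The sketch below assumes a large string built at runtime with str_repeat(), since a short literal like "foo" is too small to show up in the figures. The exact byte counts depend on the PHP version and allocator.

```php
<?php
// Build a large string at runtime so the effect is measurable.
$var1 = str_repeat('x', 1000000);

$before = memory_get_usage();
$var2 = $var1;                  // shares the same underlying string
$afterAssign = memory_get_usage();
$var3 = $var2 . 'bar';          // concatenation builds a separate string
$afterConcat = memory_get_usage();

printf("assignment:    %d bytes\n", $afterAssign - $before);
printf("concatenation: %d bytes\n", $afterConcat - $afterAssign);
```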