Page 1 of 1

A regex doubt

Posted: Sun May 03, 2009 4:20 pm
by a0001_jay
Why does the regex "a+|aabb" match the test string "aabb" as "aa" and not "aabb".

My understanding is

The regex "a+|aabb" means

1. "one or more a's"

or

2. the exact string "aabb"



So for the test string "aabb", condition #2 i.e. "aabb" is satisfied. Also it is the longest match.



Also if you change the regex to "aabb|a+" for the same test string "aabb" it matches "aabb" .



Could some one clear my doubt

Many thanks in advance.

Re: A regex doubt

Posted: Mon May 04, 2009 2:06 am
by prometheuzz
Good question.

In regex-land, there are roughly two different regex implementations:
1 - nondeterministic finite automaton (NFA);
2 - deterministic finite automaton (DFA).

An NFA will go through all possible matches in and will will eventually return, or match, the longest match it has encountered. A DFA will stop matching when it finds the first positive match it encounters. Also, PHP's regex engine evaluates from left to right: with the regex '/a|b|c/', first a check is made to see if there's an 'a', then if there's a 'b' and finally it checks to see if there's a 'c'.

You can probably guess which flavour the PHP preg-library is... indeed: it's a DFA.

A big disadvantage of an NFA is that it will take a very long time when matching large strings and certain regex-es. That's probably why most regex-flavours are DFA's (Java, Perl, JS, Python, etc.).

Re: A regex doubt

Posted: Sun May 10, 2009 11:33 am
by a0001_jay
Thank you prometheuzz for you reply.

Re: A regex doubt

Posted: Mon May 11, 2009 6:48 am
by prometheuzz
a0001_jay wrote:Thank you prometheuzz for you reply.
You're welcome.