lookbehinds problem

Posted: Mon May 18, 2009 3:19 pm
by michalmas
Hello,

I'm seeing strange behaviour in a test with a positive lookbehind.

I have the text:

Code: Select all

some text is here ble bla blu
     now there is some 
     elem xsxsx
     struc something
     now we are inside and next element:
     elem value
     something els, and again
     elem val2
     end struc
and the regular expression:

Code: Select all

((?<=struc\s).)*elem\s\S+
I want it to match all elem elements that are preceded by struc. So, I want to get:

Code: Select all

elem value
elem val2
elem valXXX
Now I get:

Code: Select all

elem xsxsx
elem value
elem val2
elem valXXX
Thanks!

Re: lookbehinds problem

Posted: Thu May 21, 2009 2:12 am
by prometheuzz
That can't be true since there is no "elem valXXX" substring in your example input string.

Re: lookbehinds problem

Posted: Thu May 21, 2009 3:17 am
by michalmas
Oh, that's right. I didn't copy everything.

So, the text is:

Code: Select all

some text is here ble bla blu
now there is some 
elem xsxsx
struc something
now we are inside and next element:
elem value
something els, and again
elem val2
end struc
this is not inside
elem valXXX
struc 
asasas
end struc
sdffdsd
 
I believe the main error in the regex is the *, which lets the group match any number of times, including zero. But if I replace it with +, I don't get any results.
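
The guess about the * can be checked directly. Below is a minimal sketch using Python's `re` module, which is an assumption on my part (the thread uses PowerGrep and Perl), although a fixed-width lookbehind such as `(?<=struc\s)` should behave the same way there. `TEXT`, `star` and `plus` are illustrative names only.

```python
import re

# Sample text from the post above (TEXT is just an illustrative name).
TEXT = "\n".join([
    "some text is here ble bla blu",
    "now there is some ",
    "elem xsxsx",
    "struc something",
    "now we are inside and next element:",
    "elem value",
    "something els, and again",
    "elem val2",
    "end struc",
    "this is not inside",
    "elem valXXX",
    "struc ",
    "asasas",
    "end struc",
    "sdffdsd",
])

# With *, the group may repeat zero times, so the lookbehind is never required.
star = [m.group(0) for m in re.finditer(r"((?<=struc\s).)*elem\s\S+", TEXT)]

# With +, a single character sitting right after "struc " would have to come
# immediately before "elem", which never happens in this text.
plus = [m.group(0) for m in re.finditer(r"((?<=struc\s).)+elem\s\S+", TEXT)]

print(star)
print(plus)
```

The first list contains all four elem lines and the second is empty. The lookbehind only constrains the single character matched by the dot, not the whole stretch of text in front of elem.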

I also tried nesting the expression to say that any character must follow struc. This didn't work either.

Re: lookbehinds problem

Posted: Thu May 21, 2009 4:02 am
by Weirdan
So you need to match 'elem something' given there was a 'struc' earlier in the text? I guess you need to drop the second pair of parentheses around the lookbehind, like this: (?<=struc\s.*)elem\s\S+
Also make sure you're using a multiline regexp and that the dot matches newlines as well (by specifying the /ms flag combo).

Re: lookbehinds problem

Posted: Thu May 21, 2009 4:26 am
by michalmas
@Weirdan:

Your solution doesn't return any results (dot matching newlines is on).

The main purpose of this is to get all elem entries that are INSIDE the struc-endstruc structure. So, the requirement is that there was a struc before, but at the same time no end-struc after it.

Re: lookbehinds problem

Posted: Thu May 21, 2009 4:58 am
by prometheuzz
Do it in two steps:

1 - get everything in between struc and end struc
2 - for every match in step 1, find all elem's
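
The two steps above can be sketched as follows. This is only an illustration in Python's `re` module, assuming non-nested blocks; the function name `elems_inside_struc` and the short `sample` string are made up for the example and are not from the thread.

```python
import re

def elems_inside_struc(text):
    """Step 1: grab each struc ... end struc body; step 2: find elems in it."""
    # (?<!end\s) keeps the "struc" of an "end struc" from opening a block.
    block = re.compile(r"(?<!end\s)struc\b(.*?)end\sstruc", re.DOTALL)
    found = []
    for m in block.finditer(text):
        # Step 2: search only inside the body captured in step 1.
        found.extend(re.findall(r"elem\s\S+", m.group(1)))
    return found

# Hypothetical, shortened input in the same shape as the sample text.
sample = (
    "elem a\n"
    "struc x\n"
    "elem b\n"
    "stuff\n"
    "elem c\n"
    "end struc\n"
    "elem d\n"
    "struc \n"
    "end struc\n"
)
print(elems_inside_struc(sample))
```

Only the elems between a struc and its end struc are collected; elem a and elem d fall outside both blocks.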

And if your struc/end struc's are nested, then regex is not the right tool for the job. You need a true recursive descent parser. In which case, Google is your friend.

If they're not nested, this might work (untested!):

Code: Select all

'/elem\s\S+(?=(?:(?!struc).)*end\sstruc)/s'
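
For anyone who wants to try it, here is a small check of that pattern against the sample text from earlier in the thread. It is a sketch in Python's `re` module rather than PHP or PowerGrep (an assumption, though the syntax used is common to all three), with `re.DOTALL` standing in for the /s flag. `TEXT` and `matches` are illustrative names.

```python
import re

# Sample text as posted earlier in the thread.
TEXT = "\n".join([
    "some text is here ble bla blu",
    "now there is some ",
    "elem xsxsx",
    "struc something",
    "now we are inside and next element:",
    "elem value",
    "something els, and again",
    "elem val2",
    "end struc",
    "this is not inside",
    "elem valXXX",
    "struc ",
    "asasas",
    "end struc",
    "sdffdsd",
])

# An elem matches only if "end struc" can be reached from it
# without passing the start of another "struc" on the way.
pattern = re.compile(r"elem\s\S+(?=(?:(?!struc).)*end\sstruc)", re.DOTALL)

matches = pattern.findall(TEXT)
print(matches)
```

On this input the two elems inside the first block are reported, while elem xsxsx and elem valXXX are skipped because a struc opens before the next end struc.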
Btw, this looks like the same thing as in your other thread. Perhaps it's better to keep the discussion in one thread?

Good luck.

Re: lookbehinds problem

Posted: Thu May 21, 2009 5:01 am
by prometheuzz
Weirdan wrote:So you need to match 'elem something' given there was a 'struc' earlier in the text? I guess you need to drop second parenthesis around the lookbehind, like this (?<=struc\s.*)elem\s\S+ ...

"Variable length look behinds" are not supported by PHP's preg-methods.

Re: lookbehinds problem

Posted: Thu May 21, 2009 12:15 pm
by Weirdan
prometheuzz wrote:
Weirdan wrote:like this (?<=struc\s.*)elem\s\S+ ...
"Variable length look behinds" are not supported by PHP's preg-methods.
You are right, I forgot that. :oops:
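
The same restriction is easy to observe in Python's `re` module, used here purely as a stand-in since it shares the fixed-width limitation; this says nothing by itself about PowerGrep. The helper name `compiles` is made up for the example.

```python
import re

def compiles(pattern):
    """Return True if Python's re accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised for, among other things, a lookbehind of variable width.
        return False

fixed_ok = compiles(r"(?<=struc\s)elem\s\S+")       # lookbehind is always 6 chars
variable_ok = compiles(r"(?<=struc\s.*)elem\s\S+")  # .* makes the width variable

print(fixed_ok, variable_ok)
```

The fixed-width version compiles and the variable-width one is rejected at compile time, before any text is searched.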

Re: lookbehinds problem

Posted: Thu May 21, 2009 5:16 pm
by michalmas
prometheuzz wrote:Do it in two steps:

1 - get everything in between struc and end struc
2 - for every match in step 1, find all elem's
The requirement is that it has to be done in one expression. I realize the problem could be solved easily if a special program were written.
prometheuzz wrote:And if your struc/end struc's are nested, then regex is not the right tool for the job. You need a true recursive descent parser. In which case, Google is your friend.
I agree - nesting can't be expressed in regexes. And you are right - the alternative solution was a parser.
prometheuzz wrote:Btw, this looks like the same thing as in your other thread. Perhaps it's better to keep the discussion in one thread?
The problem is exactly the same, but I wanted to approach it from two different angles (the most intuitive ones). I will merge the threads later, though.
prometheuzz wrote:"Variable length look behinds" are not supported by PHP's preg-methods.
I am using PowerGrep for this :oops:
prometheuzz wrote:elem\s\S+(?=(?:(?!struc).)*end\sstruc)
And now I am lost. It is exactly what you suggested to me some time ago, but back then I couldn't make it work. Now it does. And the hack you used works even for nested elements.
But to be sure - you check whether elem is followed by an end-struc that is not preceded by struc (the hack)?

AND:
does anyone know why neither of the

Code: Select all

((?<=struc\s).)*elem\s\S+
or

Code: Select all

(?<=struc\s.*)elem\s\S+
works? Neither in Perl nor in PowerGrep...

Thanks!

Re: lookbehinds problem

Posted: Fri May 22, 2009 7:40 am
by prometheuzz
michalmas wrote:
"Variable length look behinds" are not supported by PHP's preg-methods.
I am using PowerGrep for this :oops:
I've never used PowerGrep, but I am pretty sure it also does not support "look behinds" without a fixed length. Very few regex engines do (not even Perl's regex engine does!).
michalmas wrote:
elem\s\S+(?=(?:(?!struc).)*end\sstruc)
And now I am lost. It is exactly what you suggested to me some time ago, but back then I couldn't make it work. Now it does. And the hack you used works even for nested elements.
But to be sure - you check whether elem is followed by an end-struc that is not preceded by struc (the hack)?
A short explanation is in order:

Code: Select all

elem              // match "elem"
\s                // match any white space char
\S+               // match one or more characters other than white space chars
(?=               // start positive look ahead
  (?:             //   start non capturing group 1
    (?!struc).    //     if "struc" does not start at this position, match any character
  )               //   end non capturing group 1
  *               //   non capturing group 1 zero or more times
  end\sstruc      //   match "end", a white space char followed by "struc"
)                 // end positive look ahead
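
To see what this breakdown implies for nested input, here is a small experiment. It is a sketch in Python's `re` module (an assumption; the thread uses PHP syntax and PowerGrep), and the nested `NESTED` sample is hypothetical, not taken from the thread.

```python
import re

# Same pattern as explained above; re.DOTALL plays the role of the /s flag.
pattern = re.compile(r"elem\s\S+(?=(?:(?!struc).)*end\sstruc)", re.DOTALL)

# Hypothetical nested input: an inner struc inside an outer one.
NESTED = "\n".join([
    "struc outer",
    "elem a",        # inside outer, but an inner struc opens before any end struc
    "struc inner",
    "elem b",        # inside inner
    "end struc",
    "elem c",        # inside outer, after inner has closed
    "end struc",
    "elem d",        # outside everything
])

found = pattern.findall(NESTED)
print(found)
```

Here elem b and elem c are found, elem d is correctly skipped, and elem a is also skipped even though it sits inside the outer struc. So the lookahead covers part of the nested case, but not all of it, which fits the earlier remark that nested input really needs a parser.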
michalmas wrote:AND:
does anyone know why neither of the

Code: Select all

((?<=struc\s).)*elem\s\S+
or

Code: Select all

(?<=struc\s.*)elem\s\S+
works? Neither in Perl nor in PowerGrep...

Thanks!
Just for clarity, could you post this question with the target text? Thanks.