Page 1 of 2

Regex negative look behind???

Posted: Tue Dec 13, 2005 6:53 pm
by alex.barylski
Not very skilled in regex, but here goes my question for the day... :)

I am searching for a semi-colon in source code...but I only want to have it match if the semi-colon is indeed a statement delimiter...

I figure this is possible by using a negative look behind??? From my understanding of what one is...

Basically my regex would go something like this:

1) Find a semi-colon
2) Trace backwards until we encounter the following:
- Another semi-colon, comment (// or */), PHP script tags (<? or <?php)
- The previous comparison MUST be ignored if were inside a STRING or COMMENT

Is this possible in regex???

Cheers :)

Posted: Tue Dec 13, 2005 10:06 pm
by alex.barylski
Doesn't look like it's possible :(

look behinds must have a fixed-width...which AFAIK means...what I want to do isn't possible :(

Boo hoo :)

Posted: Tue Dec 13, 2005 10:32 pm
by shoebappa
I'm not entirely sure what you're trying to do, or what a "look behind" is.

If you're trying to find lines the end in a ; that aren't in comments, and are in PHP code.

First you could match the PHP code, then strip out all comments. Then match lines ending in ;

Posted: Tue Dec 13, 2005 10:45 pm
by Zoxive
...

Posted: Tue Dec 13, 2005 10:59 pm
by alex.barylski
NonStopableForce wrote:Its possible, but i dont know how to do it, im no help.

Just saying its possible.

-NSF
The more I read about look behinds...the more i'm thinking it's not technically possible...

It's hard to find anything worth while reading...it's obviously a complex subject, under the context of regex anyways, cuz not many articles explain it well... :lol:

I'm guessing it has somehting to do with the way regex is implemented using a DFA...thats why it's limited to a fixed width look behind...

Of course thats just a guess going on my basic understanding of computer science :)

Posted: Tue Dec 13, 2005 11:02 pm
by alex.barylski
shoebappa wrote:I'm not entirely sure what you're trying to do, or what a "look behind" is.

If you're trying to find lines the end in a ; that aren't in comments, and are in PHP code.

First you could match the PHP code, then strip out all comments. Then match lines ending in ;
Basically use regex to locate all "statements" in a PHP source file...or any C hybrid language for that matter...

I can't strip anything though, cuz I need to perform other operations...and well...simply just reinitilizing the buffer...that doesn't appeal to me :?

Posted: Wed Dec 14, 2005 2:07 pm
by Zoxive
...

Posted: Wed Dec 14, 2005 7:27 pm
by alex.barylski
NonStopableForce wrote:viewtopic.php?t=40169
Negative lookbehind syntax:
Code:

Code: Select all

(?<! ... )pattern

Example negative lookbehind:
Code:

Code: Select all

/\b(?<!a)[a-z]+/i

The above matches any word which does NOT start with a letter "a". The \b assertion just makes sure were at the start of a word.
Dont get exacly what you want, but i found that.

-NSF
I apologize if I sound ignorant...and this isn't targeted soley at you...but it seems this community is to eager to help and often that leads to an almost insulting reply...

Basically...what i'm trying to say is...unless I ask for resources or references or articles, etc...

I'm not looking for URL's

I know how to Google and search too...after all I am a programmer also ;)

I've read d11wtq articles on regex...a few times each...even been helped directly by him - thanks dude!!!

This reply is not intended to disuade you from ever replying to my posts again...thinking i'm ungrateful...but honestly...only a handful of times have I ever recieved the answer I was looking for on these forums :?

Partially my fault, for not explaining myself...but i'm lazy... :)

So take not kiddies :twisted:

Unless I ask for a URL...don't bother googling...cuz I've already done that...as a programmer i'm impatient - I want answers now too...so thats the first thing I do...is Google keywords and so on...

*** END OF RANT ***

Cheers :)

Posted: Wed Dec 14, 2005 7:40 pm
by Chris Corbyn
I'm intrigued. So lets clarify what you need to acheive.

1. Find a semi-colon
2. Must be a delimiter
3. By which we mean
3. a) It's not inside a string
3. b) It's not inside a comment
3. c) It's not immediately following a backslash

Or does it simply have to be on the end of a line?

If it falls into (a) or (b) or (c) then you're best off going down a tokenzing route depending upon what you're looking to achieve. Regex can do that yes... but it's certainly more than something as simple as a lookbehind :)

This regex was written by myself to tokenize source code and as you can see it's quite daunting ;)

Code: Select all

$re = "@(?:(?<!\\\\)\'.*?(?<!\\\\)\')|(?:(?<!\\\\)\".*?(?<!\\\\)\")|(?:(?<!\\\\)#.*?\n)|(?:(?<!\\\\)//.*?\n)|(?:(?<!\\\\)/\\*.*?\\*/)|0x[a-z0-9]+|\\s+|\\W|\\w+@ism";
Now that's reasonably basic... I'll try to explain how it works. Each of the | characters is offering alternatives... so break it down into smaller patterns at those points. The (?<!\\\\) parts or negative lookbehinds to ensure the source didn't contain a backslash to escape to following character (token).

If we break apart the above at each | we see (in this order):

Single quoted 'strings'
Double quoted "strings"
Hash #style comments
C style //comments
Multi-line /* comments */
Hexadecimal 0x0F numbers
Whitespace
Non-alphanumeric characters
Alphanumeric characters

You'd need to go down those lines but perhaps in more detail with the non-alphanumeric chars (that's essentially syntax in itself).

If you really wanted to you could add anything specific (such as semi-colons) as long as you insert in to the pattern after the strings and the comments ;)

Post here again if I've utterly confused you... which I think I have :? Sounds like you're trying to parse a language though :)

EDIT | I'm off to bed but I'll check this again when I wake and see if you got this sorted ;)

Posted: Wed Dec 14, 2005 11:05 pm
by alex.barylski
d11wtq wrote:I'm intrigued. So lets clarify what you need to acheive.

1. Find a semi-colon
2. Must be a delimiter
3. By which we mean
3. a) It's not inside a string
3. b) It's not inside a comment
3. c) It's not immediately following a backslash

Or does it simply have to be on the end of a line?

If it falls into (a) or (b) or (c) then you're best off going down a tokenzing route depending upon what you're looking to achieve. Regex can do that yes... but it's certainly more than something as simple as a lookbehind :)

This regex was written by myself to tokenize source code and as you can see it's quite daunting ;)

Code: Select all

$re = "@(?:(?<!\\\\)\'.*?(?<!\\\\)\')|(?:(?<!\\\\)".*?(?<!\\\\)")|(?:(?<!\\\\)#.*?\n)|(?:(?<!\\\\)//.*?\n)|(?:(?<!\\\\)/\\*.*?\\*/)|0x[a-z0-9]+|\\s+|\\W|\\w+@ism";
Now that's reasonably basic... I'll try to explain how it works. Each of the | characters is offering alternatives... so break it down into smaller patterns at those points. The (?<!\\\\) parts or negative lookbehinds to ensure the source didn't contain a backslash to escape to following character (token).

If we break apart the above at each | we see (in this order):

Single quoted 'strings'
Double quoted "strings"
Hash #style comments
C style //comments
Multi-line /* comments */
Hexadecimal 0x0F numbers
Whitespace
Non-alphanumeric characters
Alphanumeric characters

You'd need to go down those lines but perhaps in more detail with the non-alphanumeric chars (that's essentially syntax in itself).

If you really wanted to you could add anything specific (such as semi-colons) as long as you insert in to the pattern after the strings and the comments ;)

Post here again if I've utterly confused you... which I think I have :? Sounds like you're trying to parse a language though :)

EDIT | I'm off to bed but I'll check this again when I wake and see if you got this sorted ;)
Hey man... :)

If you can solve this using regex...I'll call you "Da' man" for a week ;)

Da' man - Is a highly regarded, well respected title coming from me...just ask anyone who knows me

Anyways...the problem...

Regex to basically locate statements inside a C style/hybrid source file...like PHP or Perl, etc...

So the semi-colon MAY NOT be located EOL...something that simple won't work... :(

As you know...it's syntactically valid to write something like:

Code: Select all

int a = 0; int b = 2;
All on the same line...so yea...assuming EOL won't work :?

I cannot use the tokenizer functions...although it sounds like I should...

I do not need the tokens that atomized or such low level tokens anyways :)

I'm thinking regex won't be able to cut the mustard though :(

My understanding of parsing and tokenizing goes like this...

PHP requires a CFG to tokenize/parse the source and regex won't cut it...because I believe it falls under the non-recursive context-sensitive heading...

I think...don't quote me on it...

You can use regex to extract some details of PHP source, like I said earlier...

function declarations/prototypes in C++ or declaration/definition in PHP's case...only because the start and finish of a function declaration are terminals...

Code: Select all

/function\s*(.+)\(.+\);/
Not a valid namespace identifier, but you get the point... :P

Anyways...

Regex works because it's matching two terminals function & );

I can't see regex working on matching a C statement, because it has to many variables which can only be solved recursively with a CFG like EBNF...

For instance:

1) $varname = 'This is a valid statement';
2) func_call('This is a test param');
3) echo 'Hello world';

Not to mention the multiple statements/per line problem...

EBNF works because it tackles tokenizing on a much grander scale...you basically write a grammar to take you from start to finish and when it encounters a non-terminal (an expression, or statement, etc...) it has the ability to pass that onto another tokenizing rule...which can recursively keep doing so until the token has been atomized or becomes a terminal...meaning it can't be broken down any further...like a keyword or number...

Regex tokenizing "I think" is impossible because in order to tokenize statements, like I want too...requires context...meaning...

In order to know what a statement is...you must understand what came before it...AKA what came before sets the context of what now becomes...

I think thats where the CFG (context free grammar) comes from...because EBNF has the ability to understand context...

I'm pretty sure regex does not!!! I could be wrong though...

Hence the reason, when I discovered what look behinds were (when you showed me that regex to clean paths of duplicate slashes) I was struck with the idea of using that to locate statements, etc in a source file...

Atleast this way...you find the semi-colon (aka terminal) and work backwards until you have reached something you know would prevent an statement from continuing...

ie: */ or // or ;

But....I'm pretty sure regex can't handle this because of the implementation uses a DFA (deterministic finite automaton),...

Don't ask me WTF that is exactly...cuz I don't know...never interested me enough to bother learning about :P

I think it has to do with a finite state machine...and judging by the sounds of that...to me it would suggest...

Anything that implements a DFA would have a finite state...every time an engine has to change course, it needs to save it's state on the stack, wherever...

For example:

If you imagine a psuedo regex implementation...you'll likely see something like:

Code: Select all

$patt = '/This/'; // Silly pattern...I don't know regex :(
$buff = "This is the text which we are going to scan";


for($i=0; $i<strlen($buff); $i++){
  $char = $buff[$i];

  // Trivial pattern match on a char by char basis
  for($j=0; $j<strlen($patt); $j++){
    $char2 = $patt[$j];

    if($char2 == $char)
      // Concatenate character to temp buffer
      // Compare temp buffer with [b]$buff[/b] using substr()
      // start is the first offset when we found a match
      // count is the first offset when we found a match + $j
      // if comparison is true, push result into $arr_match and continue
    else
      // Match failed, clear buffer, continue search from last known offset
  }
}
Anyways enough psuedo code :roll:

You can see, that in order for a look behind to work, regex needs some way of saving state (current offset, where it started, etc...)

so going backwards would require again...yet another state mechanism...

I have no idea whether thats why regex look behinds are fixed width, but thats my guess...

Anyways...I just visited another programming site which I frequent...and got carried away at that forum and have now lost my chain of thought....and i'm sure everyone is tired of my rambling...so I'll go away now :)

Pardon the brain dump...but I've for a while now...wanted to share my perception of parsing/tokenizing with someone other than my dog or Mom who can't even find the Return key :?

Sleepy time :)

Posted: Thu Dec 15, 2005 5:11 am
by Chris Corbyn
Complex 8O

You're right in your assumptions about the fixed width lookbehinds and hence it's not ideal for everything, including most of this.

You sound like you've done your research and understand your goals here far more than I do so I'm not sure how far I can take you with this. Regex, from the wider perspective looks like a lost cause but (I haven't looked closely enough) perhaps with lots of recursion you'll produce some good results.

Let us know if you come up with anything.... this looks very interesting indeed ;)

Posted: Thu Dec 15, 2005 9:17 am
by sweatje
One ugly hack to deal with the lack of variable length look behinds is to reverse the string and use look aheads instead...

Posted: Thu Dec 15, 2005 9:28 am
by feyd
I feel a string parser would fair a lot easier and faster (performance) to do this task.

Posted: Thu Dec 15, 2005 3:38 pm
by Ambush Commander
Would you like me to write one?

Posted: Thu Dec 15, 2005 5:16 pm
by alex.barylski
d11wtq wrote:Complex 8O

You're right in your assumptions about the fixed width lookbehinds and hence it's not ideal for everything, including most of this.

You sound like you've done your research and understand your goals here far more than I do so I'm not sure how far I can take you with this. Regex, from the wider perspective looks like a lost cause but (I haven't looked closely enough) perhaps with lots of recursion you'll produce some good results.

Let us know if you come up with anything.... this looks very interesting indeed ;)
Sh*te :lol:

Maybe i'll just use the PHP tokenizer...