d11wtq wrote: I'm intrigued. So let's clarify what you need to achieve.
1. Find a semi-colon
2. Must be a delimiter
3. By which we mean
3. a) It's not inside a string
3. b) It's not inside a comment
3. c) It's not immediately following a backslash
Or does it simply have to be on the end of a line?
If it falls into (a) or (b) or (c) then you're best off going down a tokenizing route, depending upon what you're looking to achieve. Regex can do that, yes... but it takes a lot more than something as simple as a lookbehind
This regex was written by myself to tokenize source code and as you can see it's quite daunting
Code: Select all
$re = "@(?:(?<!\\\\)\'.*?(?<!\\\\)\')|(?:(?<!\\\\)\".*?(?<!\\\\)\")|(?:(?<!\\\\)#.*?\n)|(?:(?<!\\\\)//.*?\n)|(?:(?<!\\\\)/\\*.*?\\*/)|0x[a-z0-9]+|\\s+|\\W|\\w+@ism";
Now that's reasonably basic... I'll try to explain how it works. Each of the | characters is offering alternatives... so break it down into smaller patterns at those points. The (?<!\\\\) parts are negative lookbehinds which ensure the source didn't contain a backslash escaping the following character (token).
If we break apart the above at each | we see (in this order):
Single quoted 'strings'
Double quoted "strings"
Hash #style comments
C style //comments
Multi-line /* comments */
Hexadecimal 0x0F numbers
Whitespace
Non-alphanumeric characters
Alphanumeric characters
You'd need to go down those lines but perhaps in more detail with the non-alphanumeric chars (that's essentially syntax in itself).
If you really wanted to, you could add anything specific (such as semi-colons) as long as you insert it into the pattern after the strings and the comments
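To make that last point concrete, here's a rough sketch of a semi-colon branch placed after the strings and comments. It is not the pattern above: it's a cut-down one written for illustration, and split_statements() is a made-up name. Because the string and comment branches are tried first, any ; inside them is swallowed before the ; branch gets a look in.

```php
<?php
// Hypothetical helper (not from the post above): splits source into
// statements at every ';' that is not inside a string or a comment.
function split_statements(string $src): array
{
    // Strings and comments come first so they swallow any ';' they contain.
    $re = '~
        \'(?:\\\\.|[^\'\\\\])*\'   # single quoted string
      | "(?:\\\\.|[^"\\\\])*"      # double quoted string
      | \#[^\n]*                   # hash comment
      | //[^\n]*                   # line comment
      | /\*.*?\*/                  # block comment
      | (;)                        # statement delimiter, captured as group 1
      | .                          # anything else, one char at a time
    ~sx';

    preg_match_all($re, $src, $matches, PREG_SET_ORDER);

    $statements = [];
    $current = '';
    foreach ($matches as $m) {
        $current .= $m[0];
        // Group 1 only exists when the ';' branch was the one that matched
        if (isset($m[1]) && $m[1] === ';') {
            $statements[] = trim($current);
            $current = '';
        }
    }
    if (trim($current) !== '') {
        $statements[] = trim($current); // trailing text with no delimiter
    }
    return $statements;
}
```

Feeding it something like "\$a = 'x;y'; foo();" should give back two statements, with the ; inside the quotes left alone. It only treats backslash as an escape inside strings, so take it as a starting point rather than a real PHP lexer.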
Post here again if I've utterly confused you... which I think I have

Sounds like you're trying to parse a language though
EDIT | I'm off to bed but I'll check this again when I wake and see if you got this sorted

Hey man...
If you can solve this using regex...I'll call you
"Da' man" for a week
Da' man - Is a highly regarded, well respected title coming from me...just ask anyone who knows me
Anyways...the problem...
Regex to basically locate statements inside a C style/hybrid source file...like PHP or Perl, etc...
So the semi-colon MAY NOT be located at EOL...something that simple won't work...
As you know...it's syntactically valid to write something like:
All on the same line...so yea...assuming EOL won't work
I cannot use the tokenizer functions...although it sounds like I should...
I do not need the tokens that atomized, or such low-level tokens, anyways
I'm thinking regex won't be able to cut the mustard though
My understanding of parsing and tokenizing goes like this...
PHP requires a CFG to tokenize/parse the source and regex won't cut it...because I believe it falls under the non-recursive context-sensitive heading...
I think...don't quote me on it...
You can use regex to extract some details of PHP source, like I said earlier...
function declarations/prototypes in C++ or declaration/definition in PHP's case...only because the start and finish of a function declaration are terminals...
Not a valid namespace identifier, but you get the point...
Anyways...
Regex works because it's matching two terminals
function &
);
I can't see regex working on matching a C statement, because it has too many variables which can only be solved recursively with a CFG like EBNF...
For instance:
1) $varname = 'This is a valid statement';
2) func_call('This is a test param');
3) echo 'Hello world';
Not to mention the multiple statements/per line problem...
EBNF works because it tackles tokenizing on a much grander scale...you basically write a grammar to take you from start to finish and when it encounters a non-terminal (an expression, or statement, etc...) it has the ability to pass that onto another tokenizing rule...which can recursively keep doing so until the token has been atomized or becomes a terminal...meaning it can't be broken down any further...like a keyword or number...
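Here's a minimal sketch of that "pass it on to another rule" idea, assuming a toy grammar I made up for the purpose (it is nowhere near PHP's real grammar). The function names parse_expr() and parse_term() are hypothetical too. Each non-terminal is a function, and the recursion happens when term hands a bracketed chunk back to expr.

```php
<?php
// Hypothetical toy grammar, written in EBNF:
//   expr = term { "+" term } ;
//   term = NUMBER | "(" expr ")" ;

// Non-terminal "expr": hands each operand off to the "term" rule.
function parse_expr(array $tokens, int &$pos): int
{
    $value = parse_term($tokens, $pos);
    while ($pos < count($tokens) && $tokens[$pos] === '+') {
        $pos++; // consume the "+" terminal
        $value += parse_term($tokens, $pos);
    }
    return $value;
}

// Non-terminal "term": either a NUMBER terminal or a nested expr.
function parse_term(array $tokens, int &$pos): int
{
    if ($pos >= count($tokens)) {
        throw new RuntimeException('Unexpected end of input');
    }
    $tok = $tokens[$pos++];
    if ($tok === '(') {
        $value = parse_expr($tokens, $pos); // recurse back into "expr"
        if (($tokens[$pos] ?? null) !== ')') {
            throw new RuntimeException('Expected ")"');
        }
        $pos++; // consume the ")" terminal
        return $value;
    }
    if (ctype_digit($tok)) {
        return (int) $tok; // a terminal: can't be broken down any further
    }
    throw new RuntimeException("Unexpected token: $tok");
}

// Regex still does the flat part: chopping the text into terminals.
preg_match_all('/\d+|[+()]/', '1 + (2 + (3 + 4))', $m);
$pos = 0;
$result = parse_expr($m[0], $pos); // sums the nested expression
```

The nesting depth isn't fixed anywhere in the code, which is the part a single flat pattern struggles with.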
Regex tokenizing "I think" is impossible because in order to tokenize statements, like I want to...requires context...meaning...
In order to know what a statement is...you must understand what came before it...AKA what came before sets the context of what now becomes...
I think that's where the CFG (context free grammar) comes from...because EBNF has the ability to understand context...
I'm pretty sure regex does not!!! I could be wrong though...
Hence the reason, when I discovered what look behinds were (when you showed me that regex to clean paths of duplicate slashes) I was struck with the idea of using that to locate statements, etc in a source file...
At least this way...you find the semi-colon (aka terminal) and work backwards until you have reached something you know would prevent a statement from continuing...
ie: */ or // or ;
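For comparison, here's a rough sketch of the same hunt done as a plain forward scan instead of working backwards. This is a swapped-in technique, not a lookbehind and not regex at all: a little hand-rolled state machine. find_delimiters() and the state names are made up for illustration, and it assumes backslash is the only escape inside strings.

```php
<?php
// Hypothetical helper: returns the offset of every ';' that is a real
// delimiter, i.e. not inside a string or a comment.
function find_delimiters(string $src): array
{
    $offsets = [];
    $state = 'code'; // code | squote | dquote | line | block
    $len = strlen($src);
    for ($i = 0; $i < $len; $i++) {
        $c = $src[$i];
        $next = $i + 1 < $len ? $src[$i + 1] : '';
        switch ($state) {
            case 'code':
                if ($c === "'") { $state = 'squote'; }
                elseif ($c === '"') { $state = 'dquote'; }
                elseif ($c === '#' || ($c === '/' && $next === '/')) { $state = 'line'; }
                elseif ($c === '/' && $next === '*') { $state = 'block'; $i++; }
                elseif ($c === ';') { $offsets[] = $i; }
                break;
            case 'squote':
            case 'dquote':
                if ($c === '\\') { $i++; } // skip the escaped character
                elseif ($c === ($state === 'squote' ? "'" : '"')) { $state = 'code'; }
                break;
            case 'line':
                if ($c === "\n") { $state = 'code'; }
                break;
            case 'block':
                if ($c === '*' && $next === '/') { $state = 'code'; $i++; }
                break;
        }
    }
    return $offsets;
}
```

Once you have the offsets, each statement is just the text between one delimiter and the next, which is roughly what the work-backwards idea is after.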
But....I'm pretty sure regex can't handle this because the implementation uses a DFA (deterministic finite automaton)...
Don't ask me WTF that is exactly...cuz I don't know...never interested me enough to bother learning about
I think it has to do with a finite state machine...and judging by the sounds of that...to me it would suggest...
Anything that implements a DFA would have a finite state...every time an engine has to change course, it needs to save its state on the stack, wherever...
For example:
If you imagine a pseudo regex implementation...you'll likely see something like:
Code: Select all
$patt = 'This'; // Silly pattern, literal text only...I don't know regex :(
$buff = "This is the text which we are going to scan";
$arr_match = array();
$buffLen = strlen($buff);
$pattLen = strlen($patt);
// Try every offset in the buffer as a possible start of a match
for ($i = 0; $i + $pattLen <= $buffLen; $i++) {
    // Trivial pattern match on a char by char basis
    for ($j = 0; $j < $pattLen; $j++) {
        if ($buff[$i + $j] != $patt[$j]) {
            // Match failed, continue search from the next start offset
            continue 2;
        }
    }
    // Every char matched: start is the offset where the match began,
    // end is that offset plus the pattern length, minus one
    $arr_match[] = array('start' => $i, 'end' => $i + $pattLen - 1);
}
Anyways enough pseudo code
You can see that in order for a look behind to work, regex needs some way of saving state (current offset, where it started, etc...)
so going backwards would require again...yet another state mechanism...
I have no idea whether that's why regex look behinds are fixed width, but that's my guess...
Anyways...I just visited another programming site which I frequent...and got carried away at that forum and have now lost my train of thought....and I'm sure everyone is tired of my rambling...so I'll go away now
Pardon the brain dump...but I've for a while now...wanted to share my perception of parsing/tokenizing with someone other than my dog or Mom who can't even find the Return key
Sleepy time
