[SOLVED] RFC 2822 with stupid circular references :(

Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

I was moving along so well with this, too. I'm going for 100% compliance with a whole bunch of RFCs (RFC 2822 providing the bulk of it).

In my quest for validating and parsing some parts of email headers I'm turning the ABNF syntax (i.e. this stuff

Code: Select all

comment         =       "(" *([FWS] ccontent) [FWS] ")"
)

into PCRE groups which I can "glue together" to make the tokens described in the RFC.

However, I've hit a big hurdle (a brick wall??):

Code: Select all

ccontent        =       ctext / quoted-pair / comment

comment         =       "(" *([FWS] ccontent) [FWS] ")"
I was defining tokens in *exactly* the same way the RFC refers to them until this point:

Code: Select all

//Refer to RFC 2822 for ABNF
    $noWsCtl = '[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]'; //NO-WS-CTL runs to %d31 (0x1F), not 0x19
    
    $text = '[\x01-\x09\x0B\x0C\x0E-\x7F]'; //%d1-9 / %d11 / %d12 / %d14-127 (NUL is only allowed by obs-text)
    $quotedPair = '\\\\' . $text;
    
    $atext = '[a-zA-Z0-9!#\$%&\'\*\+\-\/=\?\^_`\{\}\|~]';
    $dotAtomText = $atext . '+' . '(?:\.' . $atext . '+)*?'; //non-capturing, so group numbers stay stable when glued
    
    $qtext = '(?:' . $noWsCtl . '|[\x21\x23-\x5B\x5D-\x7E])';
    $noFoldQuote = '"(?:' . $qtext . '|' . $quotedPair . ')*?"';
    
    $dtext = '(?:' . $noWsCtl . '|[\x21-\x5A\x5E-\x7E])';
    $noFoldLiteral = '\[(?:' . $dtext . '|' . $quotedPair . ')*?\]';
    
    $idLeft = '(?:' . $dotAtomText . '|' . $noFoldQuote . ')';
    $idRight = '(?:' . $dotAtomText . '|' . $noFoldLiteral . ')';
    
    $WSP = '[ \t]';
    $CRLF = '\r\n';
    
    $FWS = '(?:(?:' . $WSP . '*' . $CRLF . ')?' . $WSP . '+)'; //[*WSP CRLF] 1*WSP, wrapped so a trailing ? applies to the whole token
    
    $ctext = '(?:' . $noWsCtl . '|[\x21-\x27\x2A-\x5B\x5D-\x7E])';

//AGRRRAAAGGGGHHHH!!!!!
    $comment = '\((?:' . $FWS . '|' . $ccontent. ')*?' . $FWS . '?\)';
    $ccontent = '(?:' . $ctext . '|' . $quotedPair . '|' . $comment . ')';
ccontent refers to comment, and comment refers to ccontent, so I can't see a way to write a regex which matches this :( Any ideas? Maybe I'll just have to be as close as reasonably possible here...

//Hmm... do I remember someone mentioning a 'recurse' flag in PCRE? :idea:

EDIT | I'll mull this over and continue... for now, a handy TODO :P

Code: Select all

//TODO: Make this RFC2822 compliant (support comment nesting -- e.g. add |comment)
    $ccontent = '(?:' . $ctext . '|' . $quotedPair . ')';
Last edited by Chris Corbyn on Wed Jan 09, 2008 5:53 am, edited 1 time in total.
vapoorize
Forum Newbie
Posts: 22
Joined: Mon Dec 17, 2007 5:35 pm

Post by vapoorize »

I was defining tokens in *exactly* the same way the RFC refers to them until this point:
In my quest for validating and parsing some parts of email headers
ccontent refers to comment, and comment refers to ccontent so I can't see a way to write a regex which matches this :( Any ideas? Maybe I'll just have to be as close as reasonably possible here...

Well, it seems you're relying on whoever wants to help you already knowing the RFC you're talking about... I don't feel like reading an RFC :)

Keep it simple. Whatever your application, data is data.
Post some sample data and your expected matches in it, or what you're trying to match, or what you can't match.


//Hmm... do I remember someone mentioning a 'recurse' flag in PCRE? :idea:
If you mean recursive functionality in preg_replace, take a look at example #3 here:
http://us.php.net/preg_replace_callback
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Sorry, I think you've got the wrong end of the stick ;) Effectively all I'm asking is how one would tackle this scenario:

Code: Select all

//comment is (, followed by the definition of commentBody, followed by )
$comment = '\(' . $commentBody . '\)';

//However, commentBody is [a-zA-Z0-9] OR a nested comment, repeated any number of times!!
$commentBody = '([a-zA-Z0-9]|' . $comment . ')*';


$fullPattern = '/^' . $comment . '$/D';
See the problem? One part of the regex is actually the entire regex itself.
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

http://perldoc.perl.org/perlre.html
Perl Doc wrote:(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)
Similar to (??{ code }) except it does not involve compiling any code, instead it treats the contents of a capture buffer as an independent pattern that must match at the current position. Capture buffers contained by the pattern will have the value as determined by the outermost recursion.

PARNO is a sequence of digits (not starting with 0) whose value reflects the paren-number of the capture buffer to recurse to. (?R) recurses to the beginning of the whole pattern. (?0) is an alternate syntax for (?R). If PARNO is preceded by a plus or minus sign then it is assumed to be relative, with negative numbers indicating preceding capture buffers and positive ones following. Thus (?-1) refers to the most recently declared buffer, and (?+1) indicates the next buffer to be declared. Note that the counting for relative recursion differs from that of relative backreferences, in that with recursion unclosed buffers are included.
Thus, imagine we can have parentheses containing any word characters, or another nested set of parentheses, up to any depth of nesting.

i.e.

(abc_123)
(abc_123(cde_456)789)
(abc(xyz(123)89)(foo(bar(zip(button))))test)

Code: Select all

private $_regex;
  
  public function setUp()
  {
    $parenStr = '(\((?:\w|(?1))+?\))';
    $this->_regex = '/^' . $parenStr . '$/';
  }
  
  public function testMatching()
  {
    $this->assertPattern($this->_regex, '(abc_123)');
    $this->assertPattern($this->_regex, '(abc_123(cde_456)789)');
    $this->assertPattern($this->_regex, '(abc(xyz(123)89)(foo(bar(zip(button))))test)');
  }
Seems to work a peach :)
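
For what it's worth, here's a rough sketch of how the same trick could close the ccontent/comment loop from my first post. It's only a sketch: the variable names mirror the ones above, the group name "comment" is just something I picked for illustration, and it ignores all of the obs-* forms in the RFC. Instead of gluing $comment into $ccontent (which can't work), ccontent points back at the named group with (?P>comment).

```php
<?php
// Sketch only: tokens follow RFC 2822 section 3.2, obsolete (obs-*) forms are ignored.
$noWsCtl    = '[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]';
$text       = '[\x01-\x09\x0B\x0C\x0E-\x7F]';
$quotedPair = '\\\\' . $text;
$FWS        = '(?:(?:[ \t]*\r\n)?[ \t]+)';
$ctext      = '(?:' . $noWsCtl . '|[\x21-\x27\x2A-\x5B\x5D-\x7E])';

// ccontent = ctext / quoted-pair / comment
// The third branch recurses into the named group instead of embedding $comment.
$ccontent = '(?:' . $ctext . '|' . $quotedPair . '|(?P>comment))';

// comment = "(" *([FWS] ccontent) [FWS] ")"
// "comment" is a hypothetical group name chosen for this sketch.
$comment = '(?P<comment>\((?:' . $FWS . '?' . $ccontent . ')*' . $FWS . '?\))';

$regex = '/^' . $comment . '$/D';

$samples = array(
    '(a comment)',
    '(nested (comment) here)',
    '(escaped \) paren)',
    '(unbalanced (comment)',
);

foreach ($samples as $sample) {
    // preg_match() returns 1 on a match and 0 otherwise
    echo $sample, ' => ', preg_match($regex, $sample), "\n";
}
```

The first three samples should match and the unbalanced one shouldn't. One catch with named groups: as far as I know PCRE complains about duplicate names by default, so this fragment could only be glued into a bigger pattern once.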

I can't get the relative capturing position to work though which would be more useful in my case (i.e. +0).
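
Looking at the Perl doc quote again, PARNO can't start with 0, so +0 probably isn't a valid reference at all; a relative reference to the group we're currently inside would be (?-1), because unclosed groups are counted. Here's a minimal sketch comparing the three spellings on the nested-parentheses test. I'm assuming a PCRE build recent enough to know the relative syntax, and the group name "paren" is made up for the example.

```php
<?php
// Three spellings of the same recursive pattern.
$absolute = '/^(\((?:\w|(?1))+?\))$/';                 // recurse into group 1 by number
$relative = '/^(\((?:\w|(?-1))+?\))$/';                // recurse into the nearest group opened to the left
$named    = '/^(?P<paren>\((?:\w|(?P>paren))+?\))$/';  // recurse by (hypothetical) group name

$subjects = array(
    '(abc_123)',
    '(abc_123(cde_456)789)',
    '(abc(xyz(123)89)(foo(bar(zip(button))))test)',
);

foreach (array($absolute, $relative, $named) as $regex) {
    foreach ($subjects as $subject) {
        // preg_match() returns 1 on a match and 0 otherwise
        echo $regex, ' vs ', $subject, ' => ', preg_match($regex, $subject), "\n";
    }
}

// The relative form survives being glued in more than once, because (?-1)
// always points at the copy it sits inside rather than at a fixed group number.
$paren = '(\((?:\w|(?-1))+?\))';
$pair  = '/^' . $paren . '-' . $paren . '$/';
echo preg_match($pair, '(a(b))-(c(d)e)'), "\n";
```

The relative form looks like the better fit for gluing tokens together, since an absolute number like (?1) shifts as soon as another capturing group lands in front of the fragment.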