Page 1 of 2

Parsers written in PHP (Regex builder)

Posted: Wed Jun 29, 2005 7:16 am
by Chris Corbyn
Has anybody ever written a *simple* parser in PHP?

Not sure there's a *standard* way to write a parser but I'm working on something very nice which involves parsing written english syntax (almost like SQL) to generate regex patterns.

I know I can do it my own twisty way but it will no doubt end up full of hacks and it might not be as efficient as it could be.

If anybody has any good reference material/links/resources which may help me please let me know.

The syntax I'll be parsing looks something like

Code: Select all

MATCH START OF LINE FOLLOWED BY 'foo' FOLLOWED BY 3 DIGITS FOLLOWED BY OPTIONAL('bar')

Code: Select all

/^foo\d{3}(?:bar)?/
Just general parsing help needed... I know it's advanced stuff.

Thanks,

d11

Posted: Wed Jun 29, 2005 7:54 am
by Syranide
My tip is using the common (also used in PHP) way of doing it...

Take your code, split it into different groups, as for your example you could have T_OPER and T_STRING. And then you split your entire string by spaces (of course not quoted) and divide into T_OPER (MAYBE or HAS etc) and T_STRING ('something').

They should have no error-checking or so what so ever, T_OPER shouldn't care if the word exists or is valid, that comes later...
So now you have an array like
0 => T_OPER, "MAYBE"
1 => T_OPER, "SAY"
2 => T_STRING, "'hello'" // having the quotes left is good for escaping

Then you make another pass, which you can treat what you like, either compile into a faster "language" or just translate as it comes...
However, remember that this step is only for a splitup to make it easier to read, you should not start treating things differently such as commandcalls and so on, eg ['3'] should be parsed as T_LBRACK T_STRING T_RBRACK (not as T_BRACKSTRING or so).
(I believe the basic thought behind the parsing groups should be (to some extent), no group should in any way include (parts of) another group.)

Meaning, some kind of state machine.

like, MAYBE can lead to SAY or NOTHING
SAY must lead to a T_STRING... etc etc, and fail if you didn't get what you expected.
(and each path will lead to a specific result (part of the final))

Very brief but I think you understand my point.
This isn't very hard at all and will prove to be very rigid against mistakes and problems.

EDIT: commonly, when parsing the file, you are supposed to be able to recreate an EXACT copy of that file again by simply concatinating all strings... but however, that can prove to be less useful and more troublesome as you would have to take care of whitespace and some troublesome cases when executing your code.

Posted: Wed Jun 29, 2005 8:33 am
by Chris Corbyn
Brilliant... thank you thank you thank you.
Keep them coming :D

Strangely enough I had started with this and was *kind of* heading down that road....

Code: Select all

function Write($Query) {
		
	$this->Query = $Query;
	$this->Pattern = '';
		
	if (!empty($this->Query)) {
		preg_match_all('/(?:\([^\)]+\))|(?:\'[^\']+\')+|[^ ]+/', $this->Query, $m);
		$this->sequence = $m;
	} else {
		$this->output_error(1);
	} //End if
		
} //End Write()
And...

Code: Select all

$SmartExp = new SmartExp;
$SmartExp->Write("MATCH START OF LINE FOLLOWED BY 'foo bar' FOLLOWED BY (3 NUMBERS) AS N");

//Output
echo '<pre>';
print_r($SmartExp->sequence);
echo '</pre>';

/*
[sequence] => Array
    (
        [0] => Array
            (
                [0] => MATCH
                [1] => START
                [2] => OF
                [3] => LINE
                [4] => FOLLOWED
                [5] => BY
                [6] => 'foo bar'
                [7] => FOLLOWED
                [8] => BY
                [9] => (3 NUMBERS)
                [10] => AS
                [11] => N
            )

    )
*/
So if I keep delving deeper towards your pointers I'm on the right track at least :P

Thanks again

Posted: Wed Jun 29, 2005 9:13 am
by Syranide
yupp seem to be doing very good, but (as you proably already know), assigning each entry a class will simplify and speed up the process.

please do post on any success, always interesting to hear about odd projects.
I thought of making a univerisal customizable language for PHP which could be used to pretty much anything, been put on ice for the moment though, might try to revive it some day (playing with 3D is much more fun ;))

EDIT: btw, don't forget to catch escaped characters in string too, could be nice to have ;) ...

Posted: Wed Jun 29, 2005 9:30 am
by Chris Corbyn
Syranide wrote:yupp seem to be doing very good, but (as you proably already know), assigning each entry a class will simplify and speed up the process.
Yeah I've started to break it down into more layers already, though not into separate classes... (I was *hoping* to be able to do this in one class to make it easier for beginners to implement). I'll run some benchmark tests.. it's not super-complex syntax and "queries" or whatever the technical term would be for this thing wont often get stupidly long so it'll likely be neglible.
Syranide wrote: please do post on any success, always interesting to hear about odd projects.
I thought of making a univerisal customizable language for PHP which could be used to pretty much anything, been put on ice for the moment though, might try to revive it some day (playing with 3D is much more fun ;))
I'll be posting *if successful*, a GPL version of the class and I would more than expect it to be snapped up and used by plenty of people in things such as validation... you can basically create a regex object in a really simple way and directly use that in any of the preg_ functions (also built into the class for convenience). It may take some time before I put up a working code (even a beta). Thanks for the support.
Syranide wrote: EDIT: btw, don't forget to catch escaped characters in string too, could be nice to have ;) ...
Yep, I'm trying to make this as user-friendly and clean-cut as I can ;)

Posted: Wed Jun 29, 2005 9:35 am
by Syranide
hehe what I meant with classes was not "class" but more "classification" (like T_OPER), this is very good for speeding up, and most of all, making the interpretation and validation extremely much easier.

EDIT: yeah token is probably a better word (or the correct word)

Posted: Wed Jun 29, 2005 9:44 am
by Chris Corbyn
Syranide wrote:hehe what I meant with classes was not "class" but more "classification" (like T_OPER), this is very good for speeding up, and most of all, making the interpretation and validation extremely much easier.
Ah I got ya...

I call them tokens types but that's probably wrong on my part LOL :P

Well yes... in order to make the whole lot easier to work with there will be more than simply T_STRING, T_AS etc etc.... going through the sequence will be much simpler with less room for error if things are defined as specifically as possible.

Thanks again,

d11

EDIT | I was actually just looking through http://us4.php.net/tokens for ideas on token types.

Posted: Fri Jul 01, 2005 7:52 am
by Chris Corbyn
Just to let you know, here's what I came up with (it goes onto to check syntax errors etc and give very useful debug errors (I even impressed myself with the debug output :lol:)...

Code: Select all

/* Take written English query to build Perl compatible RegExp */
	function Write($Query) {
		
		$this->Query = $Query;
		$this->Pattern = '';
		
		if (!empty($this->Query)) {
			$this->parse($this->Query);
		} else {
			 $this->output_error(1);
			 return false;
		} //End if
		
		return true;
		
	} //End Write
	
	function parse($string) {
		
		/* First pass - tokenize and assign to token types */
		//preg_match_all('/(?:\'.+?\')|(?:\".+?\")|[\W]|\w+/', $string, $m);
		preg_match_all('/(?:(?<!\\\\)\'.+?(?<!\\\\)\')|(?:(?<!\\\\)\".+?(?<!\\\\)\")|[\W]|\w+/', $string, $m); //Backslash used for escaping
		$token_temp = $m[0];
		
		foreach ($token_temp as $i => $token) {
			$token = trim($token);
			if (!preg_match('/^\s*$/', $token)) { //Drop empty or whitespace
				$this->sequence[] = $token;
			}
		} //End foreach
		
		$this->tokenize(); //Assign a token type to each token
		
		/* Second pass - check for errors in syntax */
		if (!$this->check_parse_errors()) return false;
		if (!$this->get_query_type()) return false;
		
		return true;
		
	} //End parse
	
	function tokenize() {
		
		foreach ($this->sequence as $i => $token) {
			if (array_key_exists(strtolower($token), $this->types)) {
				$this->token_sequence[$i] = 'T_REGEX_TYPE';
			} elseif (preg_match('/^([\'\"]).*?\\1$/', $token)) {
				$this->token_sequence[$i] = 'T_STRING';
			} elseif (preg_match('/^\(.+?\)$/', $token)) {
				$this->token_sequence[$i] = 'T_GROUPED'; //Temporary - we want NONE of these when fully parsed
			} elseif ($token == '(') {
				$this->token_sequence[$i] = 'T_PAREN_OPEN';
			} elseif ($token == ')') {
				$this->token_sequence[$i] = 'T_PAREN_CLOSE';
			} elseif (preg_match('/(?:^\(.+)/', $token)) {
				$this->token_sequence[$i] = 'T_GROUP_START'; //Temporary - we want NONE of these when fully parsed
			} elseif (preg_match('/(?:.+\))$/', $token)) {
				$this->token_sequence[$i] = 'T_GROUP_END'; //Temporary - we want NONE of these when fully parsed
			} elseif (array_key_exists(strtolower($token), $this->metachars)) {
				$this->token_sequence[$i] = 'T_METACHAR';
			} elseif (preg_match('/^\d+$/', $token)) {
				$this->token_sequence[$i] = 'T_INTEGER';
			} elseif (in_array(strtolower($token), $this->operators)) {
				$this->token_sequence[$i] = 'T_OPER';
			} elseif (array_key_exists(strtolower($token), $this->assertions)) {
				$this->token_sequence[$i] = 'T_ASSERT';
			} elseif (in_array(strtolower($token), $this->objects)) {
				$this->token_sequence[$i] = 'T_OBJECT';
			} elseif (in_array(strtolower($token), $this->keywords)) {
				$this->token_sequence[$i] = 'T_KEYWORD';
			} elseif (in_array(strtolower($token), $this->functions)) {
				$this->token_sequence[$i] = 'T_FUNCTION';
			} elseif (preg_match('/^\w+$/', $token)) {
				$this->token_sequence[$i] = 'T_VAR';
			} else {
				$this->token_sequence[$i] = 'T_UNKNOWN'; //Bad....  used for parse error checking!
			}
		} //End foreach
		
		return true;
		
	} //End tokenize
Those bits basically do this....

Code: Select all

$SmartExp = new SmartExp;
$regex_query = "
	MATCH
	START OF LINE
	FOLLOWED BY
	\"foo bar yay\"
	FOLLOWED BY (
	3 NUMBERS
	AND
	OPTIONAL (
	4 N
	FOLLOWED BY
	'another \'string\'')
	) AS N
	FOLLOWED BY
	3 LETTERS
";

$SmartExp->Write($regex_query);

//Which sets the relevant parts of the object up as so....

/*
    [sequence] => Array
        (
            [0] => MATCH
            [1] => START
            [2] => OF
            [3] => LINE
            [4] => FOLLOWED
            [5] => BY
            [6] => "foo bar yay"
            [7] => FOLLOWED
            [8] => BY
            [9] => (
            [10] => 3
            [11] => NUMBERS
            [12] => AND
            [13] => OPTIONAL
            [14] => (
            [15] => 4
            [16] => N
            [17] => FOLLOWED
            [18] => BY
            [19] => 'another \'string\''
            [20] => )
            [21] => )
            [22] => AS
            [23] => N
            [24] => FOLLOWED
            [25] => BY
            [26] => 3
            [27] => LETTERS
        )

    [token_sequence] => Array
        (
            [0] => T_REGEX_TYPE
            [1] => T_ASSERT
            [2] => T_OPER
            [3] => T_OBJECT
            [4] => T_KEYWORD
            [5] => T_OPER
            [6] => T_STRING
            [7] => T_KEYWORD
            [8] => T_OPER
            [9] => T_PAREN_OPEN
            [10] => T_INTEGER
            [11] => T_METACHAR
            [12] => T_OPER
            [13] => T_FUNCTION
            [14] => T_PAREN_OPEN
            [15] => T_INTEGER
            [16] => T_VAR
            [17] => T_KEYWORD
            [18] => T_OPER
            [19] => T_STRING
            [20] => T_PAREN_CLOSE
            [21] => T_PAREN_CLOSE
            [22] => T_OPER
            [23] => T_VAR
            [24] => T_KEYWORD
            [25] => T_OPER
            [26] => T_INTEGER
            [27] => T_METACHAR
        )
*/
That's the parsing/tokenizing itself out of the way subject to further testing which enables me to work through the building of the regex pattern is a highly organised manner.

Thanks for the advice, it's much appreciated :D

Posted: Fri Jul 01, 2005 8:06 am
by Syranide
Great to hear, looking good btw :D

Hehe, yeah, debugging becomes a real piece of cake when using this solution (commonly if one wants to know position or so you need another array for it, this is why it is good to be able to reproduce the code from tokens as you can easily step through them, as you can easily say "near FOLLOWED BY" etc, but at the same time, it makes executing/translating much more tedious)

Posted: Fri Jul 01, 2005 8:18 am
by Chris Corbyn
Syranide wrote:Great to hear, looking good btw :D

Hehe, yeah, debugging becomes a real piece of cake when using this solution (commonly if one wants to know position or so you need another array for it, this is why it is good to be able to reproduce the code from tokens as you can easily step through them, as you can easily say "near FOLLOWED BY" etc, but at the same time, it makes executing/translating much more tedious)
Yes the debug errors are very SQL-esque...

Example... if I mismatch the parens it does this (like you say, its just so easy when you have it all tokenized)..
SmartExp error 5: Parsing failed near `) AS N FOLLOWED BY 3 LETTERS '. Unmatched parentheses offset in query

Posted: Sat Jul 02, 2005 3:45 pm
by Syranide
Sweeeet, does it produce any sensible result yet? :D

That debugging error is looking sweet, can't wait to do get my hands on the code and let it interpret all my filthy dirty errors ;) oooh... give it to me you filthy debugger!

I'm guessing you are doing a "human language" regex creator, how does it work out with the language? As with conditions and such could get pretty tedious to read when they are split far apart. (however, I guess common regex is worse though)

Posted: Sat Jul 02, 2005 4:04 pm
by Chris Corbyn
Syranide wrote:Sweeeet, does it produce any sensible result yet? :D
Nothing that I would be proud to show but I'm heading there. Basically put's it's "interpreted" output into an array which finally gets imploded when all done. That keeps everything tidy when working through the tokens and allows things to be shuffled about too.
Syranide wrote: That debugging error is looking sweet, can't wait to do get my hands on the code and let it interpret all my filthy dirty errors ;) oooh... give it to me you filthy debugger!
Thanks, might be a little while before I release the code (I'd estimate at least 1-2 months for a beta testing version).
Syranide wrote: I'm guessing you are doing a "human language" regex creator,...
Correct :D
Syranide wrote:... how does it work out with the language? As with conditions and such could get pretty tedious to read when they are split far apart. (however, I guess common regex is worse though)
It'll be very basic in version 1 since I'm an eager beaver and want to see if it's something that has potential to be used. But it will have plenty of functionality. You're right in thinking queries could get tediosuly long but it will also be more readable (and logical to regex haters) than stardard regex patterns.

Eventually I'll be providing a shorthand syntax option to trim things down (without starting to resmeble regex themselves).

Queries (it still doesn't sound right to call them that :?) would be best split over multiple lines to make them more readable but the parser doesn't care either way.

I'm open to input if anybody wishes to throw in their 2 cents with regards to user friendliness etc... I've yet to 100% define the SmartExp syntax (but it certainly will be issued much like mysql queries are) so comments on friendly syntax will be taken on board very gratefully (whether you know regex or not, the point of this project is to eliminate that requirement) ;)

Posted: Sat Jul 02, 2005 5:15 pm
by Syranide
d11wtq wrote:Queries (it still doesn't sound right to call them that :?) would be best split over multiple lines to make them more readable but the parser doesn't care either way.
Yeah I kinda forgot that it was possible to do, because that would really add to the functionality.

As you could have

Code: Select all

GET A FROM B
    IF A == C THEN SAY &quote;A&quote;
        NOW DO THIS AND THIS
    IF B == C THEN SAY &quote;C&quote;
for instance (but regex-style of course) (as this is more or less the way regex works when split over many lines)

a little option which could be good to include if you ask me would be some kind of template which you can insert, (custom of course) for instance you want to check if the next characters are "HELLO" or "WORLD", then you could have a tag instead e.g. "[HW]" which would be replaced "HW" => "'HELLO' OR 'WORLD'" so that you could easily reuse or make it more easily read. Also easily having "ISNAME" => "... something neat ..." to make it more readable.

In the way that you could make even more human readable and shorter regex by allowing the user to specify such tags by the use of an array or such. (Often you want to check if it is a quote, or an escaped quote, having to copy that 5 times makes you frustrated)

Just an experimental thought, which likely can have some areas which would prove difficult to solve properly. But, that would certainly be a feature I would like.

However, this could also be done by extending the class which would perhaps be more adviseable, in other words, having your regex class, extending it to take extended regex with tags that would be taken out and replaced with their respective content and then parsed. (as this would also remove certain troublesome functionality from the core)

As usual, just a thought ;)

Posted: Sat Jul 02, 2005 5:30 pm
by Chris Corbyn
If with the HW thingy you mean like aliasing "Hello World" to HW I already have it covered using AS (That's what the T_VAR token type is for ;))

Code: Select all

MATCH
3 NUMBERS AS foo
FOLLOWED BY
'xxx'
FOLLOWED BY foo

Code: Select all

/(\d{3})xxx\\1/
Hmm... I need to rethink how "followed by" is interpreted since to me it implies a positive lookahead rather than next part of string. Especially since I'm having "PRECEDED BY" and "NOT PRECEDED BY" for lookbehinds. Maybe the word "then" should imply the next part of the string is about to follow. Bah I dunno. Aha, just comma separating makes more sense there!

I don't want the core to get too fancy other than the fact it *should* theoretically create any regex even if the input is long and cumbersome. I could then start to write (and I'd be over the moon if others started to write) nice little extensions for it.

Cheers :)

EDIT | Updated thread title since it's not so *general* of a discussion anymore :)

Posted: Sat Jul 02, 2005 5:38 pm
by Syranide
We will most certainly be looking forward to this! :D
I'm really looking forward to see what becomes of this, not too commonly you see such developements for PHP as most as pretty straight forward and obvious (wee, I'm gonna do a login script (done more times than the number of inhabitants on earth)). As this really has great potential too as regex is really power, but so hard to understand at first ;)

One nifty feature that could perhaps be appreciated is automatically caching the produced regex to some other media (I myself use this for a lengthy process I'm doing, then saving the result to an include-file along with the crc32 for the input to find which one it is)... and thus allowing people integrate it into applications, rather than having to compile it and replace each time. CRC32:ing strings and checking if they exist (is precompiled) is actually really fast.

Of course being an extension ontop of the class (easily implemented).

Syranide says: "just a thought!" ;)