Custom template variable goes too far into string
Posted: Sat Nov 22, 2008 8:26 am
by the_d00d
Hello, I am in the incredibly early stages of writing a quick template language, just for a quick project, and have run into a bit of a problem.
I want to match tags such as:
Code: Select all
<foobar>, <forbar,bar,foo> and <foobar,foo="<foo>",bar,foobar>
And the regex I have for this is:
Code: Select all
<(?:[a-z0-9_]*?)(?:\,(?:[a-z0-9_]*?)(?:\=(?:.*?))?)*?>
The problem is that, in the third example above, the regex stops at the closing greater-than sign of <foo>. So it only finds: <foobar,foo="<foo>
I have been scratching my head over this for a couple days and would appreciate some help. Also, I am looking to maybe speed this regex up a little - anyone know where I should start? i.e. which bits I should look into optimising?
Thanks!
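A minimal reproduction of the behaviour described above, assuming PHP and preg_match_all (the variable names are only illustrative). The lazy .*? after the equals sign gives up as soon as a ">" can follow, which is the one that closes <foo>:

```php
<?php
// The three example tags from the post, on one line.
$text  = '<foobar>, <forbar,bar,foo> and <foobar,foo="<foo>",bar,foobar>';
// The regex exactly as posted, wrapped in "/" delimiters.
$regex = '/<(?:[a-z0-9_]*?)(?:\,(?:[a-z0-9_]*?)(?:\=(?:.*?))?)*?>/';

preg_match_all($regex, $text, $matches);
print_r($matches[0]);
// The third match is cut short at the ">" of <foo>:
// <foobar,foo="<foo>
```

The first two tags match in full; only the tag with a quoted ">" is truncated.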
Re: Custom template variable goes too far into string
Posted: Tue Nov 25, 2008 1:40 pm
by the_d00d
Anyone have any ideas?

Re: Custom template variable goes too far into string
Posted: Wed Nov 26, 2008 3:27 am
by mintedjo
Can you post a bigger chunk of the template language?

Re: Custom template variable goes too far into string
Posted: Wed Nov 26, 2008 10:38 am
by the_d00d
mintedjo wrote:Can you post a bigger chunk of the template language?

No, it's full of bugs and in a very raw state. This is one of those bugs ):
Re: Custom template variable goes too far into string
Posted: Wed Nov 26, 2008 10:39 am
by mintedjo
Haha
Well, if you want help with the regex it would be better if I could see a larger and more varied chunk of the code, so I know exactly what it has to deal with. xD
Re: Custom template variable goes too far into string
Posted: Thu Nov 27, 2008 3:04 am
by the_d00d
mintedjo wrote:Haha
Well, if you want help with the regex it would be better if I could see a larger and more varied chunk of the code, so I know exactly what it has to deal with. xD
Umm, not quite sure what you mean, but okay.
Code: Select all
<name,type="array",value="Paul",value="George",value="John">
<foreach:name,key,value>
<p>Name is <value> with a key index of <key>.</p>
</foreach>
That would output:
Code: Select all
<p>Name is Paul with a key index of 0.</p>
<p>Name is George with a key index of 1.</p>
<p>Name is John with a key index of 2.</p>
My regex used above works fine for everything here (with the exception of the <foreach:> tag, which uses a slightly modified version). The problem is a scenario such as this:
Code: Select all
<name,value="Paul">
<fullname,value="<name> Smith">
<fullname>
For the <fullname> definition, the regex would match only <fullname,value="<name> because it finds the first greater-than character and ends. What I need is for the regex to ignore greater-than characters when they appear between the quotes.
Does that give you more understanding?
Re: Custom template variable goes too far into string
Posted: Thu Nov 27, 2008 4:21 am
by mintedjo
Hi,
I can't do it!
Code: Select all
<(?:[a-z0-9_]*?)(?:\,([a-z0-9_]*?)(?:\=(?:(?:[^<>]|(?:<[^<>]*>))*))?)*?>
That seems to work for all the examples you gave... but it's horrible.
If you're lucky, a guy called prometheuzz might see your post and contribute.
He's some kind of regex wizard.
Re: Custom template variable goes too far into string
Posted: Thu Nov 27, 2008 4:27 am
by the_d00d
mintedjo wrote:That seems to work for all the examples you gave... but it's horrible.
Yes, it does make one's eyes hurt
mintedjo wrote:If you're lucky, a guy called prometheuzz might see your post and contribute.
He's some kind of regex wizard.
Cool. I hope he, or some other regex guru, sees this plea for help!
Re: Custom template variable goes too far into string
Posted: Mon Dec 01, 2008 2:58 pm
by prometheuzz
Okay, here's a possible way to tackle this:
Code: Select all
<?php
$text = <<< BLOCK
<foobar> text <forbar,bar,foo> and <foobar,foo="<foo>",bar,foobar>
<name,value="Paul"> more noise <fullname,value="<name> Smith"> foo
<fullname> and an ultimate test:
<fullname,value="<name<nested<more-nesting!!!>>> Smith"> ok, done.
BLOCK;
$regex = '/<(?:[^>]|.(?!(?:[^"]*"[^"]*")*[^"]*$))*>/';
if (preg_match_all($regex, $text, $matches)) {
    print_r($matches);
}
/* Output:
Array
(
    [0] => Array
        (
            [0] => <foobar>
            [1] => <forbar,bar,foo>
            [2] => <foobar,foo="<foo>",bar,foobar>
            [3] => <name,value="Paul">
            [4] => <fullname,value="<name> Smith">
            [5] => <fullname>
            [6] => <fullname,value="<name<nested<more-nesting!!!>>> Smith">
        )

)
*/
?>
The regex itself is amazingly simple (the logic, that is). Let me explain:
Code: Select all
$regex = '/
  <                                # match a "<"
  (?:                              # open non-capturing group 1
    [^>]                           # match any character except ">"
    |                              # OR
    .(?!(?:[^"]*"[^"]*")*[^"]*$)   # any character that is not followed by an even number of double quotes, i.e. one inside a quoted string
  )                                # close non-capturing group 1
  *                                # match group 1 zero or more times
  >                                # match a ">"
/x';
Note that you can just copy and paste this in your code: the 'x' modifier will ignore white spaces and the comments in your regex-string.
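To make the quote-counting idea concrete, here is a small sketch that applies the same test by hand: a character counts as "inside quotes" when an odd number of double quotes follows it. The helper name is_inside_quotes is hypothetical, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: true when the character at $pos is followed by an odd
// number of double quotes, which is the condition the look-ahead checks for.
function is_inside_quotes(string $text, int $pos): bool
{
    // Count the double quotes between the character after $pos and the end.
    $quotesAhead = substr_count($text, '"', $pos + 1);
    return $quotesAhead % 2 === 1;
}

$tag   = '<fullname,value="<name> Smith">';
$inner = strpos($tag, '>');   // the ">" that closes <name>
$outer = strrpos($tag, '>');  // the ">" that closes the whole tag

var_dump(is_inside_quotes($tag, $inner)); // bool(true): one quote follows
var_dump(is_inside_quotes($tag, $outer)); // bool(false): no quotes follow
```

Only the outer ">" is allowed to end the match, which is why the nested <name> no longer cuts the tag short.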
But... there are some drawbacks (of course). When you're working with very large strings, it might slow down because of all the looking ahead: the look-ahead can scan to the end of the string for every character it tests. And also, I must stress that regexes do NOT make good parsers. It really looks like you need a decent parser for this.
My 2 cents.
HTH
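For comparison, a rough sketch of what the "decent parser" route could look like for just this one job: a single pass over the text that remembers whether it is currently between double quotes. The function name find_tags is hypothetical, escaped quotes are not handled, and unlike the regex it only tracks quotes inside a tag rather than across the whole string.

```php
<?php
// Hypothetical scanner: collects <...> tags in one pass and treats everything
// between a pair of double quotes inside a tag as opaque.
function find_tags(string $text): array
{
    $tags     = [];
    $start    = null;  // offset of the "<" that opened the current tag
    $inQuotes = false; // true while between a pair of double quotes

    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];
        if ($start === null) {
            // Outside any tag: only a "<" is interesting.
            if ($char === '<') {
                $start    = $i;
                $inQuotes = false;
            }
        } elseif ($char === '"') {
            $inQuotes = !$inQuotes;
        } elseif ($char === '>' && !$inQuotes) {
            // An unquoted ">" closes the tag.
            $tags[] = substr($text, $start, $i - $start + 1);
            $start  = null;
        }
    }
    return $tags;
}

$text = '<name,value="Paul"> noise <fullname,value="<name> Smith"> foo <fullname>';
print_r(find_tags($text));
```

Each character is visited once, so the cost grows linearly with the input instead of depending on repeated look-aheads.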
Re: Custom template variable goes too far into string
Posted: Mon Dec 01, 2008 3:07 pm
by prometheuzz
Let me explain just a bit more about that tricky look-ahead voodoo (the .(?!(?:[^"]*"[^"]*")*[^"]*$) part):
Code: Select all
.(?! # match any character, that when looking ahead does not have (negative look ahead):
(?: # open non-capturing group 2
[^"]* # matches zero or more characters of any type except double quotes
" # matches a double quote
[^"]* # matches zero or more characters of any type except double quotes
" # matches a double quote
) # close non-capturing group 2
* # group 2, zero or more times
[^"]* # matches zero or more characters of any type except double quotes
$ # matches the end of the string
) # end look ahead
So, in plain English this would read:
"match any character (so also '<' and '>'!) that does not have an even number of double quotes when looking ahead towards the end of the string".
Come to think of it, this will also work on the examples you posted:
But again, building a parser with only regexes will be messy (depending, of course, on the language you're parsing). But I guess you don't have any escapes inside your strings like this:
Code: Select all
<fullname,value="<name> \" Smith">
Because that would mean adjusting the regexes I posted and they will start to look like monsters that need to be replaced by a... parser! (you guessed it).
; )
HTH
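For what it's worth, here is one possible sketch of such an adjustment, offered as an assumption rather than a tested solution for the real template language. Instead of looking ahead, it consumes each quoted string as a unit and lets a backslash escape the next character inside the quotes.

```php
<?php
// Outside quotes: any character except '"' and '>'.
// Inside quotes: either a character that is not '"' or a backslash,
// or a backslash followed by any character (an escape such as \").
$regex = '/<(?:[^">]|"(?:[^"\\\\]|\\\\.)*")*>/s';

$text = 'a <fullname,value="<name> \" Smith"> b <fullname> c';
preg_match_all($regex, $text, $matches);
print_r($matches[0]);
```

With this input the matches should be the full <fullname,value="..."> tag, escaped quote included, followed by <fullname>. A tag with an unbalanced quote simply fails to match.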
Re: Custom template variable goes too far into string
Posted: Wed Dec 03, 2008 10:09 am
by mintedjo
Told ya... the guy's like a regex ninja!
Re: Custom template variable goes too far into string
Posted: Wed Dec 03, 2008 10:23 am
by prometheuzz
mintedjo wrote:Told ya... the guy's like a regex ninja!
I'm perhaps a brown belt, GeertDD is the black belt around here!
; )
Re: Custom template variable goes too far into string
Posted: Sat Dec 06, 2008 11:37 am
by the_d00d
prometheuzz wrote:...
Wow. Pretty intense. Thanks!
One problem I initially see is that I would have to do additional parsing once the regex finds everything. My initial regex broke everything up into nice manageable pieces, which is why I thought I might just be able to amend the end of the initial regex.

Re: Custom template variable goes too far into string
Posted: Sat Dec 06, 2008 11:47 am
by prometheuzz
the_d00d wrote:prometheuzz wrote:...
Wow. Pretty intense. Thanks!
No problem.
the_d00d wrote:One problem I initially see is that I would have to do additional parsing once the regex finds everything. My initial regex broke everything up into nice manageable pieces, which is why I thought I might just be able to amend the end of the initial regex.

Not exactly sure what you mean by that, but you could adjust your original regex with a part of my regex so that it will match the examples you posted:
Code: Select all
<(?:[a-z0-9_]*?)(?:\,(?:[a-z0-9_]*?)(?:\="[^"]*")?)*?>
Out of curiosity, are you implementing this in PHP or some other language?
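On the "manageable pieces" point, a hedged sketch of one way to get them, assuming PHP: a repeated capturing group only keeps its last iteration, so this uses two passes. The first captures the tag name and the raw attribute list, the second splits that list. The function name parse_tags is hypothetical, and the patterns are a variation on the regex above (capturing groups added, and + used in place of *? so that names cannot be empty).

```php
<?php
// Hypothetical helper: returns each tag as a name plus a list of
// [attribute, value] pairs; value is null when the attribute has none.
function parse_tags(string $text): array
{
    // Pass 1: tag name in group 1, raw attribute list in group 2.
    $tagRegex  = '/<([a-z0-9_]+)((?:,[a-z0-9_]+(?:="[^"]*")?)*)>/';
    // Pass 2: one attribute per match, with an optional quoted value.
    $attrRegex = '/,([a-z0-9_]+)(?:="([^"]*)")?/';

    $result = [];
    preg_match_all($tagRegex, $text, $tags, PREG_SET_ORDER);
    foreach ($tags as $tag) {
        $attrs = [];
        preg_match_all($attrRegex, $tag[2], $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            // Group 2 is absent from the match when there is no ="...".
            $attrs[] = [$pair[1], $pair[2] ?? null];
        }
        $result[] = ['name' => $tag[1], 'attrs' => $attrs];
    }
    return $result;
}

print_r(parse_tags('<fullname,value="<name> Smith",bar>'));
```

Attributes are kept as a list of pairs rather than a keyed array because names such as value can repeat within one tag.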
Re: Custom template variable goes too far into string
Posted: Sat Dec 06, 2008 2:53 pm
by the_d00d
prometheuzz wrote:Code: Select all
<(?:[a-z0-9_]*?)(?:\,(?:[a-z0-9_]*?)(?:\="[^"]*")?)*?>
Sweet! I'm going to have to try that out tomorrow!
prometheuzz wrote:Out of curiosity, are you implementing this in PHP or some other language?
Yes - I am making the application in PHP.