Custom template variable goes too far into string

the_d00d
Forum Newbie
Posts: 8
Joined: Sat Nov 22, 2008 8:18 am

Custom template variable goes too far into string

Post by the_d00d »

Hello, I am in the incredibly early stages of writing a quick template language, just for a quick project, and ran into a bit of a problem.

I want to match tags such as:

Code: Select all

<foobar>, <forbar,bar,foo> and <foobar,foo="<foo>",bar,foobar>
And the regex I have for this is:

Code: Select all

<(?:[a-z0-9_]*?)(?:\,(?:[a-z0-9_]*?)(?:\=(?:.*?))?)*?>
The problem is that, in the third example above, the regex stops at the closing greater-than sign of the nested <foo>. So it only finds: <foobar,foo="<foo>
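To make the failure easy to reproduce, here is a minimal sketch, assuming PHP and preg_match (the variable names are only for illustration). The lazy .*? after the "=" grows one character at a time until the rest of the pattern can match, and the first ">" it meets is enough:

```php
<?php
// The pattern from the post, wrapped in "/" delimiters.
$original = '/<(?:[a-z0-9_]*?)(?:\,(?:[a-z0-9_]*?)(?:\=(?:.*?))?)*?>/';
$subject  = '<foobar,foo="<foo>",bar,foobar>';

// The lazy ".*?" stops as soon as a ">" lets the whole pattern succeed,
// so the match ends inside the quoted value.
preg_match($original, $subject, $m);
echo $m[0], PHP_EOL;   // prints: <foobar,foo="<foo>
```

Making the quantifier greedy would not help either, because it would then run on to the last ">" in the whole text instead of the end of this tag.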

I have been scratching my head over this for a couple days and would appreciate some help. Also, I am looking to maybe speed this regex up a little - anyone know where I should start? i.e. which bits I should look into optimising?

Thanks!
the_d00d
Forum Newbie
Posts: 8
Joined: Sat Nov 22, 2008 8:18 am

Re: Custom template variable goes too far into string

Post by the_d00d »

Anyone have any ideas? :oops:
mintedjo
Forum Contributor
Posts: 153
Joined: Wed Nov 19, 2008 6:23 am

Re: Custom template variable goes too far into string

Post by mintedjo »

Can you post a bigger chunk of the template language? :-)
the_d00d
Forum Newbie
Posts: 8
Joined: Sat Nov 22, 2008 8:18 am

Re: Custom template variable goes too far into string

Post by the_d00d »

mintedjo wrote:Can you post a bigger chunk of the template language? :-)
No, it's full of bugs and in a very raw state. One of the bugs being this one ):
mintedjo
Forum Contributor
Posts: 153
Joined: Wed Nov 19, 2008 6:23 am

Re: Custom template variable goes too far into string

Post by mintedjo »

Haha

Well, if you want help with the regex it would be better if I could see a larger and more varied chunk of the code so I know exactly what it has to deal with. xD
the_d00d
Forum Newbie
Posts: 8
Joined: Sat Nov 22, 2008 8:18 am

Re: Custom template variable goes too far into string

Post by the_d00d »

mintedjo wrote:Haha

Well, if you want help with the regex it would be better if I could see a larger and more varied chunk of the code so I know exactly what it has to deal with. xD
Umm, not quite sure what you mean, but okay.

Code: Select all

<name,type="array",value="Paul",value="George",value="John">
<foreach:name,key,value>
    <p>Name is <value> with a key index of <key>.</p>
</foreach>
That would output:

Code: Select all

<p>Name is Paul with a key index of 0.</p>
<p>Name is George with a key index of 1.</p>
<p>Name is John with a key index of 2.</p>
My regex used above works fine for everything here (with the exception of the <foreach:> tag, which uses a slightly modified version). The problem is a scenario such as this:

Code: Select all

<name,value="Paul">
<fullname,value="<name> Smith">
<fullname>
The result the regex gives for the second <fullname> tag, the one with the value, would be:

Code: Select all

<fullname,value="<name>
The regex finds the first greater-than character and stops. What I need is for the regex to ignore greater-than characters that appear between the quotes.

Does that give you more understanding?
mintedjo
Forum Contributor
Posts: 153
Joined: Wed Nov 19, 2008 6:23 am

Re: Custom template variable goes too far into string

Post by mintedjo »

Hi,

I can't do it!

Code: Select all

<(?:[a-z0-9_]*?)(?:\,([a-z0-9_]*?)(?:\=(?:(?:[^<>]|(?:<[^<>]*>))*))?)*?>
That seems to work for all examples you gave... but it's horrible.
If you're lucky a guy called prometheuzz might see your post and contribute.
He's some kind of regex wizard.
the_d00d
Forum Newbie
Posts: 8
Joined: Sat Nov 22, 2008 8:18 am

Re: Custom template variable goes too far into string

Post by the_d00d »

mintedjo wrote:That seems to work for all examples you gave... but it's horrible.
Yes, it does make one's eyes hurt :(
mintedjo wrote:If you're lucky a guy called prometheuzz might see your post and contribute.
He's some kind of regex wizard.
Cool. I hope he, or some other regex guru, sees this plea for help!
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Custom template variable goes too far into string

Post by prometheuzz »

Okay, here's a possible way to tackle this:

Code: Select all

<?php
$text = <<< BLOCK
<foobar> text <forbar,bar,foo>  and  <foobar,foo="<foo>",bar,foobar>
<name,value="Paul">  more noise <fullname,value="<name> Smith"> foo 
<fullname> and an ultimate test: 
<fullname,value="<name<nested<more-nesting!!!>>> Smith"> ok, done.
BLOCK;
 
$regex = '/<(?:[^>]|.(?!(?:[^"]*"[^"]*")*[^"]*$))*>/';
 
if(preg_match_all($regex, $text, $matches)) {
  print_r($matches);
}
 
/* Output:
 
Array
(
    [0] => Array
        (
            [0] => <foobar>
            [1] => <forbar,bar,foo>
            [2] => <foobar,foo="<foo>",bar,foobar>
            [3] => <name,value="Paul">
            [4] => <fullname,value="<name> Smith">
            [5] => <fullname>
            [6] => <fullname,value="<name<nested<more-nesting!!!>>> Smith">
        )
 
)
 
*/
 
?>
The regex itself is amazingly simple (the logic, that is). Let me explain:

Code: Select all

$regex = '/
  <                                      # match a "<"
    (?:                                  #   open non-capturing group 1   
      [^>]                               #     match any character except ">"
      |                                  #     OR
      .(?!(?:[^"]*"[^"]*")*[^"]*$)       #     any character followed by an odd number of double quotes, i.e. one inside a quoted value
    )                                    #   close non-capturing group 1
    *                                    #   match group 1 zero or more times
  >                                      # match a ">"
/x';
Note that you can just copy and paste this in your code: the 'x' modifier will ignore white spaces and the comments in your regex-string.

But... there are some drawbacks (of course). When you're working with very large strings, it might slow down, because the look-ahead scans towards the end of the string every time it is tried. And also, I must stress that regexes do NOT make good parsers. It really looks like you need a decent parser for this.

My 2 cents.

HTH
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Custom template variable goes too far into string

Post by prometheuzz »

Let me explain just a bit more about that tricky look-ahead voodoo (the .(?!(?:[^"]*"[^"]*")*[^"]*$) part):

Code: Select all

.(?!        # match any character, that when looking ahead does not have (negative look ahead):
  (?:       #   open non-capturing group 2
    [^"]*   #     matches zero or more characters of any type except double quotes
    "       #     matches a double quote
    [^"]*   #     matches zero or more characters of any type except double quotes
    "       #     matches a double quote
  )         #   close non-capturing group 2
  *         #   group 2, zero or more times 
  [^"]*     #   matches zero or more characters of any type except double quotes
  $         #   matches the end of the string
)           # end look ahead
So, in plain English this would read: "match any character (so also '<' and '>'!) that does not have an even number of double quotes between it and the end of the string". As long as the quotes in the text are balanced, that means the character sits inside a quoted value.
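To see that reading in action, here is a small sketch that runs only the look-ahead part against single characters and compares it with simply counting the quotes that follow. The helper name passesLookahead and the sample string are made up for illustration, and it assumes the text has no trailing newline, since "$" can also match just before one:

```php
<?php
// Hypothetical helper (not from the thread): applies only the
// ".(?!...)" part to the character at $offset. \G pins the match to
// $offset; the "s" modifier lets "." match a newline too.
function passesLookahead(string $text, int $offset): bool
{
    $part = '/\G.(?!(?:[^"]*"[^"]*")*[^"]*$)/s';
    return preg_match($part, $text, $m, 0, $offset) === 1;
}

$text = '<a,b="<c>">';

// Offset 8 is the ">" inside the quotes, offset 10 is the closing ">".
foreach ([8, 10] as $offset) {
    // Plain-English version: count the double quotes after this character.
    $quotesAhead = substr_count(substr($text, $offset + 1), '"');
    printf(
        "offset %d: %d quote(s) ahead, look-ahead %s\n",
        $offset,
        $quotesAhead,
        passesLookahead($text, $offset) ? 'passes' : 'fails'
    );
}
```

The ">" inside the quotes has one quote ahead of it, so the look-ahead passes and the character is consumed. The final ">" has none, so the look-ahead fails there and the closing ">" of the main pattern gets to match it.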

Come to think of it, this will also work on the examples you posted:

Code: Select all

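$regex = '/<(?:"[^"]*"|[^>"])*>/';
Note that the quoted-string alternative has to be tried first, and the character class has to exclude the double quote; otherwise [^>] swallows the opening quote and the match stops at the first ">" again. But again, building a parser with only regexes will be messy (of course depending on the language you're parsing). I guess you don't have any escapes inside your strings like this: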

Code: Select all

<fullname,value="<name> \" Smith">
Because that would mean adjusting the regexes I posted and they will start to look like monsters that need to be replaced by a... parser! (you guessed it).
; )
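For comparison, here is roughly what the non-regex route could look like: a minimal sketch of a hand-written scanner, assuming PHP and assuming that a backslash escapes the next character inside double quotes. The function name findTags is hypothetical and this is not a full parser, it only finds the tag boundaries:

```php
<?php
// Hypothetical scanner (not from the thread): returns every tag in $text.
// Inside double quotes, ">" is ignored and a backslash escapes the next
// character. Works on bytes, which is fine for ASCII delimiters.
function findTags(string $text): array
{
    $tags = [];
    $start = null;       // offset of the "<" that opened the current tag
    $inQuotes = false;   // true while between an opening and closing quote
    $length = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];

        if ($start === null) {
            // Outside a tag: only a "<" is interesting.
            if ($char === '<') {
                $start = $i;
            }
            continue;
        }

        if ($inQuotes) {
            if ($char === '\\') {
                $i++;                 // skip the escaped character
            } elseif ($char === '"') {
                $inQuotes = false;    // closing quote
            }
        } elseif ($char === '"') {
            $inQuotes = true;         // opening quote
        } elseif ($char === '>') {
            // Unquoted ">" closes the tag.
            $tags[] = substr($text, $start, $i - $start + 1);
            $start = null;
        }
    }

    return $tags;
}

$sample = '<name,value="Paul"> noise <fullname,value="<name> \" Smith"> <fullname>';
print_r(findTags($sample));
```

It reads each character once and never looks ahead, so the run time grows linearly with the input, and adding new rules (other escapes, single quotes) means adding a branch rather than rewriting a pattern.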

HTH
mintedjo
Forum Contributor
Posts: 153
Joined: Wed Nov 19, 2008 6:23 am

Re: Custom template variable goes too far into string

Post by mintedjo »

Told ya... the guy's like a regex ninja!
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Custom template variable goes too far into string

Post by prometheuzz »

mintedjo wrote:Told ya... the guy's like a regex ninja!
I'm perhaps a brown belt, GeertDD is the black belt around here!
; )
the_d00d
Forum Newbie
Posts: 8
Joined: Sat Nov 22, 2008 8:18 am

Re: Custom template variable goes too far into string

Post by the_d00d »

prometheuzz wrote:...
Wow. Pretty intense. Thanks!

One problem I initially see is that I would have to do additional parsing once the regex finds everything - my initial regex broke everything up into nice manageable pieces which is why I thought I might just be able to amend the end of the initial regex :-)
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Custom template variable goes too far into string

Post by prometheuzz »

the_d00d wrote:
prometheuzz wrote:...
Wow. Pretty intense. Thanks!
No problem.
the_d00d wrote:One problem I initially see is that I would have to do additional parsing once the regex finds everything - my initial regex broke everything up into nice manageable pieces which is why I thought I might just be able to amend the end of the initial regex :-)
Not exactly sure what you mean by that, but you could adjust your original regex with a part of my regex so that it will match the examples you posted:

Code: Select all

<(?:[a-z0-9_]*?)(?:\,(?:[a-z0-9_]*?)(?:\="[^"]*")?)*?>
Out of curiosity, are you implementing this in PHP or some other language?
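If it is PHP, here is a rough sketch of how the "manageable pieces" could be kept: the adjusted regex gets capture groups for the name and the attribute list, and a second pass splits the attributes. The function name parseTag and the shape of the returned array are only assumptions for illustration:

```php
<?php
// Hypothetical helper (not from the thread): splits one tag into its name
// and a list of [key, value] pairs; value is null for bare attributes.
function parseTag(string $tag): ?array
{
    // Same shape as the adjusted regex, anchored, with capture groups:
    //   group 1 = tag name, group 2 = the whole ",key" / ",key="value"" list.
    $tagRegex = '/^<([a-z0-9_]*)((?:,[a-z0-9_]*(?:="[^"]*")?)*)>$/';
    if (!preg_match($tagRegex, $tag, $m)) {
        return null;
    }

    // Second pass: pull each attribute out of the captured list.
    $attrRegex = '/,([a-z0-9_]*)(?:="([^"]*)")?/';
    preg_match_all($attrRegex, $m[2], $pairs, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);

    $attrs = [];
    foreach ($pairs as $pair) {
        $attrs[] = [$pair[1], $pair[2] ?? null];
    }

    return ['name' => $m[1], 'attrs' => $attrs];
}

print_r(parseTag('<fullname,value="<name> Smith">'));
print_r(parseTag('<forbar,bar,foo>'));
```

Doing it in two passes keeps each pattern short; a repeated capture group in a single regex would only remember its last iteration anyway.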
the_d00d
Forum Newbie
Posts: 8
Joined: Sat Nov 22, 2008 8:18 am

Re: Custom template variable goes too far into string

Post by the_d00d »

prometheuzz wrote:

Code: Select all

<(?:[a-z0-9_]*?)(?:\,(?:[a-z0-9_]*?)(?:\="[^"]*")?)*?>
Sweet! I'm going to have to try that out tomorrow!
prometheuzz wrote:Out of curiosity, are you implementing this in PHP or some other language?
Yes - I am making the application in PHP.