Simple regex question

Posted: Sat Jan 31, 2009 8:20 am
by alex.barylski
When matching HTML tags and/or their content I have seen this many times:

Code: Select all

[^>]+>
Why does that work and something like:

Code: Select all

>(.+)</
Would not work?

Re: Simple regex question

Posted: Sat Jan 31, 2009 9:03 am
by prometheuzz
PCSpectra wrote:When matching HTML tags and/or their content I have seen this many times:

Code: Select all

[^>]+>
Why does that work and something like:

Code: Select all

>(.+)</
Would not work?
The STAR and PLUS are greedy quantifiers, and that matters most in combination with the DOT (which matches any character except newlines). As the name 'greedy' implies, such a quantifier "eats" as much as it can. So the regex ">(.+)</" means: "match a '>' followed by one or more characters of any type followed by a '</'." By contrast, '[^>]+' can never run past a '>', so '[^>]+>' always stops at the nearest one. Now take the following string:

Code: Select all

"text <tag1>text1</tag1> more text <tag2>text2</tag2> and more text"
If you match it against the regex '>(.+)</', this will happen:
- the '>' after 'tag1' will be matched;
- the '(.+)' will consume the rest of the string (all the way to the end of the line);
- and lastly, the '(.+)' will need to give some characters back (the regex engine has to backtrack) so that the last part of the regex, '</', can match. Working backwards from the end of the string, the first '</' it finds is the one after 'text2'.

So, the regex '>(.+)</' will match the part marked with carets:

Code: Select all

"text <tag1>text1</tag1> more text <tag2>text2</tag2> and more text"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
which is probably not what you intended!
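
To see the difference in practice, here is a minimal PHP sketch that runs the greedy pattern over the sample string and compares it with a negated character class. The variable names and the '>([^<]+)</' variant are only an illustration and do not come from the posts above. The negated class cannot run past the next '<', which is the same reason '[^>]+>' stays inside a single tag.

```php
<?php
$input = 'text <tag1>text1</tag1> more text <tag2>text2</tag2> and more text';

// Greedy DOT-PLUS: runs to the end of the line, then backtracks to the last '</'.
preg_match('#>(.+)</#', $input, $greedy);

// Negated character class (illustrative alternative): it cannot cross a '<',
// so each match stays between one '>' and the nearest '</'.
preg_match_all('#>([^<]+)</#', $input, $bounded);

// $greedy[1] holds everything from 'text1' up to and including 'text2'.
var_dump($greedy[1]);

// $bounded[1] holds the two separate captures, 'text1' and 'text2'.
var_dump($bounded[1]);
```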

Re: Simple regex question

Posted: Sat Jan 31, 2009 6:15 pm
by alex.barylski
Ahhh...OK thank you, that cleared things up. I knew it was something like that, but couldn't quite nail it. Thank you.

Re: Simple regex question

Posted: Sun Feb 01, 2009 1:08 am
by prometheuzz
PCSpectra wrote:Ahhh...OK thank you, that cleared things up. I knew it was something like that, but couldn't quite nail it. Thank you.
You're welcome.

Re: Simple regex question

Posted: Sun Feb 01, 2009 8:11 pm
by alex.barylski
Might I ask you another question? :)

Going back to my parsing/matching URI problem (the hand crafted version had a similar unanticipated bug -- if there is such a thing -- and I think regex is the way to go).

If doing something like:

Code: Select all

#(.+)/(.+)\.html#
I would expect this to match a URI of the form:

Code: Select all

folder/file.html
However it probably does more (or less) because of the greediness of the RE? So how would I make the (.+) match everything but ONLY once and terminate that group once a delimiter like '/' was stumbled upon?

I'm thinking something like:

Code: Select all

(.+{1})/
But I doubt that would work, as I need it to match any number of characters (not just letters, but really ANYTHING other than '/' or whatever the delimiter happens to be).

Cheers,
Alex

Re: Simple regex question

Posted: Mon Feb 02, 2009 12:06 am
by alex.barylski
Update: Using a ? following the .+ seems to have done *some* of the trick:

Code: Select all

(.+?)\.html
Should match URIs of the form:

Code: Select all

file.html
But will also match:

Code: Select all

folder/file.html
Which is not expected. How do I stop the matching when a '/' is also found, something like (which doesn't work):

Code: Select all

$regex = "#^(.+?|[^/])\.html(.*)#";
EDIT |

I have tried something like the following as well:

Code: Select all

#^([.+|/*]?)\.html(.*)#

Re: Simple regex question

Posted: Mon Feb 02, 2009 3:40 am
by prometheuzz
PCSpectra wrote:Update: Using a ? following the .+ seems to have done *some* of the trick:

Code: Select all

(.+?)\.html
By placing a question mark after a greedy quantifier, you're making it "reluctant" (non-greedy). So that is correct, it will do some of the trick.
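
A small PHP sketch of that difference, using a made-up input with two '.html' occurrences (the input string and variable names are just for illustration):

```php
<?php
$input = 'a.html b.html';

// Greedy: (.+) grabs as much as it can, so the match ends at the LAST '.html'.
preg_match('#(.+)\.html#', $input, $greedy);

// Reluctant: (.+?) grabs as little as it can, so the match ends at the FIRST '.html'.
preg_match('#(.+?)\.html#', $input, $reluctant);

echo $greedy[1], "\n";     // prints: a.html b
echo $reluctant[1], "\n";  // prints: a
```

Both patterns accept the same characters; the question mark only changes how much the group takes before the engine tries the rest of the pattern.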
PCSpectra wrote:Should match URIs of the form:

Code: Select all

file.html
But will also match:

Code: Select all

folder/file.html
Which is not expected.


Yes, it also matches that last string because you're still using a DOT-PLUS, which means: match any character (except newlines) one or more times. Whether this is done greedily or reluctantly, all those characters will get "eaten" by it. The difference between ".+" and ".+?" is this:

Code: Select all

$input = 'abc/def/ghi/jkl';
 
$regex1 = '#.+/.+#';
/*
The first DOT-PLUS will "eat" the entire string and will then backtrack until
it reaches a '/'. Working backwards, the first slash it finds is the LAST slash
in the string. The second DOT-PLUS will then "consume" the rest of the string,
resulting in these matches:
 
    .+    # matches 'abc/def/ghi'
    /     # matches '/' (the last slash)
    .+    # matches 'jkl'
*/
 
$regex2 = '#.+?/.+#';
/*
But now the first DOT-PLUS is reluctant: it will "eat" only the part of the
string up to the first slash, and the second DOT-PLUS will then "consume" the
rest of the string, resulting in these matches:
 
    .+?   # matches 'abc'
    /     # matches '/' (the first slash)
    .+    # matches 'def/ghi/jkl'
*/
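
If you want to check this yourself, a runnable sketch could look like the following. The capture groups and the $m1/$m2 names are added here only to expose each part and are not part of the patterns above.

```php
<?php
$input = 'abc/def/ghi/jkl';

// Same two patterns, with parentheses added so each DOT-PLUS can be inspected.
preg_match('#(.+)/(.+)#', $input, $m1);   // greedy first group
preg_match('#(.+?)/(.+)#', $input, $m2);  // reluctant first group

echo $m1[1], ' | ', $m1[2], "\n";  // prints: abc/def/ghi | jkl
echo $m2[1], ' | ', $m2[2], "\n";  // prints: abc | def/ghi/jkl
```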
PCSpectra wrote:How do I stop the matching when a '/' is also found, something like (which doesn't work):

Code: Select all

$regex = "#^(.+?|[^/])\.html(.*)#";
I'm not sure what exactly you're trying to match/find (only file names?), perhaps you could clarify with a couple of examples?
PCSpectra wrote:EDIT |

I have tried something like the following as well:

Code: Select all

#^([.+|/*]?)\.html(.*)#
Remember that everything between [ and ] will only match one character and that most of the "normal" meta-characters have no special meaning inside them.
So, this part of your regex:

Code: Select all

[.+|/*]
will match one of the following characters: '.', '+', '|', '/' or '*'.
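
A short PHP sketch to confirm this. The second half assumes the goal is a bare file name with no folder part; that is only a guess at the requirement, and the '[^/]+' pattern, the '$' anchor and the variable names are illustrative rather than something taken from the posts.

```php
<?php
// Inside [...] the characters . + | / * are literals, and the class matches
// exactly one of them.
$class = '#^[.+|/*]$#';

var_dump(preg_match($class, '+'));   // 1: a literal plus sign
var_dump(preg_match($class, 'a'));   // 0: the dot is not a wildcard here
var_dump(preg_match($class, '.+'));  // 0: two characters, the class matches one

// Assumption: only a plain file name should match.
// [^/]+ means "one or more characters that are not a slash".
$fileOnly = '#^([^/]+)\.html$#';

var_dump(preg_match($fileOnly, 'file.html'));         // 1
var_dump(preg_match($fileOnly, 'folder/file.html'));  // 0
```

A negated class like '[^/]' is the usual way to say "anything except the delimiter", which a DOT cannot express on its own whether it is greedy or reluctant.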