Simple regex question
Posted: Sat Jan 31, 2009 8:20 am
by alex.barylski
When matching HTML tags and/or their content I have seen this many times:
Why does that work and something like:
Would not work?
Re: Simple regex question
Posted: Sat Jan 31, 2009 9:03 am
by prometheuzz
PCSpectra wrote:When matching HTML tags and/or their content I have seen this many times:
Why does that work and something like:
Would not work?
The STAR and PLUS are greedy quantifiers, which is most noticeable in combination with the DOT (which matches any character except newlines). As the name 'greedy' already implies, such a quantifier "eats" as much as it can. So the regex '>(.+)</' means: "match a '>' followed by one or more characters of any type, followed by a '</'." Now take the following string:
Code: Select all
"text <tag1>text1</tag1> more text <tag2>text2</tag2> and more text"
and you start matching with the regex '>(.+)</', this will happen:
- the '>' of '<tag1>' will be matched;
- the '(.+)' will consume the rest of the string (all the way to the end of the line);
- and lastly, the '(.+)' will need to give some characters back (the regex engine has to backtrack), so, working backwards from the end of the string, the first occurrence of the last part of the regex, '</', will be matched, which is the '</' of '</tag2>'.
So, the regex '>(.+)</' will match the part marked with carets:
Code: Select all
"text <tag1>text1</tag1> more text <tag2>text2</tag2> and more text"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
which is probably not what you intended!
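To see the walkthrough above in action, here is a minimal PHP sketch using the same sample string. The two alternative patterns (a negated character class and a reluctant quantifier) are common fixes and are shown only as an illustration; they are not necessarily the pattern from the original question.

```php
<?php
// Sample string taken from the post above.
$input = 'text <tag1>text1</tag1> more text <tag2>text2</tag2> and more text';

// Greedy: (.+) runs to the end of the line, then backtracks to the LAST '</'.
preg_match('#>(.+)</#', $input, $greedy);

// Negated character class: [^<]+ can never cross a '<', so it stops before the first '</'.
preg_match('#>([^<]+)</#', $input, $negated);

// Reluctant: (.+?) grows one character at a time until '</' can match.
preg_match('#>(.+?)</#', $input, $reluctant);

echo $greedy[1], "\n";    // text1</tag1> more text <tag2>text2
echo $negated[1], "\n";   // text1
echo $reluctant[1], "\n"; // text1
```

The greedy capture spans both tags, while the other two stop at the content of the first tag.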
Re: Simple regex question
Posted: Sat Jan 31, 2009 6:15 pm
by alex.barylski
Ahhh...OK thank you, that cleared things up. I knew it was something like that, but couldn't quite nail it. Thank you.
Re: Simple regex question
Posted: Sun Feb 01, 2009 1:08 am
by prometheuzz
PCSpectra wrote:Ahhh...OK thank you, that cleared things up. I knew it was something like that, but couldn't quite nail it. Thank you.
You're welcome.
Re: Simple regex question
Posted: Sun Feb 01, 2009 8:11 pm
by alex.barylski
Might I ask you another question?
Going back to my parsing/matching URI problem (the hand-crafted version had a similar unanticipated bug -- if there is such a thing -- and I think regex is the way to go).
If doing something like:
I would expect this to match URI of the form:
However it probably does more (or less) because of the greediness of the RE? So how would I make the (.+) match everything but ONLY once and terminate that group once a delimiter like '/' was stumbled upon?
I'm thinking something like:
But I doubt that would work, as I need it to match any number of characters (not just letters but really ANYTHING other than '/' or whatever the delimiter happens to be).
Cheers,
Alex
Re: Simple regex question
Posted: Mon Feb 02, 2009 12:06 am
by alex.barylski
Update: Using a ? following the .+ seems to have done *some* of the trick:
Should match URIs of the form:
But will also match:
Which is not expected. How do I stop the matching when a '/' is also found, something like (which doesn't work):
Code: Select all
$regex = "#^(.+?|[^/])\.html(.*)#";
EDIT |
I have tried something like the following as well:
Re: Simple regex question
Posted: Mon Feb 02, 2009 3:40 am
by prometheuzz
PCSpectra wrote:Update: Using a ? following the .+ seems to have done *some* of the trick:
By placing a question mark after a greedy quantifier, you're making it "reluctant" (non-greedy). So that is correct, it will do some of the trick.
PCSpectra wrote:Should match URIs of the form:
But will also match:
Which is not expected.
Yes, it also matches that last string because you're still using a DOT-PLUS, which means: match any character (except newlines) one or more times. Whether this is done greedily or reluctantly, all those characters will get "eaten" by it. The difference between ".+" and ".+?" is this:
Code: Select all
$input = 'abc/def/ghi/jkl';
$regex1 = '#.+/.+#';
/*
The first DOT-PLUS will "eat" the entire string and will then backtrack to the
last '/' (the first slash it meets when backtracking from the end). The second
DOT-PLUS will then "consume" the rest of the string, resulting in these matches:
.+   # matches 'abc/def/ghi'
/    # matches '/' (the last slash)
.+   # matches 'jkl'
*/
$regex2 = '#.+?/.+#';
/*
But now the first DOT-PLUS is reluctant: it will "eat" only the part of the string
up to the first slash, and the second DOT-PLUS will then "consume" the rest
of the string, resulting in these matches:
.+?  # matches 'abc'
/    # matches '/' (the first slash)
.+   # matches 'def/ghi/jkl'
*/
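The comments above can be checked with a short runnable sketch. The capture groups are added here purely so the two halves can be printed; they are not part of the patterns in the post.

```php
<?php
$input = 'abc/def/ghi/jkl';

// Greedy first group: the split happens at the LAST slash.
preg_match('#(.+)/(.+)#', $input, $m1);

// Reluctant first group: the split happens at the FIRST slash.
preg_match('#(.+?)/(.+)#', $input, $m2);

printf("greedy:    '%s' | '%s'\n", $m1[1], $m1[2]); // 'abc/def/ghi' | 'jkl'
printf("reluctant: '%s' | '%s'\n", $m2[1], $m2[2]); // 'abc' | 'def/ghi/jkl'

// The overall match is the whole string in both cases; only the split differs.
var_dump($m1[0] === $m2[0]); // bool(true)
```

This is why making the quantifier reluctant only does part of the trick: the pattern as a whole still matches the same text.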
PCSpectra wrote:How do I stop the matching when a '/' is also found, something like (which doesn't work):
Code: Select all
$regex = "#^(.+?|[^/])\.html(.*)#";
I'm not sure what exactly you're trying to match/find (only file names?), perhaps you could clarify with a couple of examples?
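In the meantime, one common way to stop a group at a delimiter is a negated character class such as [^/]+, which matches any run of characters that are not a slash. The sketch below is only a guess at the intent: the sample URIs are made up for illustration, and the pattern assumes the part before '.html' must not contain a '/'.

```php
<?php
// Assumed intent: capture a name without slashes, followed by '.html' and an optional rest.
$regex = '#^([^/]+)\.html(.*)$#';

// Hypothetical sample URIs, not taken from the thread.
$flat   = 'index.html?x=1';
$nested = 'docs/index.html';

$flatHit   = preg_match($regex, $flat, $m);  // 1: [^/]+ captures 'index', (.*) captures '?x=1'
$nestedHit = preg_match($regex, $nested);    // 0: [^/]+ cannot step over the '/'

echo $flatHit, ' ', $m[1], ' ', $m[2], "\n";
echo $nestedHit, "\n";
```

Unlike (.+?), the class [^/]+ cannot match a slash at all, no matter how much the engine backtracks.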
PCSpectra wrote:EDIT |
I have tried something like the following as well:
Remember that everything between [ and ] will only match one character and that most of the "normal" meta-characters have no special meaning inside them (the exceptions include '^' at the start, '-' between two characters, ']' and '\').
So, this part of your regex:
will match one of the following characters: '.', '+', '|', '/' or '*'.
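A quick way to convince yourself of this is to test a character class against single characters. The class below is hypothetical: it simply contains the five characters listed above and is not necessarily the exact class from the quoted regex.

```php
<?php
// Hypothetical class containing '.', '+', '|', '/' and '*'; inside [ ] they are all literal.
$class = '#^[.+|/*]$#';

$results = [];
foreach (['.', '+', '|', '/', '*', 'a', '..'] as $s) {
    // preg_match returns 1 on a match and 0 otherwise.
    $results[$s] = preg_match($class, $s);
    printf("%-4s => %s\n", var_export($s, true), $results[$s] ? 'match' : 'no match');
}
```

Each of the five listed characters matches on its own, 'a' does not, and '..' does not either, because the class consumes exactly one character.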