Simple regex question

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."


alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Simple regex question

Post by alex.barylski »

When matching HTML tags and/or their content I have seen this many times:

Code: Select all

[^>]+>
Why does that work and something like:

Code: Select all

>(.+)</
Would not work?
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Simple regex question

Post by prometheuzz »

The STAR and PLUS are greedy quantifiers, and the effect is most noticeable in combination with the DOT (which matches any character except newlines). As the name 'greedy' implies, such a quantifier "eats" as much as it can. So the regex ">(.+)</" means: "match a '>' followed by one or more characters of any type followed by a '</'." Now take the following string:

Code: Select all

"text <tag1>text1</tag1> more text <tag2>text2</tag2> and more text"
and you start matching with the regex '>(.+)</', this will happen:
- the '>' after 'tag1' will be matched;
- the '(.+)' will consume the rest of the string (all the way to the end of the line);
- and lastly, the '(.+)' will need to give some characters back (the regex engine has to backtrack) so that the last part of the regex, '</', can match. Backtracking starts from the end of the string, so the first '</' it finds is the last one in the string: the '</' of '</tag2>'.

So, the regex '>(.+)</' will match this part of the string:

Code: Select all

>text1</tag1> more text <tag2>text2</
which is probably not what you intended!
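To see the difference for yourself, here is a minimal sketch in PHP using the sample string from above. The second pattern, '>([^<]+)</', is my own variation on the negated character class from the original question (it is not taken from any earlier post): because [^<] can never match a '<', the group cannot run past the first closing tag.

```php
<?php
// Sample input taken from the explanation above.
$input = 'text <tag1>text1</tag1> more text <tag2>text2</tag2> and more text';

// Greedy: (.+) runs to the end of the line, then backtracks to the LAST '</'.
preg_match('#>(.+)</#', $input, $greedy);

// Negated character class (illustrative variation): [^<]+ cannot cross a '<',
// so the group stops right before the FIRST '</'.
preg_match('#>([^<]+)</#', $input, $negated);

echo $greedy[1], "\n";   // text1</tag1> more text <tag2>text2
echo $negated[1], "\n";  // text1
```

This is only meant to illustrate greediness; it is not a robust way to parse HTML.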

Post by alex.barylski »

Ahhh...OK thank you, that cleared things up. I knew it was something like that, but couldn't quite nail it. Thank you.

Post by prometheuzz »

You're welcome.

Post by alex.barylski »

Might I ask you another question? :)

Going back to my URI parsing/matching problem (the hand-crafted version had a similar unanticipated bug -- if there is such a thing -- and I think regex is the way to go).

If doing something like:

Code: Select all

#(.+)/(.+)\.html#
I would expect this to match URIs of the form:

Code: Select all

folder/file.html
However it probably does more (or less) because of the greediness of the RE? So how would I make the (.+) match everything but ONLY once and terminate that group once a delimiter like '/' was stumbled upon?

I'm thinking something like:

Code: Select all

(.+{1})/
But I doubt that would work, as I need it to match any number of characters (not just letters but really ANYTHING other than '/' or whatever the delimiter happens to be).

Cheers,
Alex

Post by alex.barylski »

Update: Using a ? following the .+ seems to have done *some* of the trick:

Code: Select all

(.+?)\.html
Should match URIs of the form:

Code: Select all

file.html
But will also match:

Code: Select all

folder/file.html
Which is not expected. How do I stop the matching when a '/' is also found, something like (which doesn't work):

Code: Select all

$regex = "#^(.+?|[^/])\.html(.*)#";
EDIT |

I have tried something like the following as well:

Code: Select all

#^([.+|/*]?)\.html(.*)#

Post by prometheuzz »

PCSpectra wrote:Update: Using a ? following the .+ seems to have done *some* of the trick:

Code: Select all

(.+?)\.html
By placing a question mark after a greedy quantifier, you're making it "reluctant" (non-greedy). So that is correct, it will do some of the trick.
PCSpectra wrote:Should match URIs of the form:

Code: Select all

file.html
But will also match:

Code: Select all

folder/file.html
Which is not expected.


Yes, it also matches that last string because you're still using a DOT-PLUS, which means match any character (except newlines) one or more times. The DOT matches a '/' just as happily as any other character, so whether the quantifier is greedy or reluctant, the group will "eat" 'folder/file' in order to reach '.html'. The difference between ".+" and ".+?" is this:

Code: Select all

$input = 'abc/def/ghi/jkl';
 
$regex1 = '#.+/.+#';
/*
The first DOT-PLUS will "eat" the entire string and will then backtrack until a 
'/' can be matched: the first slash found while backtracking, which is the last 
slash in the string. The second DOT-PLUS will then "consume" the rest of the 
string, resulting in these matches:
 
    .+    # matches 'abc/def/ghi'
    /     # matches '/' (the last slash)
    .+    # matches 'jkl'
*/
 
$regex2 = '#.+?/.+#';
/*
But now the first DOT-PLUS will "eat" the part of the string until the first 
slash is encountered and the second DOT-PLUS will then "consume" the rest 
of the string, resulting in these matches:
 
    .+?   # matches 'abc'
    /     # matches '/' (the first slash)
    .+    # matches 'def/ghi/jkl'
*/
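If you want to check this yourself, a small sketch like the following should do. I have added capture groups around both DOT-PLUS parts (they are not in the two regexes above) so that preg_match() reports what each part matched.

```php
<?php
$input = 'abc/def/ghi/jkl';

// Greedy first group: backtracks from the end, so the split is at the LAST slash.
preg_match('#(.+)/(.+)#', $input, $greedy);

// Reluctant first group: grows one character at a time, so the split is at the FIRST slash.
preg_match('#(.+?)/(.+)#', $input, $reluctant);

echo $greedy[1], ' | ', $greedy[2], "\n";        // abc/def/ghi | jkl
echo $reluctant[1], ' | ', $reluctant[2], "\n";  // abc | def/ghi/jkl
```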
PCSpectra wrote:How do I stop the matching when a '/' is also found, something like (which doesn't work):

Code: Select all

$regex = "#^(.+?|[^/])\.html(.*)#";
I'm not sure what exactly you're trying to match/find (only file names?), perhaps you could clarify with a couple of examples?
PCSpectra wrote:EDIT |

I have tried something like the following as well:

Code: Select all

#^([.+|/*]?)\.html(.*)#
Remember that everything between [ and ] will only match one character and that most of the "normal" meta-characters (such as '.', '+', '|' and '*') have no special meaning inside them.
So, this part of your regex:

Code: Select all

[.+|/*]
will match one of the following characters: '.', '+', '|', '/' or '*'.
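As a rough sketch of both points: the first two checks show that the class matches exactly one literal character, and the rest shows one possible way to stop a group at a delimiter, using a negated character class like the [^>] from the first post. The requirement I'm assuming here (exactly one folder, then a file name ending in '.html') is only a guess until you post some examples, so treat $regex as hypothetical.

```php
<?php
// Inside [...] the characters . + | / * are literals, and the class matches exactly one character.
var_dump(preg_match('#^[.+|/*]$#', '+'));   // int(1)
var_dump(preg_match('#^[.+|/*]$#', 'ab'));  // int(0)

// [^/]+ means "one or more characters that are not '/'", so a group stops at the delimiter.
// Assumed requirement (hypothetical): one folder, a slash, then a file name ending in .html.
$regex = '#^([^/]+)/([^/]+)\.html$#';

preg_match($regex, 'folder/file.html', $m);
echo $m[1], ' | ', $m[2], "\n";                                 // folder | file

var_dump(preg_match($regex, 'a/b/file.html'));                  // int(0), too many folders
var_dump(preg_match('#^([^/]+)\.html$#', 'folder/file.html'));  // int(0), the group cannot cross '/'
```

The ^ and $ anchors matter here: without them the regex is free to start matching somewhere in the middle of the string, right after a slash.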