Unable to get multiple matches
Posted: Thu Oct 09, 2008 4:17 am
by iinf
Dear friends,
I am using the following regex: (http\:\/\/news.*\&amp\;url\=)
to extract 2 matches from:
<url>
http://news.thirdparty.com/intl/en_us/i ... itle>State Department - Hindu</title><link>
http://news.thirdparty.com/news/url?sa= ... ubDate>Thu, 09 Oct 2008 03:51:34 GMT</pubDate><a href="
http://news.thirdparty.com/news/url?sa= ... xtkMZrEbjA">
But it returns only one match. Please guide me.
Re: Unable to get multiple matches
Posted: Thu Oct 09, 2008 4:28 am
by VladSun
The .* is greedy, so it will match the longest possible string, i.e. everything between the first occurrence of text1 and the last occurrence of text2.
To make it match the shortest possible string, use the lazy form .*? instead.
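To illustrate the difference, here is a minimal sketch in Python. The thread does not name a language, so Python's re module is an assumption, and the sample feed text with its example.com URLs is invented for illustration.

```python
import re

# Invented sample text: two news URLs, each containing "&amp;url=".
SAMPLE = (
    '<link>http://news.example.com/news/url?sa=T&amp;ct=us&amp;url=http://a.example/1</link>'
    '<a href="http://news.example.com/news/url?sa=T&amp;ct=us&amp;url=http://b.example/2">'
)

# Greedy: .* runs to the end of the text, then backs up to the LAST "&amp;url=".
GREEDY = r'http://news.*&amp;url='
# Lazy: .*? grows one character at a time and stops at the FIRST "&amp;url=".
LAZY = r'http://news.*?&amp;url='

greedy_matches = re.findall(GREEDY, SAMPLE)
lazy_matches = re.findall(LAZY, SAMPLE)

print(len(greedy_matches))  # 1 (one long match spanning both URLs)
print(len(lazy_matches))    # 2 (one short match per URL)
```

With the greedy pattern the single match swallows everything from the first URL up to the last "&amp;url=", which is why only one result comes back.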
Re: Unable to get multiple matches
Posted: Fri Oct 10, 2008 5:17 am
by GeertDD
iinf wrote:I am using the following regex: (http\:\/\/news.*\&amp\;url\=)
To make that regex look better, get rid of the needless backslashes. Backslashes are only needed for special regex characters. So you get the more readable regex below:
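One way to check that those backslashes are redundant is to run both spellings against the same text. This is only a sketch: Python is an assumption (the thread does not name a language), the sample string is invented, and it assumes the literal text to match is "&amp;url=", as the later replies indicate.

```python
import re

# The regex as quoted, with every punctuation character escaped.
ESCAPED = r'(http\:\/\/news.*\&amp\;url\=)'
# The same regex without the needless backslashes.
PLAIN = r'(http://news.*&amp;url=)'

# Invented sample text containing a single news URL.
SAMPLE = 'see http://news.example.com/news/url?sa=T&amp;url=http://a.example/1 for details'

escaped_result = re.findall(ESCAPED, SAMPLE)
plain_result = re.findall(PLAIN, SAMPLE)

print(plain_result)
print(escaped_result == plain_result)
```

Characters such as ":", "/", "&", ";" and "=" have no special meaning here, so escaping them changes nothing. The dot is different: left unescaped it matches any character, and only "\." would match a literal dot.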
Now the fun part: optimization. By adding [^&]*+ right after "news", you possessively match all characters following until the first ampersand. This kills needless backtracking.
On your particular test string, the second regex was about 55% faster according to my tests.
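A rough way to try this yourself, sketched in Python under a few assumptions: Python is not named in the thread, possessive quantifiers such as *+ are only available in its re module from version 3.11 (older versions fall back to a plain greedy class here), and the sample text and the helper names find_urls and seconds_for are made up for this example.

```python
import re
import sys
import timeit

# Invented sample text: two news URLs, each with several "&amp;" parameters.
SAMPLE = (
    '<link>http://news.example.com/news/url?sa=T&amp;ct=us&amp;url=http://a.example/1</link>'
    '<a href="http://news.example.com/news/url?sa=T&amp;ct=us&amp;url=http://b.example/2">'
)

LAZY_ONLY = r'http://news.*?&amp;url='

# Possessive quantifiers need Python 3.11+. On older versions use a plain
# greedy class, which finds the same matches but may still backtrack.
CLASS_QUANT = '*+' if sys.version_info >= (3, 11) else '*'
WITH_CLASS = r'http://news[^&]' + CLASS_QUANT + r'.*?&amp;url='


def find_urls(pattern, text=SAMPLE):
    """Return every match of pattern in text."""
    return re.findall(pattern, text)


def seconds_for(pattern, text=SAMPLE, runs=10000):
    """Time `runs` full scans of text with the compiled pattern."""
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: compiled.findall(text), number=runs)


print(find_urls(LAZY_ONLY) == find_urls(WITH_CLASS))
print(seconds_for(LAZY_ONLY), seconds_for(WITH_CLASS))
```

Both patterns return the same matches. The timings depend on the regex engine, the input, and the machine, so the 55% figure above should be read as a measurement on the original test string, not something this sketch will reproduce.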
Re: Unable to get multiple matches
Posted: Fri Oct 10, 2008 5:26 am
by VladSun
// offtopic
@GeertDD - I'm writing an IT game site, and one of its sections is about regexps. The idea is to build the fastest regexp for a specific problem. I just want you to know that some of the problems and their solutions are based on your posts in this forum.
You have amazing knowledge of regexps. Thanks for sharing it with us.

Re: Unable to get multiple matches
Posted: Fri Oct 10, 2008 6:34 am
by prometheuzz
Geert, I'm assuming you forgot to remove the reluctant dot-star in that regex, I think you meant:
Correct?
Re: Unable to get multiple matches
Posted: Fri Oct 10, 2008 6:43 am
by GeertDD
@VladSun: Thank you for the kind words.
@prometheuzz: No, the .*? part still needs to be there. If you omit it, the regex won't correctly match URLs anymore. More specifically it will fail at the first "&" in the URL, unless that "&" is directly followed by "amp;url=". See?
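The point can be checked with a small sketch. Python is an assumption here (its re module supports *+ only from version 3.11, so older versions fall back to a plain greedy class, which gives the same yes/no results in this case), and both sample URLs and the helper name matches are invented.

```python
import re
import sys

# Possessive where available (Python 3.11+), plain greedy class otherwise.
CLASS_QUANT = '*+' if sys.version_info >= (3, 11) else '*'

WITH_DOT_STAR = r'http://news[^&]' + CLASS_QUANT + r'.*?&amp;url='
WITHOUT_DOT_STAR = r'http://news[^&]' + CLASS_QUANT + r'&amp;url='

# First "&" is directly followed by "amp;url=".
ONE_PARAM = 'http://news.example.com/news/url?sa=T&amp;url=http://a.example/1'
# First "&" starts a different parameter; "&amp;url=" comes later.
TWO_PARAMS = 'http://news.example.com/news/url?sa=T&amp;ct=us&amp;url=http://a.example/1'


def matches(pattern, text):
    """True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None


print(matches(WITH_DOT_STAR, ONE_PARAM), matches(WITH_DOT_STAR, TWO_PARAMS))
print(matches(WITHOUT_DOT_STAR, ONE_PARAM), matches(WITHOUT_DOT_STAR, TWO_PARAMS))
```

Without the .*? the character class stops at the first "&", and the pattern then demands "&amp;url=" at exactly that position, so the URL with an extra parameter is rejected.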
Re: Unable to get multiple matches
Posted: Fri Oct 10, 2008 6:51 am
by prometheuzz
GeertDD wrote:@VladSun: Thank you for the kind words.
@prometheuzz: No, the .*? part still needs to be there. If you omit it, the regex won't correctly match URLs anymore. More specifically it will fail at the first "&" in the URL, unless that "&" is directly followed by "amp;url=". See?
Ah, okay, there could be an ampersand before the (sub)string "&amp;url=" in that url.
Thanks.