Unable to get multiple matches

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

iinf
Forum Newbie
Posts: 1
Joined: Thu Oct 09, 2008 4:16 am

Unable to get multiple matches

Post by iinf »

Dear friends,

I am using the following regex: (http\:\/\/news.*\&amp\;url\=)

to extract two matches from:

<url>http://news.thirdparty.com/intl/en_us/i ... itle>State Department - Hindu</title><link>http://news.thirdparty.com/news/url?sa= ... ubDate>Thu, 09 Oct 2008 03:51:34 GMT</pubDate><a href="http://news.thirdparty.com/news/url?sa= ... xtkMZrEbjA">



But it returns only one match. Please guide me.
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Unable to get multiple matches

Post by VladSun »

Code: Select all

text1.*text2
The .* is greedy: it matches as much as possible, i.e. everything between the first occurrence of text1 and the last occurrence of text2.
To get the shortest match instead, make the quantifier lazy:

Code: Select all

text1.*?text2
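
A minimal sketch of the difference in Python, in case it helps to see it run. The sample text and the example.com URLs are made up for illustration (the feed in the question is truncated), and the pattern keeps the "&amp;url=" ending from the original regex.

```python
import re

# Hypothetical sample text standing in for the feed in the question.
text = (
    '<link>http://news.example.com/news/url?sa=T&amp;url=http://a.example/1</link>'
    '<a href="http://news.example.com/news/url?sa=T&amp;url=http://b.example/2">'
)

# Greedy: .* runs to the LAST "&amp;url=", so both URLs end up in one match.
greedy = re.findall(r"http://news.*&amp;url=", text)

# Lazy: .*? stops at the FIRST "&amp;url=" after each "http://news".
lazy = re.findall(r"http://news.*?&amp;url=", text)

print(len(greedy))  # 1
print(len(lazy))    # 2
```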
There are 10 types of people in this world, those who understand binary and those who don't
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Unable to get multiple matches

Post by GeertDD »

iinf wrote: I am using the following regex: (http\:\/\/news.*\&amp\;url\=)
To make that regex more readable, get rid of the needless backslashes. Backslashes are only needed for special regex characters (and for the delimiter, if your pattern happens to use "/" as one). With the quantifier also made lazy, you get the regex below:

Code: Select all

http://news.*?&amp;url=
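
As a quick sanity check, here is a small Python sketch comparing the backslash-heavy pattern with the cleaned-up one. The sample text is hypothetical, and the "&amp;url=" ending is taken from the regex in the original question.

```python
import re

# Hypothetical sample text; the real feed in the question is truncated.
text = (
    '<link>http://news.example.com/news/url?sa=T&amp;url=http://a.example/1</link>'
    '<a href="http://news.example.com/news/url?sa=T&amp;url=http://b.example/2">'
)

# The pattern from the question (made lazy), needless backslashes and all.
escaped = re.findall(r"(http\:\/\/news.*?\&amp\;url\=)", text)

# The same pattern with the backslashes removed.
clean = re.findall(r"http://news.*?&amp;url=", text)

print(escaped == clean)  # True
print(len(clean))        # 2
```

Neither ":", "/", "&", ";" nor "=" is a regex metacharacter, which is why both patterns find the same matches.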
Now the fun part: optimization. By adding [^&]*+ right after "news", you possessively match every character up to the first ampersand. Because a possessive quantifier never gives characters back, this kills needless backtracking.

Code: Select all

http://news[^&]*+.*?&amp;url=
On your particular test string, the second regex was about 55% faster according to my tests.
Last edited by GeertDD on Fri Oct 10, 2008 6:21 am, edited 1 time in total.
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Unable to get multiple matches

Post by VladSun »

// offtopic
@GeertDD - I'm writing an IT game site, and one of its sections is about regexps. The idea is to build the fastest regexp for a given problem. I just want you to know that some of the problems and their solutions are based on your posts in this forum :)

You have an amazing knowledge of regexps. Thanks for sharing it with us :)
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Unable to get multiple matches

Post by prometheuzz »

GeertDD wrote:...

Code: Select all

http://news[^&]*+.*?&amp;url=
...
Geert, I'm assuming you forgot to remove the reluctant dot-star in that regex. I think you meant:

Code: Select all

http://news[^&]*+&amp;url=
Correct?
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Unable to get multiple matches

Post by GeertDD »

@VladSun: Thank you for the kind words.

@prometheuzz: No, the .*? part still needs to be there. If you omit it, the regex won't correctly match URLs anymore. More specifically it will fail at the first "&" in the URL, unless that "&" is directly followed by "amp;url=". See?
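
The point can be checked with a short Python sketch. Two assumptions here: the URL is a made-up example with an extra "&amp;ct=us" parameter before "&amp;url=", and the sketch uses a plain [^&]* instead of the possessive [^&]*+, because Python's re module only accepts possessive quantifiers from version 3.11 on. Since [^&] can never match an ampersand, the plain version finds the same matches; only the amount of backtracking differs.

```python
import re

# Hypothetical URL with another parameter before "&amp;url=".
text = (
    '<a href="http://news.example.com/news/url?sa=T&amp;ct=us'
    '&amp;url=http://a.example/1">'
)

# Without .*? the pattern stops at the first "&", which is followed by
# "amp;ct=" rather than "amp;url=", so there is no match at all.
without_lazy = re.search(r"http://news[^&]*&amp;url=", text)

# With .*? the pattern can step past the first "&" to reach "&amp;url=".
with_lazy = re.search(r"http://news[^&]*.*?&amp;url=", text)

print(without_lazy)        # None
print(with_lazy.group(0))  # http://news.example.com/news/url?sa=T&amp;ct=us&amp;url=
```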
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Unable to get multiple matches

Post by prometheuzz »

GeertDD wrote:@VladSun: Thank you for the kind words.

@prometheuzz: No, the .*? part still needs to be there. If you omit it, the regex won't correctly match URLs anymore. More specifically it will fail at the first "&" in the URL, unless that "&" is directly followed by "amp;url=". See?
Ah, okay, there could be an ampersand before the (sub)string "&amp" in that URL.

Thanks.