
Regular expression pattern match

Posted: Fri Aug 06, 2004 9:37 pm
by Chris Corbyn
Can anyone see a problem with this pattern match for extracting link URLs from hyperlinks in HTML documents?
<a href="(.+)">
It doesn't seem to stop copying data at the second double quote of the first link on a page. It waits until it reaches the second double quote of the second link on the page and then works as expected.

Any better offers? (Whitespace, case etc. don't matter; just assume the format is strictly <a href="somelink.html">.)

Thanks in advance :-)

Posted: Fri Aug 06, 2004 10:06 pm
by feyd
Your regex is greedy: .+ grabs as much as it can, so it runs on to the last "> on the line. Try the lazy version:

<a href="(.+?)">
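
To see the difference for yourself, here is a rough sketch in Python rather than PHP (its re module treats + and +? the same way for this pattern). The sample line and variable names are made up for illustration. The third pattern, using [^"]+ instead of .+?, is an alternative I'm adding that isn't from the post above; it should give the same result here.

```python
import re

# Hypothetical sample line with two links on it, as in the problem described.
line = '<a href="one.html">One</a> <a href="two.html">Two</a>'

# Greedy: .+ runs to the last "> on the line, then stops.
greedy = re.search(r'<a href="(.+)">', line)

# Lazy: .+? stops at the first "> it can reach.
lazy = re.search(r'<a href="(.+?)">', line)

# Alternative (an assumption, not from the thread): never cross a double quote.
negated = re.search(r'<a href="([^"]+)">', line)

print(greedy.group(1))   # one.html">One</a> <a href="two.html
print(lazy.group(1))     # one.html
print(negated.group(1))  # one.html
```

The greedy capture swallowing everything up to the second link's closing quote matches the symptom described in the first post.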

Posted: Fri Aug 06, 2004 10:07 pm
by Chris Corbyn
No, I've just realised my problem is that if any two links occur on the same line I get the problems stated. Otherwise it works fine. I'm missing something vital in my pattern match that tells it when it's reached the end of the link and to start again on the same line. Could it be \b? I'll test it....

Posted: Fri Aug 06, 2004 10:14 pm
by feyd
try my regex..

Posted: Fri Aug 06, 2004 10:16 pm
by Chris Corbyn
Thanks feyd... it's improved, but now if there happen to be two links on the same line it only reads the first one. What do I use to make it continue along the line? I thought it was "g" for global?

In Perl it would be
/<a href="(.+?)">/gi
right, or am I wrong about the g? It seems to do the same thing in Perl :-(

Thanks again

Posted: Fri Aug 06, 2004 10:19 pm
by feyd
preg_match_all (preg_match stops after the first match)
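
A rough illustration of the first-match versus all-matches difference, sketched in Python rather than PHP: re.search plays the role of a single-match call and re.findall the role of a match-all call. The sample line and variable names are made up, and this is only an analogue, not the PHP API itself.

```python
import re

# Hypothetical sample line with two links on it.
line = '<a href="one.html">One</a> <a href="two.html">Two</a>'

# Same lazy pattern as above; IGNORECASE mirrors the "i" in /.../gi.
pattern = re.compile(r'<a href="(.+?)">', re.IGNORECASE)

# Single-match call: stops after the first link.
first_only = pattern.search(line).group(1)

# Match-all call: keeps scanning along the line and collects every capture.
all_links = pattern.findall(line)

print(first_only)  # one.html
print(all_links)   # ['one.html', 'two.html']
```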

Posted: Fri Aug 06, 2004 10:22 pm
by Chris Corbyn
Cheers! :-) You're a clever guy ;-)

Posted: Fri Aug 06, 2004 10:23 pm
by feyd
For some reason PHP's PCRE functions don't support the g modifier, which I find kinda silly.. but I guess, for the most part, you want them to operate globally anyway, and preg_match_all and preg_replace already do.. I dunno.. :?