Extracting href attributes


Post by hessodreamy »

Apologies for my denseness. I am trying to extract from an HTML page a list of all the hyperlinks, images and any untagged text, but I'm stuck at the first hurdle.

To get the url I am trying:

$string = "<div align=\"left\"><a href=\"http://www.cjindustries.co.uk\" target=\"_parent\"><img src=\"b_home.jpg\" alt=\"[Home]\" border=\"none\"></a><a href=\"index.html\"><img src=\"b_catl.jpg\" alt=\"[Catalog]\" border=\"none\"></a></div>";

preg_match_all("/a href=\"(.*)\"/", $string, $out);
But this matches almost the entire string, giving me a single match (the full match plus the captured group):

Array
(
    [0] => Array
        (
            [0] => a href="http://www.cjindustries.co.uk" target="_parent"><img src="b_home.jpg" alt="[Home]" border="none"></a><a href="index.html"><img src="b_catl.jpg" alt="[Catalog]" border="none"
        )

    [1] => Array
        (
            [0] => http://www.cjindustries.co.uk" target="_parent"><img src="b_home.jpg" alt="[Home]" border="none"></a><a href="index.html"><img src="b_catl.jpg" alt="[Catalog]" border="none
        )

)
I want the matched string to end at the " after the URL; however, as you can see, it ends at the LAST " on the line.

Can someone tell me where I am going wrong?
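
For reference, the behaviour described above is what a greedy quantifier does: `(.*)` first consumes the rest of the line and then backtracks only as far as the last `"`. A minimal sketch of two common ways to stop at the first closing quote instead, reusing the sample markup from this post (the variable names `$greedy`, `$hrefs` and `$lazy` are made up for illustration):

```php
<?php
// Sample markup from the post above, single-quoted so the inner quotes need no escaping.
$string = '<div align="left"><a href="http://www.cjindustries.co.uk" target="_parent"><img src="b_home.jpg" alt="[Home]" border="none"></a><a href="index.html"><img src="b_catl.jpg" alt="[Catalog]" border="none"></a></div>';

// Greedy: (.*) runs to the end of the line, then backs up to the LAST quote.
preg_match_all('/a href="(.*)"/', $string, $greedy);

// Negated character class: [^"]* cannot cross a quote, so it ends at the first one.
preg_match_all('/a href="([^"]*)"/', $string, $hrefs);

// Lazy quantifier: (.*?) grows one character at a time until a quote follows.
preg_match_all('/a href="(.*?)"/', $string, $lazy);

print_r($hrefs[1]);
```

Both variants assume the attribute value is wrapped in double quotes and written exactly as `a href="..."`, which holds for this sample but not for HTML in general.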

Post by feyd »

Tested in a regex sandbox, not in PHP:

preg_match_all('#<\s*a\s+(?:[A-Za-z0-9_]+(?:\s*=\s*(["\']?)(?:[^\\1]*?)\\1)?)*?\s*href\s*=\s*(["\']?)([^\\2]*?)\\2.*?>#', $string, $matches);
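
One caveat on the pattern above: inside a character class, PCRE does not treat `\1` or `\2` as a backreference, so `[^\\1]` does not mean "anything but the opening quote". The pattern still tends to stop in the right place because of the lazy `*?` followed by the real backreference outside the class. A simplified sketch of the same idea follows, assuming every attribute value is quoted and contains no `>`; the names `$links` and `$images` are hypothetical, and note that the URL lands in group 2 here rather than group 3:

```php
<?php
// Sample markup from the first post.
$string = '<div align="left"><a href="http://www.cjindustries.co.uk" target="_parent"><img src="b_home.jpg" alt="[Home]" border="none"></a><a href="index.html"><img src="b_catl.jpg" alt="[Catalog]" border="none"></a></div>';

// (["\']) captures the opening quote; \1 (outside any character class)
// requires the same quote to close the value. (.*?) is lazy, so it stops there.
preg_match_all('#<a\b[^>]*?\shref\s*=\s*(["\'])(.*?)\1#is', $string, $links);

// Same approach for the src attribute of img tags.
preg_match_all('#<img\b[^>]*?\ssrc\s*=\s*(["\'])(.*?)\1#is', $string, $images);

print_r($links[2]);
print_r($images[2]);
```

The `\s` before `href` and `src` keeps attributes such as `data-href` from matching, and the `i` and `s` modifiers allow upper-case tags and values that span lines. Unquoted attribute values are not handled by this sketch.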