regex for html links

Posted: Thu Jan 12, 2006 8:15 pm
by Extremest
I am working on a spider with multi_curl and am having trouble with the regex that finds the links in the content. I am currently using the regex below, but for some reason it does not grab all of them. Could anyone please help me?

Code:

function links($site) {
    // The pattern is built across multiple lines to avoid page distortion.
    // It covers CSS @import "..."; CSS url(...); and src/href/url attributes in tags.
    $pattern = "/((@import\s+[\"'`]([\w:?=@&\/#._;-]+)[\"'`];)|";
    $pattern .= "(:\s*url\s*\([\s\"'`]*([\w:?=@&\/#._;-]+)";
    $pattern .= "([\s\"'`]*\))|<[^>]*\s+(src|href|url)\=[\s\"'`]*";
    $pattern .= "([\w:?=@&\/#._;-]+)[\s\"'`]*[^>]*>))/i";
    // End pattern building.
    // preg_match_all() always fills $matches with an array, so test the match count instead.
    $count = preg_match_all($pattern, $site, $matches);
    return ($count) ? $matches : FALSE;
}

Posted: Thu Jan 12, 2006 8:46 pm
by Extremest
I am sorry, that regex is fine; I have got it working. I am just having some problems with removing the links that I don't want. For some reason it is removing some that are fine, even where there is not a match.
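
The filtering code is not shown here, so the following is only a minimal sketch of one way to drop unwanted links, not the actual code from the spider. The function name filter_links() and the example patterns are assumptions made up for illustration. One thing it guards against is treating the return value of preg_match() loosely: it returns 1 on a match, 0 on no match, and FALSE on error, so the sketch compares strictly against 1.

```php
<?php
// Hypothetical helper (not from the original post): keep only the URLs
// that match none of the "unwanted" patterns.
function filter_links(array $urls, array $unwantedPatterns) {
    $kept = array();
    foreach ($urls as $url) {
        $unwanted = false;
        foreach ($unwantedPatterns as $pattern) {
            // preg_match() returns 1 on a match, 0 on no match, FALSE on error,
            // so only a strict 1 counts as "this link is unwanted".
            if (preg_match($pattern, $url) === 1) {
                $unwanted = true;
                break;
            }
        }
        if (!$unwanted) {
            // Appending to a new array avoids modifying $urls while looping over it.
            $kept[] = $url;
        }
    }
    return $kept;
}

// Example patterns (assumptions): drop mailto:/javascript: links and image files.
$unwanted = array('/^(mailto|javascript):/i', '/\.(gif|jpe?g|png)$/i');
$urls = array('index.php', 'mailto:me@example.com', 'logo.GIF', 'forum/viewtopic.php?t=1');
print_r(filter_links($urls, $unwanted));
```

With these example inputs, only index.php and forum/viewtopic.php?t=1 are kept. Building a fresh array rather than unsetting entries in place also keeps the indexes sequential, which makes it easier to see exactly which links were dropped.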