Regex for HTML links
Posted: Thu Jan 12, 2006 8:15 pm
I am working on a spider that uses multi_curl, and I am having trouble with the regex that finds the links in the fetched content. The regex below is what I currently use, but for some reason it does not grab all of them. Could anyone please help me?
Code:
function links($site){
    // The pattern is built across multiple lines to avoid distorting the page.
    // Alternatives: CSS @import "..."; CSS url(...); src/href/url attributes inside a tag.
    $pattern = "/((@import\s+[\"'`]([\w:?=@&\/#._;-]+)[\"'`];)|";
    $pattern .= "(:\s*url\s*\([\s\"'`]*([\w:?=@&\/#._;-]+)";
    $pattern .= "([\s\"'`]*\))|<[^>]*\s+(src|href|url)\=[\s\"'`]*";
    $pattern .= "([\w:?=@&\/#._;-]+)[\s\"'`]*[^>]*>))/i";
    // End of pattern building.
    // preg_match_all() returns FALSE on failure, so test its return value rather than $matches.
    $found = preg_match_all($pattern, $site, $matches);
    return ($found !== FALSE) ? $matches : FALSE;
}
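For comparison, here is a minimal sketch of how the href/src part could be collected with PHP's DOM extension instead of a regex. It is not a drop-in replacement for the function above: it assumes PHP 7 or later with the DOM extension enabled, the name extract_links is hypothetical, and it does not cover the CSS @import and url(...) cases. Because the attribute value is read by the parser, characters such as % or + in a URL do not cut the match short.

```php
<?php
// Hypothetical helper (not from the original post): collects href and src
// attribute values with the DOM extension instead of a regex.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect libxml warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($doc);
    // Select every href or src attribute on any element, in document order.
    foreach ($xpath->query('//@href | //@src') as $attr) {
        $links[] = $attr->nodeValue;
    }
    return $links;
}

// Example input: a query string with an entity and a percent-encoded path.
$html = '<a href="page.php?id=1&amp;x=2">a</a><img src="img/a%20b.png">';
print_r(extract_links($html));
```

Something like this could handle the tag attributes, with a much smaller regex kept only for the stylesheet rules.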