I like the DOMDocument method; I'm going to go read up on that. Here is a solution using the regex method:
<?php
// Get the file into this variable, I just set it to grab CNN for testing...
$strFile = file_get_contents("http://www.cnn.com/");
// The pattern below has three capture groups, numbered 1-3 from left to right:
preg_match_all('%<a .*?href=("|\')(.*?)\1.*?>(.*?)</a>%i', $strFile, $aryMatch, PREG_PATTERN_ORDER);
// in $aryMatch:
// [0] is array of all complete matches
// [1] is array of the opening quote, either single or double, so it can match the closing
// [2] is array of the actual URL of the link
// [3] is array of the text for the link
if (isset($aryMatch[2]) && count($aryMatch[2]) > 0) {
    foreach ($aryMatch[2] as $key => $strURL) {
        $strLinkText = $aryMatch[3][$key]; // Added this for easier readability
        echo ($key + 1), ': ';
        if (preg_match('/^javascript:/i', $strURL)) {
            echo "<strong><em>Javascript Call</em></strong><br>\n";
        }
        else {
            // Escape both values so markup in the link text or URL is not rendered
            echo htmlspecialchars($strLinkText), '<strong> LINKS TO </strong>', htmlspecialchars($strURL), "<br>\n";
        }
    }
}
else {
    echo "Sorry, no links found...";
}
?>
A note before you copy and paste that: the editor here kept changing the code on me. The pattern in the preg_match line that checks for javascript is actually supposed to be
/^javascript:/i
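For comparison, the DOMDocument approach might look roughly like the sketch below. The function name extractLinks() and the sample HTML are made up for illustration, and it needs PHP's DOM extension. Treat it as a starting point, not a tested replacement for the regex version.

```php
<?php
// Sketch only: extractLinks() and $strSample are illustrative names, not part of the script above.
function extractLinks(string $strHTML): array
{
    $aryLinks = [];
    $objDoc = new DOMDocument();

    // Real pages are rarely valid markup, so collect libxml's parse warnings instead of printing them.
    $blnPrevious = libxml_use_internal_errors(true);
    $objDoc->loadHTML($strHTML);
    libxml_clear_errors();
    libxml_use_internal_errors($blnPrevious);

    foreach ($objDoc->getElementsByTagName('a') as $objAnchor) {
        if (!$objAnchor->hasAttribute('href')) {
            continue; // <a name="..."> targets have no href, so they are not links
        }
        $aryLinks[] = [
            'url'  => $objAnchor->getAttribute('href'),
            'text' => trim($objAnchor->textContent), // text only, nested tags are dropped
        ];
    }
    return $aryLinks;
}

$strSample = '<p><a href="/news">News</a> <a name="top">Top</a> <a href=\'#top\'><b>Back</b> up</a></p>';
foreach (extractLinks($strSample) as $intKey => $aryLink) {
    echo ($intKey + 1), ': ', $aryLink['text'], ' LINKS TO ', $aryLink['url'], "\n";
}
```

One difference from the regex: the parser does not care about quote style or attribute order, and the link text comes back without any nested tags.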
Another item you may want to consider, depending on your use of the data, is checking whether a link starts with #, which is just a link to an anchor on the same page. If it does, change it from
#whatever to
/path/to/file#whatever
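A minimal sketch of that check, assuming you already know the path of the page you fetched. The helper name resolveAnchor() and the '/path/to/file' value are placeholders for illustration:

```php
<?php
// Sketch only: resolveAnchor() is a hypothetical helper, not part of the script above.
function resolveAnchor(string $strURL, string $strPagePath): string
{
    // A link that starts with # points at an anchor on the current page,
    // so prefix it with the page's own path.
    if (strncmp($strURL, '#', 1) === 0) {
        return $strPagePath . $strURL;
    }
    // Anything else is passed through unchanged.
    return $strURL;
}

echo resolveAnchor('#whatever', '/path/to/file'), "\n";   // /path/to/file#whatever
echo resolveAnchor('/other/page', '/path/to/file'), "\n"; // /other/page
```

You could call it on $strURL inside the foreach loop before echoing the link.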
-Greg