preg_match
Posted: Wed Nov 17, 2004 11:30 pm
I'm having a hard time understanding why this is failing.
I'm trying to extract all links (<a> elements) from a page.
Take this example code:
Code:
$url = 'http://cnn.com';
$document = @file_get_contents($url);
preg_match_all('/<a(.*)>(.*)<\/a>/i', $document, $matches);
print_r($matches);
Now $matches[0] has what I want in it, but it doesn't grab the links correctly.
Most are correct, but sometimes it grabs way more than it should (or at least, more than I intend it to).
In the sample output below, indices 22-27 are perfect, but index 28 is the problem. And yes, that whole mess of code is all for match #28, and there are lots more matches just like it.
Why does it keep on matching past the closing </a> sometimes?
Here's some sample output:
Code:
[22] => <a href="/EMAIL/">E-mail Newsletters</a>
[23] => <a href="/youralerts/">Your E-mail Alerts</a>
[24] => <a href="/togo/">CNNtoGO</a>
[25] => <a href="/yourcommand/">TV Commercials</a>
[26] => <a href="/feedback/">Contact Us</a>
[27] => <a name="ContentArea"></a>
[28] => <a href="/2004/TECH/science/11/17/carolina.dig/index.html" style="color:#000;">Dig find could reshape history, say scientists</a></h2></div><div style="background-color:#fff;"><img src="http://i.cnn.net/cnn/images/1.gif" alt="" width="1" height="10"></div> <a href="/2004/TECH/science/11/17/carolina.dig/index.html"><img src="http://i.a.cnn.net/cnn/2004/TECH/science/11/17/carolina.dig/top.2033.digging.usc.jpg"width="280" height="210" alt="Dig find could reshape history, say scientists" border="0" hspace="0" vspace="0"></a><div class="cnnMainT1"><p>A site in South Carolina may rewrite the history of how the Americas were settled by pushing back the date of the first human arrival by thousands of years, archaeologists say. But that interpretation is already igniting controversy among scientists. "If confirmed, then it really does have a significant impact on our previous understanding of New World colonization," said Theodore Schurr, anthropology professor at the University of Pennsylvania.</p><p><a href="/2004/TECH/science/11/17/carolina.dig/index.html" class="cnnt1link">FULL STORY</a></p><p>•&nbsp;<b><span class="cnnBodyText" style="font-weight:bold;color:#333;">Gallery: </span></b> <a href="javascript:CNN_openPopup('/interactive/tech/0411/gallery.carolina.dig/frameset.exclude.html','620x430','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=620,height=430');">Excavation evidence</a><br>•&nbsp;<b><span class="cnnBodyText" style="font-weight:bold;color:#333;">Video: </span><img src="http://i.cnn.net/cnn/.element/img/1.0/misc/premium.gif" alt="premium content" width="9" height="11" hspace="0" vspace="0" border="0" align="absmiddle"></b> <a href="javascript:LaunchVideo('/tech/2004/11/17/sieberg.rewrite.history.cnn.','300k');">Digging up new clues</a><br>•&nbsp;<b><span class="cnnBodyText" style="font-weight:bold;color:#333;">Map: </span></b> <a 
href="javascript:CNN_openPopup('/interactive/maps/us/topper.site/frameset.exclude.html','620x430','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=620,height=430');">The dig site</a><br></p></div><!-- /T1 --></td><td rowspan="2" width="10"><img src="http://i.cnn.net/cnn/images/1.gif" alt="" width="10" height="1"></td><td width="344"><!-- T2 --><div><img src="http://i.a.cnn.net/cnn/.element/img/1.0/main/px_c00.gif" alt="" width="344" height="2"></div><table width="344" border="0" cellpadding="0" cellspacing="0"><tr><td width="261" class="cnnTabbedBoxHeader" style="padding-left:0px;"><span class="cnnBigPrint"><b>MORE NEWS</b></span></td><td width="83" align="right"><a href="/mostpopular/"><img src="http://i.a.cnn.net/cnn/.element/img/1.0/main/a_most_pop.gif" alt="Most Popular" width="83" height="16" border="0"></a></td></tr></table><div class="cnn6pxTRBpad" style="font-weight:bold;"><div class="cnnSectT2s"><div class="cnnMainNewT2"> •&nbsp;<a href="/2004/ALLPOLITICS/11/17/clinton.opening.ap/index.html">Clinton library sees scandals as 'fight for power'</a> | <img src="http://i.cnn.net/cnn/.element/img/1.0/misc/premium.gif" alt="premium content" width="9" height="11" hspace="0" vspace="0" border="0" align="absmiddle">&nbsp;<a href="javascript:LaunchVideo('/politics/2004/11/17/crowley.clinton.library.cnn.','300k');">Video</a><br></div><div class="cnnMainNewT2"> •&nbsp;<a href="/2004/WORLD/meast/11/17/hassoun.evidence/index.html">Probe of Marine's disappearance re-opened</a> | <img src="http://i.cnn.net/cnn/.element/img/1.0/misc/premium.gif" alt="premium content" width="9" height="11" hspace="0" vspace="0" border="0" align="absmiddle">&nbsp;<a href="javascript:LaunchVideo('/world/2004/11/17/starr.marine.mystery.affl.','300k');">Video</a><br></div><div class="cnnMainNewT2"> •&nbsp;<a href="/2004/ALLPOLITICS/11/17/cia.memo/index.html">CIA denies staff ordered to 'back Bush'</a><br></div><div class="cnnMainNewT2"> •&nbsp;<a 
href="/2004/ALLPOLITICS/11/17/agriculture.secretary/index.html">Dem's name aired for Cabinet post</a> | <a href="javascript:CNN_openPopup('/interactive/allpolitics/0411/gallery.cabinet/frameset.exclude.html','620x430','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=620,height=430');">Interactive</a><br></div><div class="cnnMainNewT2"> •&nbsp;<a href="/2004/LAW/11/17/peterson/index.html">Peterson defense seeks new jury</a> | <img src="http://i.cnn.net/cnn/.element/img/1.0/misc/premium.gif" alt="premium content" width="9" height="11" hspace="0" vspace="0" border="0" align="absmiddle">&nbsp;<a href="javascript:LaunchVideo('/law/2004/11/13/dornin.peterson.look.ahead.ktvu.','300k');">Video</a><br></div><div class="cnnMainNewT2"> •&nbsp;<b><span class="cnnBodyText" style="font-weight:bold;">CNN/Money: </span></b> <a href="/money/2004/11/17/news/fortune500/sears_kmart/index.htm?cnn=yes">Kmart-Sears in $11 billion deal</a> | <img src="http://i.cnn.net/cnn/.element/img/1.0/misc/premium.gif" alt="premium content" width="9" height="11" hspace="0" vspace="0" border="0" align="absmiddle">&nbsp;<a href="javascript:LaunchVideo('/business/2004/11/17/snow.kmart.sears.merger.cnn.','300k');">Video</a><br></div><div class="cnnMainNewT2"> •&nbsp;<a href="/2004/US/11/17/nude.newswoman.ap/index.html">News anchor appears nude</a><br></div><div class="cnnMainNewT2"> •&nbsp;<a href="/2004/SHOWBIZ/Movies/11/17/sexiest.man.reut/index.html">Jude Law named 'Sexiest Man Alive'</a><br> </div></div></div><!-- /T2 --><div><img src="http://i.cnn.net/cnn/images/1.gif" alt="" width="1" height="10"></div><!-- =========== CNN Radio/Video Box =========== --><table width="344" border="0" cellpadding="0" cellspacing="0"><tr><td colspan="5" bgcolor="#cccccc"><img src="http://i.a.cnn.net/cnn/images/1.gif" alt="" width="1" height="1" hspace="0" vspace="0"></td></tr><tr valign="top"><td width="1" bgcolor="#cccccc"><img src="http://i.a.cnn.net/cnn/images/1.gif" 
alt="" width="1" height="1" hspace="0" vspace="0"></td><td width="114"><div class="cnn6pxPad"><span class="cnnBigPrint" style="color:#C00;font-weight:bold;">CNN</span><span class="cnnBigPrint" style="color:#000;font-weight:bold;">RADIO</span><div class="cnnMainNewT2"><a href="javascript:CNN_openPopup('/audio/radio/preferences.html','radioplayer','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=200,height=124')">Latest updates</a>
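On why the match runs past the closing </a>: the likely cause is that both .* groups in the pattern are greedy. When several links sit on the same line, the pattern matches from the first <a to the last </a> on that line. Lines with a single link come out fine, which would explain why most matches look correct. Here is a minimal sketch of the difference. It assumes a made-up two-link string in place of the CNN page; the names $greedy and $lazy and the second pattern are illustrative and not from the original post.

```php
<?php
// Hypothetical stand-in for the fetched page: two links on a single line.
$document = '<a href="/a/">First</a> <b>text</b> <a href="/b/">Second</a>';

// Pattern from the post: each .* is greedy, so it consumes as much as it can
// and backtracks only as far as the LAST </a> on the line.
preg_match_all('/<a(.*)>(.*)<\/a>/i', $document, $greedy);

// One possible variant (an assumption, not the poster's code):
//   \b     keeps "<a" from also matching tags such as <abbr> or <area>
//   [^>]*  keeps the attribute group inside the opening tag
//   .*?    is lazy, so it stops at the FIRST </a> that follows
//   /s     lets . match newlines, for links that span several lines
preg_match_all('/<a\b([^>]*)>(.*?)<\/a>/is', $document, $lazy);

print_r($greedy[0]); // one match covering both links and the text between them
print_r($lazy[0]);   // two matches, one per link
```

The second pattern is only one way to rein the match in. For real-world HTML, a parser such as PHP's DOM extension is generally more robust than a regular expression.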