preg_match

Posted: Wed Nov 17, 2004 11:30 pm
by rehfeld
I'm having a hard time understanding why this is failing.

I'm trying to extract all links (<a> tags) from a page.

Take this example code:

Code:

$url = 'http://cnn.com';

$document = @file_get_contents($url);

preg_match_all('/<a(.*)>(.*)<\/a>/i', $document, $matches);

print_r($matches);
Now $matches[0] has what I want in it, but it doesn't grab the links right.

Most are correct, but sometimes it grabs way more than it should (or at least, more than I intend it to).

Here's some sample output:

Code:

            [22] => <a href="/EMAIL/">E-mail Newsletters</a>
            [23] => <a href="/youralerts/">Your E-mail Alerts</a>
            [24] => <a href="/togo/">CNNtoGO</a>
            [25] => <a href="/yourcommand/">TV Commercials</a>
            [26] => <a href="/feedback/">Contact Us</a>
            [27] => <a name="ContentArea"></a>
            [28] => <a href="/2004/TECH/science/11/17/carolina.dig/index.html" style="color:#000;">Dig find could reshape history, say scientists</a></h2></div><div style="background-color:#fff;"><img src="http://i.cnn.net/cnn/images/1.gif" alt="" width="1" height="10"></div>    <a href="/2004/TECH/science/11/17/carolina.dig/index.html"><img src="http://i.a.cnn.net/cnn/2004/TECH/science/11/17/carolina.dig/top.2033.digging.usc.jpg"width="280" height="210" alt="Dig find could reshape history, say scientists" border="0" hspace="0" vspace="0"></a><div class="cnnMainT1"><p>A site in South Carolina may rewrite the history of how the Americas were settled by pushing back the date of the first human arrival by thousands of years, archaeologists say. But that interpretation is already igniting controversy among scientists. "If confirmed, then it really does have a significant impact on our previous understanding of New World colonization," said Theodore Schurr, anthropology professor at the University of Pennsylvania.</p><p><a href="/2004/TECH/science/11/17/carolina.dig/index.html" class="cnnt1link">FULL STORY</a></p><p>&#149;&nbsp;<b><span class="cnnBodyText" style="font-weight:bold;color:#333;">Gallery: </span></b> <a href="javascript:CNN_openPopup('/interactive/tech/0411/gallery.carolina.dig/frameset.exclude.html','620x430','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=620,height=430');">Excavation evidence</a><br>&#149;&nbsp;<b><span class="cnnBodyText" style="font-weight:bold;color:#333;">Video: </span><img src="http://i.cnn.net/cnn/.element/img/1.0/misc/premium.gif" alt="premium content" width="9" height="11" hspace="0" vspace="0" border="0" align="absmiddle"></b> <a 
href="javascript:LaunchVideo('/tech/2004/11/17/sieberg.rewrite.history.cnn.','300k');">Digging up new clues</a><br>&#149;&nbsp;<b><span class="cnnBodyText" style="font-weight:bold;color:#333;">Map: </span></b> <a href="javascript:CNN_openPopup('/interactive/maps/us/topper.site/frameset.exclude.html','620x430','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=620,height=430');">The dig site</a><br></p></div><!-- /T1 --></td><td rowspan="2" width="10"><img src="http://i.cnn.net/cnn/images/1.gif" alt="" width="10" height="1"></td><td width="344"><!-- T2 --><div><img src="http://i.a.cnn.net/cnn/.element/img/1.0/main/px_c00.gif" alt="" width="344" height="2"></div><table width="344" border="0" cellpadding="0" cellspacing="0"><tr><td width="261" class="cnnTabbedBoxHeader" style="padding-left:0px;"><span class="cnnBigPrint"><b>MORE NEWS</b></span></td><td width="83" align="right"><a href="/mostpopular/"><img src="http://i.a.cnn.net/cnn/.element/img/1.0/main/a_most_pop.gif" alt="Most Popular" width="83" height="16" border="0"></a></td></tr></table><div class="cnn6pxTRBpad" style="font-weight:bold;"><div class="cnnSectT2s"><div class="cnnMainNewT2">         &#149;&nbsp;<a href="/2004/ALLPOLITICS/11/17/clinton.opening.ap/index.html">Clinton library sees scandals as 'fight for power'</a> | <img src="http://i.cnn.net/cnn/.element/img/1.0/misc/premium.gif" alt="premium content" width="9" height="11" hspace="0" vspace="0" border="0" align="absmiddle">&nbsp;<a href="javascript:LaunchVideo('/politics/2004/11/17/crowley.clinton.library.cnn.','300k');">Video</a><br></div><div class="cnnMainNewT2">         &#149;&nbsp;<a 
href="/2004/WORLD/meast/11/17/hassoun.evidence/index.html">Probe of Marine's disappearance re-opened</a> | <img src="http://i.cnn.net/cnn/.element/img/1.0/misc/premium.gif" alt="premium content" width="9" height="11" hspace="0" vspace="0" border="0" align="absmiddle">&nbsp;<a href="javascript:LaunchVideo('/world/2004/11/17/starr.marine.mystery.affl.','300k');">Video</a><br></div><div class="cnnMainNewT2">         &#149;&nbsp;<a href="/2004/ALLPOLITICS/11/17/cia.memo/index.html">CIA denies staff ordered to 'back Bush'</a><br></div><div class="cnnMainNewT2">         &#149;&nbsp;<a href="/2004/ALLPOLITICS/11/17/agriculture.secretary/index.html">Dem's name aired for Cabinet post</a> | <a href="javascript:CNN_openPopup('/interactive/allpolitics/0411/gallery.cabinet/frameset.exclude.html','620x430','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=620,height=430');">Interactive</a><br></div><div class="cnnMainNewT2">         &#149;&nbsp;<a href="/2004/LAW/11/17/peterson/index.html">Peterson defense seeks new jury</a> | <img src="http://i.cnn.net/cnn/.element/img/1.0/misc/premium.gif" alt="premium content" width="9" height="11" hspace="0" vspace="0" border="0" align="absmiddle">&nbsp;<a href="javascript:LaunchVideo('/law/2004/11/13/dornin.peterson.look.ahead.ktvu.','300k');">Video</a><br></div><div class="cnnMainNewT2">         &#149;&nbsp;<b><span class="cnnBodyText" style="font-weight:bold;">CNN/Money: </span></b> <a href="/money/2004/11/17/news/fortune500/sears_kmart/index.htm?cnn=yes">Kmart-Sears in $11 billion deal</a> | <img src="http://i.cnn.net/cnn/.element/img/1.0/misc/premium.gif" alt="premium content" width="9" height="11" hspace="0" vspace="0" border="0" align="absmiddle">&nbsp;<a 
href="javascript:LaunchVideo('/business/2004/11/17/snow.kmart.sears.merger.cnn.','300k');">Video</a><br></div><div class="cnnMainNewT2">         &#149;&nbsp;<a href="/2004/US/11/17/nude.newswoman.ap/index.html">News anchor appears nude</a><br></div><div class="cnnMainNewT2">         &#149;&nbsp;<a href="/2004/SHOWBIZ/Movies/11/17/sexiest.man.reut/index.html">Jude Law named 'Sexiest Man Alive'</a><br> </div></div></div><!-- /T2 --><div><img src="http://i.cnn.net/cnn/images/1.gif" alt="" width="1" height="10"></div><!-- =========== CNN Radio/Video Box =========== --><table width="344" border="0" cellpadding="0" cellspacing="0"><tr><td colspan="5" bgcolor="#cccccc"><img src="http://i.a.cnn.net/cnn/images/1.gif" alt="" width="1" height="1" hspace="0" vspace="0"></td></tr><tr valign="top"><td width="1" bgcolor="#cccccc"><img src="http://i.a.cnn.net/cnn/images/1.gif" alt="" width="1" height="1" hspace="0" vspace="0"></td><td width="114"><div class="cnn6pxPad"><span class="cnnBigPrint" style="color:#C00;font-weight:bold;">CNN</span><span class="cnnBigPrint" style="color:#000;font-weight:bold;">RADIO</span><div class="cnnMainNewT2"><a href="javascript:CNN_openPopup('/audio/radio/preferences.html','radioplayer','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=200,height=124')">Latest updates</a>
Indices 22-27 are perfect, but index 28 is the problem. And yes, that whole mess of code is all one match, #28, and there are lots more matches just like it.

Why does it sometimes keep on matching past the closing </a>?

Posted: Thu Nov 18, 2004 2:25 am
by timvw
Because (.*) is "greedy": it matches as much as it can, so the match runs on to the last </a> it can reach. Try (.*?)
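
A minimal sketch of the difference, assuming a made-up two-link snippet in $html instead of the real CNN page (the variable names $greedy and $lazy are just for illustration):

```php
<?php
// Hypothetical input: two links on one line, standing in for the fetched page.
$html = '<a href="/one/">One</a> | <a href="/two/">Two</a>';

// Greedy: each (.*) takes as much as it can, so the match only ends
// at the LAST </a> it can reach.
preg_match_all('/<a(.*)>(.*)<\/a>/i', $html, $greedy);

// Non-greedy: each (.*?) takes as little as it can, so the match ends
// at the FIRST </a> after each <a.
preg_match_all('/<a(.*?)>(.*?)<\/a>/i', $html, $lazy);

print_r($greedy[0]); // one match spanning both links
print_r($lazy[0]);   // two matches, one per link
```

With the greedy pattern both links come back glued together as a single match, which is the same thing that happens at index 28 above.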

Posted: Thu Nov 18, 2004 2:45 am
by rehfeld
perfect :)

Why/how is .* different from .*? though?

I thought .* meant 0 or more of any character, and I thought ? meant the preceding character is optional.
So to me it seems like I'm saying the same thing twice: "0 or more of any character" vs. "possibly 0 or more of any character".
I would have thought the "possibly" was already covered by the fact that it could be 0 occurrences of any character...

sprinkle me :D

Posted: Thu Nov 18, 2004 3:28 am
by timvw
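Do a web search for "regular expression greedy matching" and take the first link:

http://www.cs.tut.fi/~jkorpela/perl/course.html paragraph "Matching is greedy"

A ? after an ordinary item makes that item optional. A ? directly after a quantifier (*, +, ? or {n,m}) does something else: it makes that quantifier non-greedy. So .*? still means 0 or more of any character, but it takes as few as it can instead of as many as it can, and ?? is an optional item that prefers to be left out.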
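
A small sketch of the two roles of ?, using made-up test strings (nothing here comes from the CNN page; $m1 to $m5 are just illustrative names):

```php
<?php
// ? after a literal character: that character is optional.
preg_match('/colou?r/', 'color', $m1);      // $m1[0] is "color"

// ? after a quantifier: the quantifier becomes non-greedy.
preg_match('/<.+>/', '<b>bold</b>', $m2);   // greedy:     $m2[0] is "<b>bold</b>"
preg_match('/<.+?>/', '<b>bold</b>', $m3);  // non-greedy: $m3[0] is "<b>"

// ?? is "optional, but prefer to leave it out".
preg_match('/ab?/', 'ab', $m4);             // $m4[0] is "ab"
preg_match('/ab??/', 'ab', $m5);            // $m5[0] is "a"

print_r([$m1[0], $m2[0], $m3[0], $m4[0], $m5[0]]);
```

Both .+ and .+? are allowed to match the same strings; the trailing ? only changes which possibility the engine tries first.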