
preg_match to find urls on page

Posted: Wed Dec 08, 2004 6:18 pm
by josh
Uhh... the preg_match special characters confuse me.

First of all, I'm sure this has been asked before, but I couldn't find anything and I'm in a hurry to get this done.

(The show 'lost' comes on soon heh)

Code:

<?php
$html=file_get_contents($url);
$html = preg_match_all("/(http:\/\/(.*))[\s]*/", $html, $matches);
// I know it's print_r, print_arr is a function I made that color codes the array
print_arr($matches);
?>
It's supposed to return an array of all the URLs on any given page, but for some reason it's going crazy... try it out and you'll see.

I suck with all these special characters heh. Anyone know what I'm doing wrong?


EDIT... it seems to work on a few pages I try, but not in all cases. Try running it on http://www.google.com/search?hl=en&q=te ... gle+Search for example.

Posted: Wed Dec 08, 2004 6:30 pm
by rehfeld
For starters, replace (.*) with (.*?)

Currently it's being "greedy": (.*) grabs the largest possible match, which here means everything up to the end of the line.
You want it to match the shortest possible match.
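
To make the greedy vs. lazy difference concrete, here's a rough sketch run against a made-up one-line snippet (the a.example and b.example addresses are placeholders, not from the original post). Note that with the lazy version the capture comes back empty, because the [\s]* after it is happy to match nothing:

```php
<?php
// Sample input is an assumption for illustration only.
$html = '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>';

// Greedy: (.*) runs from the first "http://" to the end of the line.
preg_match_all('/http:\/\/(.*)[\s]*/', $html, $greedy);

// Lazy: (.*?) takes as little as possible, and since [\s]* can match
// zero characters, "as little as possible" is nothing at all.
preg_match_all('/http:\/\/(.*?)[\s]*/', $html, $lazy);

print_r($greedy[0]); // one match that swallows both links
print_r($lazy[0]);   // two matches, each just "http://"
```

So the lazy quantifier on its own isn't enough: the pattern also needs something after the capture that must match, to tell it where the URL ends.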

There are lots of things you will need to change though;
that pattern is far too simple to be effective.

This place helped me with regex immensely:
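
As one possible direction (a sketch, not the only way to do it), you can replace (.*) with a character class that stops at whitespace, quotes, and angle brackets, which is where a URL usually ends inside HTML. The function name extract_urls is hypothetical, and this only finds absolute http/https URLs. It ignores relative links, leaves entities like &amp; undecoded, and will pick up trailing punctuation in plain text:

```php
<?php
// Hypothetical helper, assumed for illustration; not from the original post.
function extract_urls($html)
{
    // "~" is used as the delimiter so the slashes need no escaping.
    // [^\s"'<>]+ consumes characters until whitespace, a quote, or an angle bracket.
    preg_match_all('~https?://[^\s"\'<>]+~i', $html, $matches);

    // $matches[0] holds the full matches; drop duplicates and reindex.
    return array_values(array_unique($matches[0]));
}

// Usage with made-up input:
$sample = '<a href="http://a.example/x">A</a> <a href=\'https://b.example/y\'>B</a>';
print_r(extract_urls($sample));
```

For anything serious, an HTML parser that reads the href and src attributes is likely to be more reliable than a regex.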

http://www.regular-expressions.info/tutorial.html

Posted: Wed Dec 08, 2004 6:42 pm
by josh
Thank you