Possible to preg_match these domains?
Posted: Fri Oct 22, 2004 1:55 am
Hi there,
I have spent the last two days trying to find or build a script that will read a web page from a URL and preg_match every domain found on the page. I have had very little luck, so ANY help would be excellent.

Basically, here's what I'm trying to do: using a form, I want to enter a URL, submit it, and have the PHP script fopen the page, go through all the HTML or text, and build a list of the URLs and domains on that page. The hard part has been making it recognize URLs both with and without the http:// prefix.

So if a page had, say, six text-written URLs or domains, all different, like:
http://www.domain.com
http://domain2.net
http://www.domain3.org
http://www.domain4.info
domain5.com
123domain.com
I want to list all of those just like above... nothing fancy.

Here is some "tryout code" I've tried. I'm not a PRO; most of this code is borrowed from other places.

This first one halfway does it: it does not convert all of the domains and URLs, and it also returns the text and content surrounding them.
Code:
<?php
// Turn plain-text URLs in a chunk of HTML into clickable links.
// (eregi_replace() is the old POSIX regex function; preg_replace()
// with the /i modifier is the usual alternative.)
function makeLinks($text)
{
    // Linkify anything that starts with http:// or ftp://
    $text = eregi_replace('(((f|ht){1}tp://)[-a-zA-Z0-9@:%_\+.~#?&//=]+)', '<a href="\\1">\\1</a>', $text);
    // Linkify www. addresses that have no scheme
    $text = eregi_replace('([[]()[{}])(www.[-a-zA-Z0-9@:%_\+.~#?&//=]+)', '\\1<a href="http://\\2">\\2</a>', $text);
    return $text;
}

$url = $_POST['url']; // the address submitted by the form (was undefined before)
$the_page = fopen($url, "r");
while (!feof($the_page))
{
    $each_line = fgetss($the_page, 80000);
    echo makeLinks($each_line);
}
fclose($the_page);
?>
Now this one works better, but it still does not match all of the domains and URLs, only the ones with http in them. Like I mentioned above, I need more flexibility. I guess it's all in the regex.
Code:
<?php
// Case-(in)sensitive substring check.
// Fixed: the original lowercased both strings when $CaseSensitive was TRUE,
// which inverted the flag, and it re-lowercased them on every loop pass.
// (PHP's built-in strpos() would do the same job without the loop.)
function instring($String, $Find, $CaseSensitive = false)
{
    if (!$CaseSensitive)
    {
        $Find = strtolower($Find);
        $String = strtolower($String);
    }
    $i = 0;
    while (strlen($String) >= $i)
    {
        $substring = substr($String, $i, strlen($Find));
        if ($substring == $Find) return true;
        $i++;
    }
    return false;
}

if ($url)
{
    $html = @implode("", file($url));
    @preg_match_all('(((f|ht){1}tp://)[-a-zA-Z0-9@:%_\+.~#?&//=]+)', $html, $matches);
    for ($i = 0; $i < count($matches[0]); $i++)
    {
        // $find holds a string to skip (e.g. my own domain); it must be
        // set before this loop, or every match gets echoed.
        if (instring($matches[0][$i], $find))
        {
            $no = 1;
        }
        else
        {
            echo $matches[0][$i] . "<BR>";
        }
    }
} // this closing brace for if($url) was missing
?>

So... I need to get these
http://www.domain.com
http://domain2.net
http://www.domain3.org
http://www.domain4.info
domain5.com
123domain.com
out of a web page and make it print out like this
http://www.domain.com
http://domain2.net
http://www.domain3.org
http://www.domain4.info
http://domain5.com
http://123domain.com
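For reference, here is a minimal sketch of the kind of pattern that would catch both forms and prepend http:// to the bare domains. The TLD list (com|net|org|info) and the variable names are my own assumptions, and the pattern is not battle-tested against arbitrary pages:

```php
<?php
// Sketch: $html is assumed to hold the fetched page source.
// First alternative: full URLs with a scheme (http, https, ftp).
// Second alternative: bare domains ending in one of the listed TLDs.
$pattern = '#\b(?:https?|ftp)://[-a-zA-Z0-9@:%_+.~\#?&/=]+'
         . '|\b[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*\.(com|net|org|info)\b#i';

preg_match_all($pattern, $html, $matches);

foreach ($matches[0] as $match) {
    // Prepend the scheme when the match is a bare domain.
    if (strncasecmp($match, 'http', 4) !== 0 &&
        strncasecmp($match, 'ftp', 3) !== 0) {
        $match = 'http://' . $match;
    }
    echo $match . "<BR>\n";
}
?>
```

Because the scheme alternative comes first, a full URL is consumed whole before the bare-domain alternative can match inside it; the bare-domain branch only fires on addresses with no scheme.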
In other words, I want to be able to scan all the pages from all my sites and create a list of the URLs that were found.

PLEASE give a poor feller a hand...

I sincerely appreciate any help you've got. Thanks in advance.