
Possible to preg_match these domains?

Posted: Fri Oct 22, 2004 1:55 am
by idotcom
Hi there,

I have spent the last 2 days trying to find or build a script that will read a web page from a URL and do a preg_match on all domain types found on the page. I have had very little luck, so ANY help would be excellento.

Basically, here's what I'm trying to do.

Using a form, I want to enter a URL, submit, and have the PHP script fopen the page, go through all the HTML or text, and create a list of URLs and domains on that page. But it's been hard making it recognize URLs both with and without the http.

So if a page had, say, 6 text-written URLs or domains, all different, like:

http://www.domain.com
http://domain2.net
http://www.domain3.org
http://www.domain4.info
domain5.com
123domain.com

I want to list all those just like above... nothing fancy...
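For what it's worth, here is a rough sketch of the whole idea in one go. The function name extractUrls and the short TLD list are made up here for illustration, and the pattern is a starting point, not bulletproof:

```php
<?php
// Sketch: pull schemed URLs and bare domains out of a chunk of text,
// then prefix "http://" onto anything that lacks a scheme.
// extractUrls and the TLD list are illustrative assumptions.
function extractUrls($text)
{
    $pattern = '~\b(?:(?:https?|ftp)://[a-z0-9.-]+(?:/[^\s"\'<>]*)?'
             . '|[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|net|org|info)\b)~i';
    preg_match_all($pattern, $text, $matches);

    $urls = array();
    foreach ($matches[0] as $hit) {
        // Normalize: add a scheme when the match is a bare domain.
        if (!preg_match('~^(?:https?|ftp)://~i', $hit)) {
            $hit = 'http://' . $hit;
        }
        $urls[] = $hit;
    }
    return array_unique($urls);
}

$sample = "Visit http://www.domain.com or domain5.com and 123domain.com";
print implode("\n", extractUrls($sample));
```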

Here is some "tryout code" I've tried... I'm not a PRO :roll: most of this code is borrowed from other places. :wink:


This first one halfway does it... it does not convert all the domains and URLs, plus it also returns text and content surrounding them.

Code:

<?php
function makeLinks($text)
{
	// Link anything that begins with http:// or ftp://.
	$text = eregi_replace('(((f|ht){1}tp://)[-a-zA-Z0-9@:%_\+.~#?&//=]+)', '<a href="\\1">\\1</a>', $text);
	// Link www. addresses, but only when preceded by a bracket character.
	$text = eregi_replace('([[]()[{}])(www.[-a-zA-Z0-9@:%_\+.~#?&//=]+)', '\\1<a href="http://\\2">\\2</a>', $text);

	return $text;
}

// $url comes from the submitted form.
$the_page = fopen($url, "r");

while (!feof($the_page))
{
	$each_line = fgetss($the_page, 80000);
	echo makeLinks($each_line);
}
fclose($the_page);
?>
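A side note for anyone reading this later: the ereg family was deprecated in PHP 5.3 and removed in PHP 7, so a preg_replace version of the same two substitutions might look like this (a sketch, with roughly the same matching limits as the original):

```php
<?php
// Sketch of makeLinks() redone with preg_replace; the ereg family
// was deprecated in PHP 5.3 and removed in PHP 7.
function makeLinksPreg($text)
{
    // Link anything that already carries an http/https/ftp scheme.
    $text = preg_replace(
        '!\b((?:https?|ftp)://[-a-zA-Z0-9@:%_+.~#?&/=]+)!',
        '<a href="$1">$1</a>',
        $text
    );
    // Link bare www. addresses that the first pass did not touch.
    $text = preg_replace(
        '!(^|[\s(])(www\.[-a-zA-Z0-9@:%_+.~#?&/=]+)!',
        '$1<a href="http://$2">$2</a>',
        $text
    );
    return $text;
}

echo makeLinksPreg("see www.domain.com and http://domain2.net");
```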

Now this one works better, but it does not match all the domains and URLs, only those with http in them. Like I mentioned above, I need more flexibility.

Code:

<?php
// Returns true if $Find occurs anywhere in $String.
function instring($String, $Find, $CaseSensitive = false)
{
	if (!$CaseSensitive)
	{
		// Case-insensitive: lowercase both sides before comparing.
		$Find = strtolower($Find);
		$String = strtolower($String);
	}

	$i = 0;
	while (strlen($String) >= $i)
	{
		$substring = substr($String, $i, strlen($Find));
		if ($substring == $Find) return true;
		$i++;
	}
	return false;
}

if ($url)
{
	$html = @implode("", file($url));

	// Only matches URLs that begin with http:// or ftp://.
	@preg_match_all('(((f|ht){1}tp://)[-a-zA-Z0-9@:%_\+.~#?&//=]+)', $html, $matches);

	for ($i = 0; $i < count($matches[0]); $i++)
	{
		// Skip any match containing the filter string $find.
		if ($find && instring($matches[0][$i], $find))
		{
			continue;
		}
		echo $matches[0][$i] . "<BR>";
	}
}
?>
I guess it's all in the regex :?:

So... I need to get these:
http://www.domain.com
http://domain2.net
http://www.domain3.org
http://www.domain4.info
domain5.com
123domain.com

out of a web page and make it print out like this:
http://www.domain.com
http://domain2.net
http://www.domain3.org
http://www.domain4.info
http://domain5.com
http://123domain.com


In other words, I want to be able to scan all my pages from all my sites and create a list of the URLs that were found.


PLEASE :lol: Give a poor feller a hand...

I sincerely appreciate any help you've got.

Thanks in advance.

Posted: Fri Oct 22, 2004 1:18 pm
by rehfeld
That's tough... err, well, a lot of code.


do you need this to extract it from links like <a href="http://foo.com"> or just plain text, or both?

narrowing it down will GREATLY simplify this.

Posted: Fri Oct 22, 2004 1:58 pm
by idotcom
thanks for your reply...

Well I suppose anything is a start :lol:

I am mostly looking for just text URLs and domains, not actual links.

Posted: Fri Oct 22, 2004 2:54 pm
by kettle_drum
Try something like:

Code:

'/((http|https):\/\/|www)'.'[a-z0-9\-\._]+\/?[a-z0-9_\.\-\?\+\/~=&#;,]*'.'[a-z0-9\/]{1}/si'
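For example, dropped into preg_match_all (the sample text here is just for illustration):

```php
<?php
// kettle_drum's pattern, tried against some illustrative sample text.
$pattern = '/((http|https):\/\/|www)[a-z0-9\-\._]+\/?[a-z0-9_\.\-\?\+\/~=&#;,]*[a-z0-9\/]{1}/si';

$html = 'Links: http://www.domain.com and www.domain4.info here.';
preg_match_all($pattern, $html, $matches);
print implode("\n", $matches[0]);
```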

Posted: Fri Oct 22, 2004 3:59 pm
by idotcom
kettle_drum,

You are a god!!!

I have tried so many expressions... over and over ... with no good result.

This is by far the best. Thank you!

Now I don't mean to push my luck, but is there any way this can be modified to find domains without the www?

If there is a domain like so: domain.com, it does not find it.


If not... that's ok... you have saved me so much time already.

Thanks again kettle_drum
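One way the pattern could be modified to catch bare domains such as domain5.com is to add an alternative that requires a known TLD at the end. A rough sketch (the TLD list below is a small assumption, not exhaustive):

```php
<?php
// Sketch: extend the matching to bare domains by requiring a known
// TLD at the end. The TLD list below is an assumption, not exhaustive.
$pattern = '!\b(?:(?:https?|ftp)://|www\.)[a-z0-9\-\._]+[a-z0-9/]'
         . '|\b[a-z0-9\-]+(?:\.[a-z0-9\-]+)*\.(?:com|net|org|info|biz)\b!i';

$text = 'plain domain5.com, plus 123domain.com and http://domain2.net';
preg_match_all($pattern, $text, $matches);

foreach ($matches[0] as $hit) {
    // Prefix a scheme so every entry prints as a full URL.
    if (!preg_match('!^(?:https?|ftp)://!i', $hit)) {
        $hit = 'http://' . $hit;
    }
    echo $hit . "\n";
}
```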