PHP CRAWLER: Extract title and meta description from URL

Do you believe in this idea?

Yes: 9 votes (90%)
That depends... (please describe in reply): 0 votes
No: 1 vote (10%)

Total votes: 10

Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden

Post by Heavy »

This is no more than an embryo.
The idea is to have one's site automatically crawl its referrers for information.
If the site does this and prints the information somewhere near the links, site developers may add a link to my site in order to get a link to their own site published on mine, at no cost and without any marketing.

One could of course have the site send an email to the webmaster first, so he/she can check the referrer link before approving it, in case the referrer is an obscene site or something else that you would not want published.

A bare URL is not very exciting, so here goes a method that fetches the referrer and extracts any occurrence of a title tag or a valid meta description of the site.
The idea could also be extended with more functionality, like searching the referrer site for a special description file, or something like a banner, located at a customisable place.

It sounds good, at least to my ears...
It is a way to get linked to from other sites, since anyone would want to link to a site that links back to them (at least in theory...).

It may sound a little complicated. It took me 15 minutes to explain this to a friend.


Note: the actual intended functionality is not implemented; this is just the code that browses a remote site for info.

Here it goes. It is still very simple and contains a test form:

Code:

<?php
	header("Last-Modified: " . gmdate("D, d M Y H:i:s") . " GMT");

	// Escape a value for safe output inside HTML text or attributes.
	function h($strValue){
		return htmlspecialchars($strValue, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8', false);
	}

	// Return the host part of a URL, or an empty string if none is found.
	function hostname($strURL){
		if (preg_match("/\:\/\/([^\/]+)/", $strURL, $arrResult)){
			return $arrResult[1];
		}
		return "";
	}

	// Fetch a URL, print what was found and return it as an array.
	function ExamineURL($strURL){
		$arrReturn = array();
		echo "<pre style=\"background: #fee\">";
		// Only fetch http(s) URLs, so the test form cannot be used to read local files.
		$strPage = preg_match("/^https?:\/\//i", $strURL) ? @file_get_contents($strURL) : false;
		if ($strPage !== false){
			// $arrResult[1] collects the titles, $arrResult[2] the descriptions.
			if(preg_match_all("/(?i)\<title\s*\>([^<]+)\<\/title\s*\>|\<meta\s+name=\"description\"\s*content=\"([^\"]+)\"\s*\/?\>/", $strPage, $arrResult)){
				$arrReturn['URL'] = $strURL;
				$arrReturn['host'] = hostname($strURL);

				echo "Examined URL: <a href=\"" . h($strURL) . "\">" . h($strURL) . "</a><br>";
				foreach ($arrResult as $patternOrder => $arrMatches){
					foreach($arrMatches as $strMatch){
						if ($patternOrder == 1 && strlen($strMatch)){
							$arrReturn['title'] = $strMatch;
							echo "Examined title: <b>" . h($strMatch) ."</b><br>" ;
						}
						if ($patternOrder == 2 && strlen($strMatch)){
							$arrReturn['descr'] = $strMatch;
							echo "Examined description: <b>" . h($strMatch) ."</b><br>" ;
						}
					}
				}
			}
		}else{
			echo "Error! Could not open: <a href=\"" . h($strURL) . "\">" . h($strURL) . "</a>";
		}
		// Close the block on every path, then hand back whatever was found.
		echo "</pre>";
		return $arrReturn;
	}


	?>
	<form action="<?php echo h(basename($_SERVER['PHP_SELF']))?>" method="POST">
		<input type="text" name="strURL" value="<?php echo h(!empty($_POST['strURL']) ? $_POST['strURL'] : "http://")?>">
		<input type="submit" name="submit" value="Test URL"><br>
	</form>
	<?php
	if (isset($_POST['strURL'])){
		echo h(print_r(ExamineURL($_POST['strURL']), true));
	}

	if (!empty($_SERVER['HTTP_REFERER'])){
		echo h(print_r(ExamineURL($_SERVER['HTTP_REFERER']), true));
	}
?>
But it is not valid HTML though.
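
The regex in the snippet only matches a meta tag written with name before content and with double quotes, so a DOM parser may hold up better on real pages. Here is a minimal sketch of that alternative, assuming PHP's DOM extension is installed. The function name extractTitleAndDescription and the sample HTML are made up for illustration and are not part of the original snippet.

```php
<?php
// Hypothetical helper (not from the original post): pull the title and the
// meta description out of an HTML string with DOMDocument instead of a regex.
function extractTitleAndDescription(string $html): array
{
    $result = ['title' => null, 'descr' => null];
    if (trim($html) === '') {
        return $result; // nothing to parse
    }

    // Real-world pages are rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // First <title> element, if any.
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $result['title'] = trim($titles->item(0)->textContent);
    }

    // First <meta name="description">, whatever the attribute order or case.
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $result['descr'] = trim($meta->getAttribute('content'));
            break;
        }
    }
    return $result;
}

// Sample input with the attributes in the "other" order.
$sample = '<html><head><title> Example page </title>'
        . '<meta content="A short summary." name="Description"></head><body></body></html>';
print_r(extractTitleAndDescription($sample));
```

The function takes the HTML as a string, so fetching the page stays a separate step and the parsing can be tried out without any network access.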
ghost007
Forum Commoner
Posts: 49
Joined: Sat Nov 22, 2003 10:10 am

Post by ghost007 »

This looks like a good idea, but I'm not sure I understood it all, so if you want to give some more information I would certainly be interested.

siech
Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden

Post by Heavy »

If you have page1.php that links to page2.php and you click the link, the browser tells page2.php which page the link was on: the referrer URL.

That referrer URL can be used in page2.php to crawl for some basic information about page1.php, or whatever URL linked to page2.php.

Someone links to your page, and your page automatically displays info that comes from the outside site. You tell the world that the links you show are set up automatically by the system whenever someone links to the page.

There you go, free links.

NOTE:
The code provided is not finished; I just wrote it while playing with the regex.
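
The referrer handling described in this post could be sketched roughly as below. This is only an illustration under a few assumptions: the function name crawlableReferrer and the host mysite.example are hypothetical, and the rules (http/https only, skip links from your own host) are one possible choice. Keep in mind that the Referer header is sent by the client, so it can be missing or faked.

```php
<?php
// Hypothetical helper: decide whether a request's referrer is worth crawling.
// $server is expected to look like $_SERVER; $ownHost is your own host name.
function crawlableReferrer(array $server, string $ownHost): ?string
{
    $referrer = $server['HTTP_REFERER'] ?? '';
    if ($referrer === '') {
        return null; // direct visit, or the browser withheld the header
    }

    $parts = parse_url($referrer);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // not an absolute URL
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return null; // never hand file:// and similar schemes to a fetcher
    }
    if (strcasecmp($parts['host'], $ownHost) === 0) {
        return null; // internal navigation, not an inbound link
    }
    return $referrer;
}

// Example call with a faked $_SERVER array.
$server = ['HTTP_REFERER' => 'http://example.org/page1.php'];
var_dump(crawlableReferrer($server, 'mysite.example'));
```

In page2.php the call would be crawlableReferrer($_SERVER, $_SERVER['HTTP_HOST']), and only a non-null result would be passed on to the crawling code.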
ghost007
Forum Commoner
Posts: 49
Joined: Sat Nov 22, 2003 10:10 am

Post by ghost007 »

nice :)

I will keep a close eye on this thread. If I have some more time I will take a closer look at your code to see if I can work something out myself.

2 brains always better than one (I hope :) )

siech
Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden

Post by Heavy »

ghost007 wrote:2 brains always better than one (I hope :) )
Hopefully :lol:
n00b Saibot
DevNet Resident
Posts: 1452
Joined: Fri Dec 24, 2004 2:59 am
Location: Lucknow, UP, India

Post by n00b Saibot »

I say, are 3 enough?
I've joined you too 8)
fresh
Forum Contributor
Posts: 259
Joined: Mon Jun 14, 2004 10:39 am
Location: Amerika

Post by fresh »

At first I was hesitant about this idea; it sounded a bit dirty. But what you're saying is that when someone goes to that page, it looks through the referrer and tells them info about the page they just left... is that right?

Sorry if I am making this more complicated than it needs to be. From what I do understand, your idea is quite creative. Good job :D
Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden

Post by Heavy »

Yes, you are sort of getting me right.
And about it being dirty:
Yes, it can easily get rather dirty, but if it is well implemented, with batch jobs, a cache, whatever (tm), it will probably give back more than it costs.

I actually haven't implemented it yet... :roll:
Feel free to do whatever you like with the idea.
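
For the cache part, something like the following might be a starting point. It is a rough sketch under stated assumptions, not a finished design: cachedLookup, the one-file-per-URL JSON layout and the one-day lifetime are all hypothetical choices, and $fetch stands for whatever function does the actual crawling and returns an array.

```php
<?php
// Hypothetical file cache: remember what was found for a URL so the remote
// site is not fetched again on every page view.
function cachedLookup(string $url, callable $fetch, string $cacheDir, int $ttl = 86400): array
{
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0775, true);
    }
    $file = rtrim($cacheDir, '/') . '/' . sha1($url) . '.json';

    // Serve the stored copy while it is younger than $ttl seconds.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        $cached = json_decode(file_get_contents($file), true);
        if (is_array($cached)) {
            return $cached;
        }
    }

    // Otherwise do the expensive lookup and store the result.
    $info = $fetch($url);
    file_put_contents($file, json_encode($info), LOCK_EX);
    return $info;
}

// Demo with a stand-in fetcher, so nothing touches the network.
$demoDir = sys_get_temp_dir() . '/referrer_cache_demo_' . uniqid();
$fakeFetch = function (string $url): array {
    return ['URL' => $url, 'title' => 'Example title', 'descr' => 'Example description'];
};
print_r(cachedLookup('http://example.org/page1.php', $fakeFetch, $demoDir));
```

A batch job could call the same function with a short lifetime to refresh entries in the background, so visitors only ever hit the cached copy.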
n00b Saibot
DevNet Resident
Posts: 1452
Joined: Fri Dec 24, 2004 2:59 am
Location: Lucknow, UP, India

Post by n00b Saibot »

Say, if you store the info in a db, maybe you can check who visited when, from where, and more such info..
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

But in the end, won't you be doing pretty much the same as
http://www.phpopentracker.de/en/index.php ?
mudkicker
Forum Contributor
Posts: 479
Joined: Wed Jul 09, 2003 6:11 pm
Location: Istanbul, TR

Post by mudkicker »

Is the code not working?
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Sorry.. the old post updater kinda chewed up the code.. I fixed it, so it should be working now.
mudkicker
Forum Contributor
Posts: 479
Joined: Wed Jul 09, 2003 6:11 pm
Location: Istanbul, TR

Post by mudkicker »

yep, now it works. 8)
painperdu
Forum Newbie
Posts: 12
Joined: Fri Mar 04, 2005 4:44 am

Post by painperdu »

I think Heavy is describing an automated link exchange script.