Pull data from website

Posted: Fri Dec 03, 2004 3:06 am
by angelena
I would like to know what keywords I should use to search this forum for information on pulling data from another website to display on my own site.

Thanks ya.

Posted: Fri Dec 03, 2004 3:49 am
by rehfeld
regex

html parse



There's a term for it though, I just can't think of it.

Posted: Fri Dec 03, 2004 9:20 am
by EricS
This is assuming you are trying to request a web page and not just a file on a remote server.

1. You are going to use fsockopen() to open a connection to the remote server.

2. Then use fputs() to send the request string (requesting whatever page you want) to the web server.

3. Then use a while loop with fread() until you hit the end of the file being sent.
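
The three steps above can be sketched roughly as follows. This is a minimal sketch and not the class posted further down: the function names `build_get_request`, `read_until_eof` and `fetch_page` are made up for illustration, it assumes plain HTTP on port 80, and it does no status or timeout checking beyond the connect timeout.

```php
<?php
// Hypothetical helper: assemble a minimal HTTP/1.0 GET request string.
function build_get_request($host, $path = '/')
{
    $request  = "GET " . $path . " HTTP/1.0\r\n";
    $request .= "Host: " . $host . "\r\n";
    $request .= "\r\n"; // a blank line ends the headers
    return $request;
}

// Hypothetical helper: read from any stream until end of file.
function read_until_eof($stream, $packetSize = 1024)
{
    $data = '';
    while (!feof($stream)) {
        $chunk = fread($stream, $packetSize);
        if ($chunk === false) {
            break; // read error: stop rather than loop forever
        }
        $data .= $chunk;
    }
    return $data;
}

// Steps 1 to 3 together. Returns the raw response (headers and body) or null.
function fetch_page($host, $path = '/', $timeout = 5)
{
    $socket = @fsockopen($host, 80, $errNo, $errStr, $timeout); // step 1: connect
    if (!is_resource($socket)) {
        return null;
    }
    fputs($socket, build_get_request($host, $path));            // step 2: send request
    $response = read_until_eof($socket);                         // step 3: read reply
    fclose($socket);
    return $response;
}
```

Calling `fetch_page('www.example.com', '/')` would return the status line, headers and body as one string, so the caller still has to separate the headers from the page itself.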

I recently wrote a spider that goes out and parses pages from web sites, so I'm pretty familiar with what you are trying to accomplish. I couldn't find any good tutorials to help me out, so I had to do all the leg work myself.

You need to become familiar with HTTP, the protocol that runs on top of the TCP connection. That is how you will know how to assemble the request string and how to interpret the status codes returned by the server you're contacting.
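
To make the status codes concrete: a response starts with a status line such as "HTTP/1.0 200 OK", followed by header lines, a blank line, and then the body. Below is a minimal sketch of splitting a raw response into those parts. The name `parse_http_response` is hypothetical, and the sketch assumes the whole response is already in one string and keeps only the last value of a repeated header.

```php
<?php
// Hypothetical helper: split a raw HTTP response into status code, headers and body.
function parse_http_response($raw)
{
    // Headers and body are separated by the first blank line.
    $parts = preg_split("/\r?\n\r?\n/", $raw, 2);
    $head  = $parts[0];
    $body  = isset($parts[1]) ? $parts[1] : '';

    $lines      = preg_split("/\r?\n/", $head);
    $statusLine = array_shift($lines);

    // The status line looks like "HTTP/1.0 200 OK".
    if (!preg_match('#^HTTP/\d+\.\d+ (\d{3})#i', $statusLine, $m)) {
        return null; // not an HTTP response
    }

    $headers = array();
    foreach ($lines as $line) {
        $pos = strpos($line, ':');
        if ($pos === false) {
            continue; // skip malformed header lines
        }
        // Header names are case-insensitive, so store them in lower case.
        $name = strtolower(trim(substr($line, 0, $pos)));
        $headers[$name] = trim(substr($line, $pos + 1));
    }

    return array('status' => (int) $m[1], 'headers' => $headers, 'body' => $body);
}
```

A status of 200 means the page was returned, 302 means the server is redirecting to the URL in the Location header, and 404 means the page was not found.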

I will tell you that it's no trivial task and I will post the class that I wrote, and the two classes that it depends on, so you can see what I did. Hopefully that will help you out, considering all the trouble I went through to get this class written.

Web Page Fetcher

<?php
require_once 'ES_SimpleTimer.php';
require_once 'ES_URLManipulator.php';
class ES_WebPageFetcher {
	function ES_WebPageFetcher() {
		
	}
	function fetch ($webPageURL) {
		$error = false;
		$maxTime = 15;
		$packetSize = 1024;
		$streamTimeOut = 5;
		$urlManipulator = new ES_URLManipulator();
		$timer = new ES_SimpleTimer();
		$timer->start();
		// extract domain
		$domain = $urlManipulator->extractDomain($webPageURL);
		// extract protocol
		$protocolessURL = $urlManipulator->removeProtocol($webPageURL);
		// connect to the server
		$socketResource = @fsockopen($domain, 80, $errNo, $errStr, $streamTimeOut);
		if (!is_resource($socketResource)) {
			return NULL;
		}
		// set time out
		stream_set_timeout($socketResource, $streamTimeOut);
		$socketRequest  = 'GET http://'.$protocolessURL.' HTTP/1.0'."\r\n";
		$socketRequest .= 'Host: '.$domain."\r\n";
		$socketRequest .= 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'."\r\n\r\n";
		fputs($socketResource, $socketRequest);
		// receive status line
		$statusPattern = '#^http/\d+\.\d+ (\d+)#i';
		$headContent = trim(fgets($socketResource, $packetSize));
		// check timeout
		$streamMetaData = stream_get_meta_data($socketResource);
		if ($streamMetaData['timeout']) {
			fclose($socketResource);
			return NULL;
		}
		// check timer
		if ($timer->fetchRunningTime() > $maxTime) {
			fclose($socketResource);
			return NULL;
		}
		// accept only 200 (OK) and 302 (Found); note that a 302 redirect is not followed
		if (!preg_match($statusPattern, $headContent, $statusMatch)
			|| !in_array($statusMatch[1], array('200', '302'), true)) {
			fclose($socketResource);
			return NULL;
		}
		// remove the rest of the header
		while(!empty($headContent)) {
			$headContent = trim(fgets($socketResource, $packetSize));
			// check timeout
			$streamMetaData = stream_get_meta_data($socketResource);
			if ($streamMetaData['timeout']) {
				fclose($socketResource);
				return NULL;
			}
			// check timer
			if ($timer->fetchRunningTime() > $maxTime) {
				fclose($socketResource);
				return NULL;
			}
		}
		// get the page contents
		$contents = fgets($socketResource, $packetSize);
		// check timeout
		$streamMetaData = stream_get_meta_data($socketResource);
		if ($streamMetaData['timeout']) {
			fclose($socketResource);
			return NULL;
		}
		// check timer
		if ($timer->fetchRunningTime() > $maxTime) {
			fclose($socketResource);
			return NULL;
		}
		$goodOpeningPattern = '#^<#i';
		if (!preg_match($goodOpeningPattern, $contents)) {
			$contents = '';
		}
		while (!feof($socketResource)) {
			$contents .= fread($socketResource, $packetSize);
			// check timeout
			$streamMetaData = stream_get_meta_data($socketResource);
			if ($streamMetaData['timeout']) {
				fclose($socketResource);
				return NULL;
			}
			// check timer
			if ($timer->fetchRunningTime() > $maxTime) {
				fclose($socketResource);
				return NULL;
			}
		}
		fclose($socketResource);
		return $contents;
	}
}
?>
Simple Timer

<?php
class ES_SimpleTimer {
	var $startTime;
	var $stopTime;
	function ES_SimpleTimer() {
	}
	function start() {
		list($_usec, $_sec) = explode(" ", microtime());
		$this->startTime = (float)$_usec + (float)$_sec;
	}
	function stop() {
		list($_usec, $_sec) = explode(" ", microtime());
		$this->stopTime = (float)$_usec + (float)$_sec;
	}
	function fetchTime() {
		$totalTime = $this->stopTime - $this->startTime;
		return number_format($totalTime, 5);
	}
	function fetchRunningTime() {
		// microtime(true) returns the current time as a float (PHP 5+)
		$elapsedTime = microtime(true) - $this->startTime;
		// return a float rather than a formatted string so callers can compare it numerically
		return round($elapsedTime, 5);
	}
}
?>
URL Manipulator

<?php
class ES_URLManipulator {
	function ES_URLManipulator () {
	}
	function extractDomain ($url) {
		// remove protocol
		$pattern = '#http://|https://|ftp://|ftps://|smb://#is';
		$url = preg_replace($pattern, '', $url);
		// extract domain
		$pattern = '#/#is';
		$urlParts = preg_split($pattern, $url, -1, PREG_SPLIT_NO_EMPTY);
		return $urlParts[0];
	}
	function removeProtocol ($url) {
		// remove protocol
		$pattern = '#http://|https://|ftp://|ftps://|smb://#is';
		return preg_replace($pattern, '', $url);
	}
}
?>
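
One thing to watch in the class above: the protocol pattern is not anchored, so it also strips an "http://" that appears later in the URL, for example inside a query string. A possible alternative, sketched here with PHP's built-in parse_url(), is shown below. The function names `extract_host` and `strip_scheme` are hypothetical and not part of the posted classes.

```php
<?php
// Hypothetical alternative to extractDomain(): let parse_url() find the host.
function extract_host($url)
{
    $host = parse_url($url, PHP_URL_HOST);
    // parse_url() gives null when the URL has no host and false on a badly malformed URL.
    return is_string($host) ? $host : null;
}

// Hypothetical alternative to removeProtocol(): strip only one leading "scheme://".
function strip_scheme($url)
{
    return preg_replace('#^[a-z][a-z0-9+.-]*://#i', '', $url, 1);
}
```

Unlike the regex version, `extract_host` also drops a port number such as ":8080" from the result.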

Posted: Fri Dec 03, 2004 3:57 pm
by Todd_Z
I'm trying to use your code, but it's returning NULL here. What can I do to prevent this, if there is anything to be done?

<?php
        if (!is_resource($socketResource)) {
            return NULL;
        }
?>

Posted: Fri Dec 03, 2004 4:12 pm
by EricS
Okay, here is the code in question. You have to look at the line directly above it to see why it's returning NULL there.

<?php
        // connect to the server
        $socketResource = @fsockopen($domain, 80, $errNo, $errStr, $streamTimeOut);
        if (!is_resource($socketResource)) {
            return NULL;
        }
?>
I'm creating a socket connection to some server. I have an @ in front of the fsockopen() function because I need the script to continue to run even if it encounters an error trying to connect. Remove the @ in front of fsockopen() and see what error is being generated.

You could also place some print statements immediately after the fsockopen() line to print the $errNo and $errStr being generated.
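
A minimal sketch of that kind of check, kept separate from the class so it can be run on its own. The name `try_connect` and the shape of the returned array are made up for illustration.

```php
<?php
// Hypothetical helper: attempt a connection and report why it failed.
function try_connect($host, $port = 80, $timeout = 5)
{
    $errNo  = 0;
    $errStr = '';
    // The @ still hides PHP's warning; the details are kept in $errNo and $errStr.
    $socket = @fsockopen($host, $port, $errNo, $errStr, $timeout);
    if (!is_resource($socket)) {
        return array('ok' => false, 'errno' => $errNo, 'errstr' => $errStr);
    }
    fclose($socket);
    return array('ok' => true, 'errno' => 0, 'errstr' => '');
}

// Example use (www.example.com is a placeholder host):
// $result = try_connect('www.example.com');
// if (!$result['ok']) {
//     echo 'Connect failed: ' . $result['errno'] . ' ' . $result['errstr'] . "\n";
// }
```

According to the PHP manual, an error number of 0 together with a failed connect means the error happened before the connect call itself, most likely while initializing the socket.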

Posted: Fri Dec 03, 2004 4:20 pm
by EricS
A couple other things to keep in mind.

1. This script is solely for fetching pages from a web server. It shouldn't be used to open pages on remote servers that aren't being served up by web server software, such as Apache or IIS.

2. It uses port 80 for connecting to a server.

3. I realize this could be accomplished much more easily with cURL, but it's not available on the server this script is running on. (I'm working on a cURL version of this script; I will post it when it's finished if people would like to see it.)

Posted: Fri Dec 03, 2004 4:49 pm
by Todd_Z
The site I'm trying is a regular http://www.site.com/ URL. Is there a specific reason that it could get blocked, as happened to me?

Posted: Fri Dec 03, 2004 4:56 pm
by timvw
i think most people just use curl (extension or binary) or file_get_contents
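
For comparison, here is a minimal sketch of the file_get_contents() route. It assumes allow_url_fopen is enabled so that http:// URLs can be opened; `fetch_url` and the `ExampleFetcher/0.1` user agent are made-up names for illustration.

```php
<?php
// Hypothetical wrapper around file_get_contents() with a timeout and a User-Agent.
function fetch_url($url, $timeout = 5, $userAgent = 'ExampleFetcher/0.1')
{
    // These "http" options only apply to http:// and https:// URLs.
    $context = stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'timeout'    => $timeout,
            'user_agent' => $userAgent,
        ),
    ));
    // The @ suppresses the warning; failure is reported as null instead.
    $contents = @file_get_contents($url, false, $context);
    return ($contents === false) ? null : $contents;
}
```

The whole socket, request and header handling from the class above is done by PHP here, at the cost of less control over each step.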

Posted: Fri Dec 03, 2004 5:03 pm
by EricS
Exactly what error are you getting?

There are a few things you can easily modify that might let you through.

On line 28 of ES_WebPageFetcher you will notice the host in the request string is being set to the domain of the page you're trying to get. You can change that host to anything you want.

Also, on line 27, you will notice that the protocol being specified is HTTP/1.0. You could try using HTTP/1.1 instead. That comes with its own downsides though. The HTTP/1.1 specification states that you must be able to handle chunked data. My script doesn't, as hard as I tried to get it to, so some downloads using HTTP/1.1 get kind of slow, which is the reason I used 1.0. It just seems to work faster.
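
For reference, chunked transfer encoding sends the body as a series of chunks, each preceded by its size in hexadecimal, and ends with a zero-size chunk. Below is a minimal sketch of a decoder. It assumes the whole body has already been read into a string, it ignores any trailer headers after the last chunk, and `decode_chunked` is a hypothetical name that is not part of the posted class.

```php
<?php
// Hypothetical helper: decode a body sent with "Transfer-Encoding: chunked".
// Returns the decoded body, or null if the data is not valid chunked encoding.
function decode_chunked($body)
{
    $decoded = '';
    $offset  = 0;
    $length  = strlen($body);

    while ($offset < $length) {
        // Each chunk starts with its size in hexadecimal, ended by CRLF.
        $lineEnd = strpos($body, "\r\n", $offset);
        if ($lineEnd === false) {
            return null;
        }
        $sizeLine = substr($body, $offset, $lineEnd - $offset);
        // Ignore optional chunk extensions after a semicolon.
        $sizeParts = explode(';', $sizeLine, 2);
        $sizeHex   = trim($sizeParts[0]);
        if ($sizeHex === '' || !ctype_xdigit($sizeHex)) {
            return null;
        }
        $size   = hexdec($sizeHex);
        $offset = $lineEnd + 2;

        if ($size === 0) {
            return $decoded; // a zero-size chunk marks the end of the body
        }
        if ($offset + $size > $length) {
            return null; // truncated chunk
        }
        $decoded .= substr($body, $offset, $size);
        $offset  += $size + 2; // skip the chunk data and its trailing CRLF
    }
    return null; // ran out of data before the terminating zero-size chunk
}
```

Sticking with HTTP/1.0 avoids the need for this, since the server then sends the body unchunked and signals the end by closing the connection, which is what the feof() loop in the class relies on.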

Lastly, you could try modifying the User-Agent on line 29, although that's a very common user agent, which is why I specified it.

Hope that helps,
EricS

Posted: Fri Dec 03, 2004 5:47 pm
by Todd_Z
I get the errors on line 22, so I'm not even getting there yet :( Try out http://www.mechg.com, which is a URL that I have found won't work. Any more ideas? Is there something that people can do to block the connection to their site? cURL, fopen and fsockopen all fail.

Posted: Fri Dec 03, 2004 6:37 pm
by rehfeld
They could be sniffing the user agent and denying requests from certain user agents (or requests with no user agent at all).

You can try a different user agent in the code.

It's also possible they are denying HTTP 1.0 requests, but I doubt it.

Posted: Fri Dec 03, 2004 8:11 pm
by Todd_Z
Is it possible to make the headers replicate, for example, Firefox exactly, so that the site couldn't tell the difference?

Posted: Fri Dec 03, 2004 8:35 pm
by rehfeld
Of course. Just send the headers which Firefox uses.

You can send as many headers as you like in the Web Page Fetcher part of the code, where you see this:

        $socketRequest  = 'GET http://'.$protocolessURL.' HTTP/1.0'."\r\n";
        $socketRequest .= 'Host: '.$domain."\r\n";
        $socketRequest .= 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'."\r\n\r\n";
Just add more headers in. Each header must be on its own line.
The blank line at the end of the request (the doubled line ending on the User-Agent line) signals the end of the headers, so you'll need to put any additional headers before that line.
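
One way to keep the blank line last no matter how many headers are added is to build the request from an array. This is a minimal sketch: `build_request_with_headers` is a hypothetical name, and the header values in the usage example are placeholders, so copy the real ones from the browser you want to imitate.

```php
<?php
// Hypothetical helper: build a GET request from an array of header name => value pairs.
function build_request_with_headers($host, $path, array $headers)
{
    $request  = 'GET ' . $path . ' HTTP/1.0' . "\r\n";
    $request .= 'Host: ' . $host . "\r\n";
    foreach ($headers as $name => $value) {
        $request .= $name . ': ' . $value . "\r\n"; // one header per line
    }
    return $request . "\r\n"; // the blank line always comes last
}

// Example use with placeholder values:
// $request = build_request_with_headers('www.example.com', '/', array(
//     'User-Agent'      => 'ExampleBrowser/1.0',
//     'Accept'          => 'text/html',
//     'Accept-Language' => 'en-us,en;q=0.5',
// ));
```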

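To see which headers a browser is sending, make a new PHP file with the following and open it in that browser (apache_request_headers() is available when PHP runs under Apache and some other server APIs):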

<?php

print_r(apache_request_headers());

?>
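
If apache_request_headers() is not available on your server, the same information can usually be rebuilt from $_SERVER, where PHP stores most request headers under keys starting with HTTP_. This is a minimal sketch: `headers_from_server` is a hypothetical name, and headers that PHP stores without the HTTP_ prefix (such as Content-Type) are not covered.

```php
<?php
// Hypothetical fallback: rebuild request headers from a $_SERVER-style array.
function headers_from_server(array $server)
{
    $headers = array();
    foreach ($server as $key => $value) {
        if (strpos($key, 'HTTP_') !== 0) {
            continue; // only HTTP_* entries come from request headers
        }
        // Turn a key such as HTTP_USER_AGENT into User-Agent.
        $words = ucwords(strtolower(str_replace('_', ' ', substr($key, 5))));
        $headers[str_replace(' ', '-', $words)] = $value;
    }
    return $headers;
}

// Example use inside a web request:
// print_r(headers_from_server($_SERVER));
```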

Posted: Fri Dec 03, 2004 8:40 pm
by Todd_Z

<?php
$socketResource = @fsockopen("www.mechg.com", 80, $errNo, $errStr, 5);
	if (!is_resource($socketResource))
		echo "Error!";
?>
There we have it: the simplest possible code out of this script that fails. Am I blatantly missing something essential here or what? I can't even get to the point of sending the headers, because it can't connect to the site in the first place.

Posted: Sat Dec 04, 2004 12:19 am
by rehfeld
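Have you tried that code all by itself, in a separate file? If not, then do it.

Do other URLs work?

Maybe you have URL wrappers disabled (the allow_url_fopen setting), or maybe 5 seconds isn't long enough for it.

If fopen is failing too, I'm betting URL wrappers are disabled. Note that this setting only affects fopen() and file_get_contents(), though; fsockopen() does not depend on it.

Use phpinfo(); to find out.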
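
The same check can be done from code instead of reading through the phpinfo() output. This is a minimal sketch: `url_wrapper_status` is a hypothetical name, and it looks only at the allow_url_fopen setting and at whether the http wrapper is registered.

```php
<?php
// Hypothetical helper: report the settings that affect fopen() and file_get_contents() on URLs.
function url_wrapper_status()
{
    // ini_get() returns the setting as a string such as "1", "0", "On" or "".
    $setting = ini_get('allow_url_fopen');
    return array(
        'allow_url_fopen' => filter_var($setting, FILTER_VALIDATE_BOOLEAN),
        'http_wrapper'    => in_array('http', stream_get_wrappers(), true),
    );
}

// Example use:
// var_dump(url_wrapper_status());
```

If both values are true and the connection still fails, the cause is more likely on the network side (DNS, a firewall, or the remote server) than in the PHP configuration.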