Pull data from website
I would like to know what keyword I should use to search this forum for how to pull data from another website and display it on my own site.
Thanks ya.
This is assuming you are trying to request a web page and not just a file on a remote server.
1. You are going to use fsockopen() to open a connection to the remote server.
2. Then fputs() to send the request string (requesting whatever page you want) to the web server.
3. Then a while loop with fread() until you hit the end of the file being sent.
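The three steps above can be sketched roughly as follows. This is a minimal sketch, not the class posted further down: buildRequest(), readAll() and fetchRaw() are hypothetical names, and port 80, the 5-second connect timeout and the "Connection: close" header are assumptions made for the example.

```php
<?php
// Hypothetical helper names; fsockopen(), fwrite(), fread(), feof() and
// fclose() are PHP built-ins (fputs() is an alias of fwrite()).

// Step 2: assemble the request text for one page.
function buildRequest($host, $path) {
    return "GET $path HTTP/1.0\r\n"
         . "Host: $host\r\n"
         . "Connection: close\r\n\r\n"; // the blank line ends the headers
}

// Step 3: keep reading until the other end has sent everything.
function readAll($stream, $chunkSize = 1024) {
    $data = '';
    while (!feof($stream)) {
        $chunk = fread($stream, $chunkSize);
        if ($chunk === false) {
            break; // read error, stop with what we have
        }
        $data .= $chunk;
    }
    return $data;
}

// Steps 1 to 3 together. Returns headers and body as one string, or null
// when the connection could not be opened.
function fetchRaw($host, $path = '/') {
    $socket = @fsockopen($host, 80, $errNo, $errStr, 5); // step 1
    if (!is_resource($socket)) {
        return null;
    }
    fwrite($socket, buildRequest($host, $path));         // step 2
    $response = readAll($socket);                        // step 3
    fclose($socket);
    return $response;
}
```

Calling fetchRaw('www.example.com', '/index.html') would return the raw response, status line and headers included, so the headers still have to be stripped before the page itself can be used.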
I recently wrote a spider that goes out and parses pages from web sites, so I'm pretty familiar with what you are trying to accomplish. I couldn't find any good tutorials to help me out, so I had to do all the legwork myself.
You need to become familiar with HTTP (the protocol that runs on top of the TCP connection); this is how you will know how to assemble the request string and how to interpret the status codes returned by the server you're contacting.
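As an illustration of interpreting what the server returns, here is a small sketch that splits the first line of a response into its parts. parseStatusLine() and describeStatus() are hypothetical names for this example; they are not part of the classes posted below.

```php
<?php
// Hypothetical helpers for reading the first line of an HTTP response.

// Split a status line such as "HTTP/1.0 200 OK" into its parts.
// Returns null when the line does not look like an HTTP status line.
function parseStatusLine($line) {
    $pattern = '#^HTTP/(\d+\.\d+) (\d{3})(?: (.*))?$#i';
    if (!preg_match($pattern, trim($line), $matches)) {
        return null;
    }
    return array(
        'version' => $matches[1],
        'code'    => (int) $matches[2],
        'reason'  => isset($matches[3]) ? $matches[3] : '',
    );
}

// Describe a status code by its class, which is given by the first digit.
function describeStatus($code) {
    $classes = array(
        1 => 'informational',
        2 => 'success',
        3 => 'redirect',
        4 => 'client error',
        5 => 'server error',
    );
    $class = (int) floor($code / 100);
    return isset($classes[$class]) ? $classes[$class] : 'unknown';
}
```

A fetcher can use the numeric code to decide whether to read the body (2xx), look for a Location header (3xx), or give up (4xx and 5xx).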
I will tell you that it's no trivial task and I will post the class that I wrote, and the two classes that it depends on, so you can see what I did. Hopefully that will help you out, considering all the trouble I went through to get this class written.
Web Page Fetcher
Simple Timer
URL Manipulator
Web Page Fetcher
Code: Select all
<?php
require_once 'ES_SimpleTimer.php';
require_once 'ES_URLManipulator.php';

class ES_WebPageFetcher {

    // Fetch a page over plain HTTP on port 80.
    // Returns the page body as a string, or NULL on any failure.
    function fetch($webPageURL) {
        $maxTime = 15;        // overall time budget, in seconds
        $packetSize = 1024;   // maximum bytes per read
        $streamTimeOut = 5;   // connect and per-read timeout, in seconds

        $urlManipulator = new ES_URLManipulator();
        $timer = new ES_SimpleTimer();
        $timer->start();

        // extract domain
        $domain = $urlManipulator->extractDomain($webPageURL);

        // remove protocol
        $protocolessURL = $urlManipulator->removeProtocol($webPageURL);

        // connect to the server
        $socketResource = @fsockopen($domain, 80, $errNo, $errStr, $streamTimeOut);
        if (!is_resource($socketResource)) {
            return NULL;
        }

        // set time out
        stream_set_timeout($socketResource, $streamTimeOut);

        // send the request; "\n\n" ends the header block
        // (strictly, HTTP calls for "\r\n" line endings, though most servers accept "\n")
        $socketRequest = 'GET http://'.$protocolessURL.' HTTP/1.0'."\n";
        $socketRequest .= 'Host: '.$domain."\n";
        $socketRequest .= 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'."\n\n";
        fputs($socketResource, $socketRequest);

        // receive status line
        $headContent = trim(fgets($socketResource, $packetSize));
        if ($this->isOutOfTime($socketResource, $timer, $maxTime)) {
            fclose($socketResource);
            return NULL;
        }

        // accept only 200 and 302 responses
        $statusPattern = '#^http/\d+\.\d+ (200|302)\b#i';
        if (preg_match($statusPattern, $headContent) != 1) {
            fclose($socketResource);
            return NULL;
        }

        // remove the rest of the header (it ends at the first blank line)
        while ($headContent !== '') {
            $headContent = trim(fgets($socketResource, $packetSize));
            if ($this->isOutOfTime($socketResource, $timer, $maxTime)) {
                fclose($socketResource);
                return NULL;
            }
        }

        // get the page contents
        $contents = fgets($socketResource, $packetSize);
        if ($this->isOutOfTime($socketResource, $timer, $maxTime)) {
            fclose($socketResource);
            return NULL;
        }

        // drop the first body line unless it starts with a tag
        $goodOpeningPattern = '#^<#';
        if (!preg_match($goodOpeningPattern, $contents)) {
            $contents = '';
        }

        while (!feof($socketResource)) {
            $contents .= fread($socketResource, $packetSize);
            if ($this->isOutOfTime($socketResource, $timer, $maxTime)) {
                fclose($socketResource);
                return NULL;
            }
        }

        fclose($socketResource);
        return $contents;
    }

    // TRUE when the last read timed out or the overall time budget is spent.
    function isOutOfTime($socketResource, $timer, $maxTime) {
        $streamMetaData = stream_get_meta_data($socketResource);
        // stream_get_meta_data() reports read timeouts under the 'timed_out' key
        return $streamMetaData['timed_out'] || $timer->fetchRunningTime() > $maxTime;
    }
}
?>
Simple Timer
Code: Select all
<?php
class ES_SimpleTimer {
    var $startTime;
    var $stopTime;

    function start() {
        // microtime(true) returns the current time as a float, in seconds
        $this->startTime = microtime(true);
    }

    function stop() {
        $this->stopTime = microtime(true);
    }

    // Seconds between start() and stop(), to 5 decimal places.
    function fetchTime() {
        $totalTime = $this->stopTime - $this->startTime;
        // no thousands separator, so the result stays a numeric string
        return number_format($totalTime, 5, '.', '');
    }

    // Seconds elapsed since start(), to 5 decimal places.
    function fetchRunningTime() {
        $elapsedTime = microtime(true) - $this->startTime;
        return number_format($elapsedTime, 5, '.', '');
    }
}
?>
URL Manipulator
Code: Select all
<?php
class ES_URLManipulator {

    // Return the host part of a URL, e.g. "www.site.com".
    function extractDomain($url) {
        // remove protocol
        $url = $this->removeProtocol($url);
        // extract domain: everything before the first slash
        $urlParts = preg_split('#/#', $url, -1, PREG_SPLIT_NO_EMPTY);
        return isset($urlParts[0]) ? $urlParts[0] : '';
    }

    // Strip a leading protocol such as "http://" from a URL.
    function removeProtocol($url) {
        // anchored at the start so a URL inside the query string is left alone
        $pattern = '#^(http|https|ftp|ftps|smb)://#i';
        return preg_replace($pattern, '', $url);
    }
}
?>
I'm trying to use your code... It's returning NULL here. What can I do to prevent this, if there is anything to be done?
Code: Select all
<?php
if (!is_resource($socketResource)) {
    return NULL;
}
?>
Okay, here is the code in question. You have to look at the line directly above it to see why it's returning NULL there.
I'm creating a socket connection to some server. I have an @ in front of the fsockopen() call because I need the script to keep running even if it hits an error while trying to connect. Remove the @ in front of fsockopen() and see what error is being generated.
You could also place some print statements immediately after the fsockopen() line to print the $errNo and $errStr being generated.
Code: Select all
<?php
// connect to the server
$socketResource = @fsockopen($domain, 80, $errNo, $errStr, $streamTimeOut);
if (!is_resource($socketResource)) {
    return NULL;
}
?>
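The advice about printing $errNo and $errStr could look something like the sketch below. formatConnectError() and openOrReport() are hypothetical names, and the message wording is made up for the example. The @ is kept here because fsockopen() still fills in both variables when it fails.

```php
<?php
// Hypothetical helpers; fsockopen() fills $errNo and $errStr by reference.

// Turn fsockopen()'s error outputs into a readable message.
function formatConnectError($host, $port, $errNo, $errStr) {
    if ($errNo === 0) {
        // The PHP manual notes that an error number of 0 on a failed call means
        // the error happened before connect(), most likely while setting up
        // the socket.
        return "Could not connect to $host:$port (failed before connect())";
    }
    return "Could not connect to $host:$port (error $errNo: $errStr)";
}

// Try to connect, and print why it failed instead of failing silently.
function openOrReport($host, $port = 80, $timeOut = 5) {
    $errNo = 0;
    $errStr = '';
    $socketResource = @fsockopen($host, $port, $errNo, $errStr, $timeOut);
    if (!is_resource($socketResource)) {
        echo formatConnectError($host, $port, $errNo, $errStr), "\n";
        return NULL;
    }
    return $socketResource;
}
```

Swapping openOrReport($domain) in for the bare fsockopen() call while debugging shows whether the failure is a refused connection, a timeout, or something earlier such as a host name that does not resolve.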
Last edited by EricS on Fri Dec 03, 2004 4:21 pm, edited 1 time in total.
A couple other things to keep in mind.
1. This script is solely for fetching pages from a web server. It shouldn't be used to open pages on remote servers that aren't being served by web server software such as Apache or IIS.
2. It always connects to port 80, so it handles plain HTTP only.
3. I realize this could be accomplished much more easily with cURL, but it's not available on the server this script is running on. (I'm working on a cURL version of this script; I will post it when it's finished if people would like to see it.)
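For comparison, a cURL-based fetch might look roughly like this. It is only a sketch and not the cURL version promised above: fetchWithCurl() is a hypothetical name, it needs the cURL extension, and following redirects and accepting only a final 200 status are assumptions that differ slightly from the socket version.

```php
<?php
// Hypothetical function; requires PHP's cURL extension.
// Returns the page body as a string, or null on any failure.
function fetchWithCurl($url, $maxTime = 15, $connectTimeOut = 5) {
    $handle = curl_init($url);
    if ($handle === false) {
        return null;
    }
    curl_setopt_array($handle, array(
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // assumption: follow 302 redirects
        CURLOPT_CONNECTTIMEOUT => $connectTimeOut, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $maxTime,        // seconds allowed for the whole transfer
        CURLOPT_USERAGENT      => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)',
    ));
    $body = curl_exec($handle);
    $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    // the handle is released when $handle goes out of scope
    if ($body === false || $status != 200) {
        return null;
    }
    return $body;
}
```

cURL handles the header parsing, timeouts and chunked responses itself, which is most of what the socket version has to do by hand.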
The site I'm trying is a regular http://www.site.com/ URL... is there a specific reason it could get blocked, as happened to me?
Exactly what error are you getting?
There are a few things you can easily modify that might let you through.
In ES_WebPageFetcher, the Host header in the request string is set to the domain of the page you're trying to get. You can change that host to anything you want.
Also, the request line specifies HTTP/1.0; you could try HTTP/1.1 instead. That comes with its own downsides though. The HTTP/1.1 specification states that a client must be able to handle chunked data. My script doesn't, as hard as I tried to get it to, and some downloads using HTTP/1.1 get kind of slow, which is why I used 1.0; it just seems to work faster.
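For reference, decoding a chunked body can be sketched as below. decodeChunkedBody() is a hypothetical helper, not part of the posted classes, and it assumes the whole body has already been read into a string. One possible reason for the slowness with HTTP/1.1 is that connections are persistent by default, so a loop that reads until feof() may sit waiting for the timeout unless the request also sends "Connection: close".

```php
<?php
// Hypothetical helper: decode a body sent with "Transfer-Encoding: chunked".
// Each chunk is "<size in hex>[;extensions]\r\n<data>\r\n"; a size of 0 ends the body.
// Returns the decoded body, or null when the input is malformed or truncated.
function decodeChunkedBody($raw) {
    $decoded = '';
    $offset = 0;
    $length = strlen($raw);
    while ($offset < $length) {
        $lineEnd = strpos($raw, "\r\n", $offset);
        if ($lineEnd === false) {
            return null; // size line is not terminated
        }
        $sizeLine = substr($raw, $offset, $lineEnd - $offset);
        // ignore optional chunk extensions after ';'
        $parts = explode(';', $sizeLine, 2);
        $sizeHex = trim($parts[0]);
        if ($sizeHex === '' || !ctype_xdigit($sizeHex)) {
            return null; // not a hexadecimal size
        }
        $size = hexdec($sizeHex);
        if ($size === 0) {
            return $decoded; // last chunk; any trailer headers are ignored
        }
        $dataStart = $lineEnd + 2;
        if ($dataStart + $size > $length) {
            return null; // chunk is cut short
        }
        $decoded .= substr($raw, $dataStart, $size);
        $offset = $dataStart + $size + 2; // skip the data and its trailing CRLF
    }
    return null; // ran out of data before the terminating 0-size chunk
}
```

A fetcher would only call this when the response headers contain "Transfer-Encoding: chunked"; otherwise the body is used as received.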
Lastly, you could try modifying the User-Agent header, although that's a very common user agent, which is why I specified it.
Hope that helps,
EricS
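The three tweaks suggested above (Host, protocol version, User-Agent) could be made adjustable with something like the following sketch. buildCustomRequest() and its option names are hypothetical; the default User-Agent is the one from the posted class, and the "\r\n" line endings follow the HTTP specification rather than the "\n" used in the posted code.

```php
<?php
// Hypothetical helper: assemble a GET request with an adjustable protocol
// version, Host, User-Agent and any extra headers.
function buildCustomRequest($url, $options = array(), $extraHeaders = array()) {
    $defaults = array(
        'version'   => '1.0',
        'host'      => parse_url($url, PHP_URL_HOST),
        'userAgent' => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)',
    );
    $options = array_merge($defaults, $options);

    $lines = array(
        'GET '.$url.' HTTP/'.$options['version'],
        'Host: '.$options['host'],
        'User-Agent: '.$options['userAgent'],
    );
    foreach ($extraHeaders as $name => $value) {
        $lines[] = $name.': '.$value;
    }
    // one header per line; the empty line marks the end of the headers
    return implode("\r\n", $lines)."\r\n\r\n";
}
```

For example, buildCustomRequest($url, array('version' => '1.1'), array('Connection' => 'close')) switches to HTTP/1.1 while asking the server to close the connection once the response has been sent.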
I get the errors on line 22... so I'm not even getting there yet
Try out http://www.mechg.com... it's a URL that I have found won't work... any more ideas? Is there something people can do to block connections to their site? cURL, fopen and fsockopen all fail.
Of course. Just send the headers that Firefox sends.
You can send as many headers as you like. In the Web Page Fetcher part of the code, where you see this:
Code: Select all
$socketRequest = 'GET http://'.$protocolessURL.' HTTP/1.0'."\n";
$socketRequest .= 'Host: '.$domain."\n";
$socketRequest .= 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'."\n\n";
just add more headers in. Each header must be on its own line.
"\n\n" signifies the end of the headers, so you'll need to put additional headers before the line that has \n\n, each one ending in a single \n.
To see which headers a browser is sending, make a new PHP file with the following:
Code: Select all
<?php
print_r(apache_request_headers());
?>
Code: Select all
<?php
$socketResource = @fsockopen("www.mechg.com", 80, $errNo, $errStr, 5);
if (!is_resource($socketResource)) {
    echo "Error!";
}
?>