How can I replicate browser results with code?

Posted: Mon Jun 21, 2010 12:58 pm
by korail
Hello,

I have the following code, which is designed to visit each of several URLs, save the HTML for further use (parsing out various data snippets), then move on to the next one. The problem is that when I manually type one of the URLs into a browser I get one of three different outcomes, but the code only ever seems to produce one of them.

The first outcome (http://tiny.cc/gf07f) is what happens when you first enter one of the URLs: you get a login screen. If you then refresh the page or enter another similar URL at the same domain, you get one of two other outcomes. (These URLs end in a four-digit number, which is the number of a locomotive running on the Korean National Railroad; which of the two outcomes you get depends on whether or not that locomotive is currently working a train.)

Now, the problem I'm having is that if I run the code listed below I only ever get the HTML back for one of these outcomes, never the others. The code below is only set up for testing purposes for one URL at the moment. Does anyone know why I can get the other outcomes when I manually enter URLs into the browser, but only the one outcome with the code? Can anyone with more than my complete noob programming skills solve this?

TIA, Damian

Code:

<?php

// RUN TEST


$test_url1 = 'http://logis.korail.go.kr/getcarinfo.do?car_no=7001';
$test_contents1 = file_get_contents($test_url1);
echo strlen ($test_contents1);
echo '<br>';
sleep(5);
$test_url2 = 'http://logis.korail.go.kr/getcarinfo.do?car_no=7002';
$test_contents2 = file_get_contents($test_url2);
echo strlen ($test_contents2);
echo '<br>';


// FLEET TO BE LOOKED UP
$loco[] = "7001";/*
$loco[] = "7002";
$loco[] = "7003";
$loco[] = "7004";
$loco[] = "7005";
$loco[] = "7006";
$loco[] = "7007";
$loco[] = "7008";
$loco[] = "7009";
$loco[] = "7010";
$loco[] = "7011";
$loco[] = "7012";
$loco[] = "7013";
$loco[] = "7014";
$loco[] = "7015";*/


// GET URL DATA
function request_callback($response, $info) {
    // WRITE RESPONSE DATA TO FILE
    // NOTE: mode 'w' truncates the file, so each response overwrites the previous one
    $myFile = "testFile.txt";
    $fh = fopen($myFile, 'w') or die("can't open file");
    fwrite($fh, $response);
    fclose($fh);
}


// REQUIRE ROLLING CURL
require("RollingCurl.php");


// POPULATE URL ARRAY
$url_p1 = 'http://logis.korail.go.kr/getcarinfo.do?car_no=';
$url_p2 = '&cntr_no=';
$urls = array();
foreach ($loco as $number) {
    $urls[] = $url_p1 . $number . $url_p2;
}


// FETCH URLS
$rc = new RollingCurl("request_callback");
$rc->window_size = 20;
foreach ($urls as $url) {
    $request = new Request($url);
    $rc->add($request);
}
$rc->execute();

?>

Code:

<?php

/*
Authored by Josh Fraser (http://www.joshfraser.com)
Released under Apache License 2.0

Maintained by Alexander Makarov, http://rmcreative.ru/

$Id$
*/

/**
 * Class that represents a single curl request
 */
class Request {
        public $url = false;
        public $method = 'GET';
        public $post_data = null;
        public $headers = null;
        public $options = null;

    /**
     * @param string $url
     * @param string $method
     * @param  $post_data
     * @param  $headers
     * @param  $options
     * @return void
     */
    function __construct($url, $method = "GET", $post_data = null, $headers = null, $options = null) {
        $this->url = $url;
        $this->method = $method;
        $this->post_data = $post_data;
        $this->headers = $headers;
        $this->options = $options;
    }
}

/**
 * RollingCurl custom exception
 */
class RollingCurlException extends Exception {}

/**
 * Class that holds a rolling queue of curl requests.
 *
 * @throws RollingCurlException
 */
class RollingCurl {
    /**
     * @var int
     *
     * Window size is the max number of simultaneous connections allowed.
         *
     * REMEMBER TO RESPECT THE SERVERS:
     * Sending too many requests at one time can easily be perceived
     * as a DOS attack. Increase this window_size if you are making requests
     * to multiple servers or have permission from the receiving server admins.
     */
    private $window_size = 5;

    /**
     * @var string|array
     *
     * Callback function to be applied to each result.
     */
    private $callback;

    /**
     * @var array
     *
     * Set your base options that you want to be used with EVERY request.
     */
    protected $options = array(
                CURLOPT_SSL_VERIFYPEER => 0,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_CONNECTTIMEOUT => 30,
        CURLOPT_TIMEOUT => 30
        );
       
    /**
     * @var array
     */
    private $headers = array();

    /**
     * @var Request[]
     *
     * The request queue
     */
    private $requests = array();

    /**
     * @param  $callback
     * Callback function to be applied to each result.
     *
     * Can be specified as 'my_callback_function'
     * or array($object, 'my_callback_method').
     *
     * Function should take two parameters: $response, $info.
     * $response is response body, $info is additional curl info.
     *
     * @return void
     */
        function __construct($callback = null) {
        $this->callback = $callback;
    }

    /**
     * @param string $name
     * @return mixed
     */
    public function __get($name) {
        return (isset($this->{$name})) ? $this->{$name} : null;
    }

    /**
     * @param string $name
     * @param mixed $value
     * @return bool
     */
    public function __set($name, $value){
        // append the base options & headers
        if ($name == "options" || $name == "headers") {
            $this->{$name} = $this->{$name} + $value;
        } else {
            $this->{$name} = $value;
        }
        return true;
    }

    /**
     * Add a request to the request queue
     *
     * @param Request $request
     * @return bool
     */
    public function add($request) {
         $this->requests[] = $request;
         return true;
    }

    /**
     * Create new Request and add it to the request queue
     *
     * @param string $url
     * @param string $method
     * @param  $post_data
     * @param  $headers
     * @param  $options
     * @return bool
     */
    public function request($url, $method = "GET", $post_data = null, $headers = null, $options = null) {
         $this->requests[] = new Request($url, $method, $post_data, $headers, $options);
         return true;
    }

    /**
     * Perform GET request
     *
     * @param string $url
     * @param  $headers
     * @param  $options
     * @return bool
     */
    public function get($url, $headers = null, $options = null) {
        return $this->request($url, "GET", null, $headers, $options);
    }

    /**
     * Perform POST request
     *
     * @param string $url
     * @param  $post_data
     * @param  $headers
     * @param  $options
     * @return bool
     */
    public function post($url, $post_data = null, $headers = null, $options = null) {
        return $this->request($url, "POST", $post_data, $headers, $options);
    }

    /**
     * Execute the curl
     *
     * @param int $window_size Max number of simultaneous connections
     * @return string|bool
     */
    public function execute($window_size = null) {
        // rolling curl window must always be greater than 1
        if (sizeof($this->requests) == 1) {
            return $this->single_curl();
        } else {
            // start the rolling curl. window_size is the max number of simultaneous connections
            return $this->rolling_curl($window_size);
        }
    }

    /**
     * Performs a single curl request
     *
     * @access private
     * @return string
     */
    private function single_curl() {
        $ch = curl_init();              
        $options = $this->get_options(array_shift($this->requests));
        curl_setopt_array($ch,$options);
        $output = curl_exec($ch);
        $info = curl_getinfo($ch);


        // it's not necessary to set a callback for one-off requests
        if ($this->callback) {
            $callback = $this->callback;
            if (is_callable($callback)) {
                call_user_func($callback, $output, $info);
            }
        } else {
            return $output;
        }
    }

    /**
     * Performs multiple curl requests
     *
     * @access private
     * @throws RollingCurlException
     * @param int $window_size Max number of simultaneous connections
     * @return bool
     */
    private function rolling_curl($window_size = null) {
        if ($window_size)
            $this->window_size = $window_size;

        // make sure the rolling window isn't greater than the # of urls
        if (sizeof($this->requests) < $this->window_size)
            $this->window_size = sizeof($this->requests);
       
        if ($this->window_size < 2) {
            throw new RollingCurlException("Window size must be greater than 1");
        }

        $master = curl_multi_init();        

        // start the first batch of requests
        for ($i = 0; $i < $this->window_size; $i++) {
            $ch = curl_init();

            $options = $this->get_options($this->requests[$i]);

            curl_setopt_array($ch,$options);
            curl_multi_add_handle($master, $ch);
        }

        do {
            while(($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
            if($execrun != CURLM_OK)
                break;
            // a request was just completed -- find out which one
            while($done = curl_multi_info_read($master)) {

                // get the info and content returned on the request
                $info = curl_getinfo($done['handle']);
                $output = curl_multi_getcontent($done['handle']);

                // send the return values to the callback function.
                $callback = $this->callback;
                if (is_callable($callback)){
                    call_user_func($callback, $output, $info);
                }

                // start a new request (it's important to do this before removing the old one)
                if ($i < sizeof($this->requests) && isset($this->requests[$i])) {
                    $ch = curl_init();
                    $options = $this->get_options($this->requests[$i++]);
                    curl_setopt_array($ch,$options);
                    curl_multi_add_handle($master, $ch);
                }

                // remove the curl handle that just completed
                curl_multi_remove_handle($master, $done['handle']);

            }
        } while ($running);
        curl_multi_close($master);
        return true;
    }


    /**
     * Helper function to set up a new request by setting the appropriate options
     *
     * @access private
     * @param Request $request
     * @return array
     */
    private function get_options($request) {
        // options for this entire curl object
        $options = $this->__get('options');
        if (ini_get('safe_mode') == 'Off' || !ini_get('safe_mode')) {
            $options[CURLOPT_FOLLOWLOCATION] = 1;
            $options[CURLOPT_MAXREDIRS] = 5;
        }
        $headers = $this->__get('headers');

        // append custom options for this specific request
        if ($request->options) {
            $options += $request->options;
        }

        // set the request URL
        $options[CURLOPT_URL] = $request->url;

        // posting data w/ this request?
        if ($request->post_data) {
            $options[CURLOPT_POST] = 1;
            $options[CURLOPT_POSTFIELDS] = $request->post_data;
        }
        if ($headers) {
            $options[CURLOPT_HEADER] = 0;
            $options[CURLOPT_HTTPHEADER] = $headers;
        }

        return $options;
    }

    /**
     * @return void
     */
    public function __destruct() {
        unset($this->window_size, $this->callback, $this->options, $this->headers, $this->requests);
        }
}


?>

Re: How can I replicate browser results with code?

Posted: Mon Jun 21, 2010 2:53 pm
by AbraCadaver
Most likely the site is setting a cookie and your code doesn't accept/provide back a cookie. This should be fairly easy with curl.
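
As a rough illustration of the cookie idea, here is a minimal sketch using plain curl rather than RollingCurl. The function name fetch_in_session is hypothetical (it is not part of RollingCurl or PHP), and the sketch assumes the site only needs the cookie from the first response to be sent back with later requests. If the site actually requires a real login, this alone will not be enough. It fetches the URLs one after another on a single curl handle, so cookies set by an earlier response are returned with the next request, much as a browser does.

```php
<?php
// Hypothetical helper (not part of RollingCurl): fetch several URLs in order
// on ONE curl handle so that cookies set by an earlier response are sent
// back with later requests.
function fetch_in_session(array $urls) {
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_COOKIEFILE     => '',   // empty string: enable cookie handling, keep cookies in memory
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => 30,
    ));

    $pages = array();
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $pages[$url] = curl_exec($ch);  // HTML as a string, or false on failure
    }
    return $pages;
}
```

Calling fetch_in_session() with the 7001 and 7002 URLs from the first post, in that order, and comparing strlen() of the two results would show whether the second request comes back different from the first once the cookie is being sent. That is a way to check the cookie theory, not a guarantee that it is the cause.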

Re: How can I replicate browser results with code?

Posted: Mon Jun 21, 2010 3:28 pm
by korail
It may be fairly easy, but to a basically non-programmer like me it's relatively difficult. My only experience with cookies comes from either eating them or deleting them. Can anyone give me some tips on where to start? I'm willing to try and learn, but would like to be aimed in the right direction first.