PHP: Using remote proxies - Unreliability issues

Posted: Thu May 04, 2006 7:33 am
by krt
** Background Info **

I have a function that fetches content from a page using a random remote proxy. The proxy list is updated daily, so there should be no connectivity issues. If a proxy fails, the function returns false in $contents, and the calling page decides how many times to retry fetching that page with a different proxy. That part is all fine.

** The problem **

The function fails too often. The calling script reports several "Too many proxies tried" errors, yet a fraction of pages are fetched successfully. Can you suggest a possible cause for this?

Any help will be greatly appreciated and hopefully repaid :)

** The PHP code **

<?php

function get_new_proxy()
{
    // all you need to know is this function
    // gets a proxy in the format array([URL], [port])
}

// Fetch page, returns content and headers
function fetch($host, $url)
{
    global $dir;
    global $retries;
    static $current_proxy_fetches;
   
    // Attempt to connect to the proxy server to retrieve the remote page
    if (!@$current_proxy_fetches || $current_proxy_fetches++ > 10) {
        $current_proxy_fetches = 0;
        if (!ereg("-noproxy-?", $modifiers))
            list($proxy_address, $proxy_port) = get_new_proxy();
        if (!$socket = @fsockopen($proxy_address, $proxy_port, $errno, $errstring, 20)) {
            $filename = "{$dir['data']}/proxy_blacklist.txt";
            $fp = fopen($filename, 'a+');
            fwrite($fp, date("d/m/y H:i") . " $proxy_address:$proxy_port")
                or log_error("Could not write to file '$filename'");
            fclose($fp);
            $retries++;
            if ($retries < 3) {
                list($_proxy_address, $_proxy_port) = get_new_proxy();
                $contents = fetch($_domain, $_path, $_proxy_address, $_proxy_port);
                return $contents;
            }else{
                $retries = 0;
                return false;
            }
        }
        $current_proxy_fetches++;
    }
   
    // If socket connection successful, reset retries counter
    $retries = 0;

    // HTTP commands
    $headers  = "GET $url HTTP/1.1\r\n";
    $headers .= "Host: $host\r\n";
    $headers .= "Connection: Close\r\n";
    $headers .= "\r\n";
   
    // Init. $contents var
    $contents = "";
   
     // Get the contents
    if ($socket) {
        fwrite($socket, $headers);
       
        while (!feof($socket)){
            $contents .= fgets($socket, 128);
        }
       
        fclose($socket);
    }
   
    /* Contents contains both the html headers and the html of the page. */
    return $contents;
}

?>

Posted: Thu May 04, 2006 7:36 am
by s.dot
hrm, fetching a remote page through a proxy? Seems kinda suspicious.

anyhoo, turn error reporting on (if it isn't already)

ini_set('display_errors','On');
error_reporting(E_ALL);
and remove the @'s from your script.

look for some errors.
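
For the connection step in particular, here is a minimal sketch of what that could look like: call fsockopen() without the @ and log the $errno and $errstr it fills in. The helper names are made up for illustration and are not part of the posted script.

```php
<?php
// Hypothetical helper (not from the posted script): turns the values that
// fsockopen() fills in on failure into a readable log line.
function describe_connect_failure($address, $port, $errno, $errstr)
{
    // The PHP manual notes that errno 0 with a false return means the error
    // occurred before connect(), most likely while initializing the socket.
    if ($errno === 0) {
        return "$address:$port failed before connect(): $errstr";
    }
    return "$address:$port failed with errno $errno: $errstr";
}

// Hypothetical wrapper: same call as in the posted code, minus the @, so
// warnings are visible and the failure reason is logged.
function open_proxy_socket($address, $port, $timeout = 20)
{
    $errno = 0;
    $errstr = '';
    $socket = fsockopen($address, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        error_log(describe_connect_failure($address, $port, $errno, $errstr));
    }
    return $socket;
}
```

With the reasons logged, it should be easier to tell refused connections and timeouts apart from failures that never reach the proxy at all.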

Posted: Thu May 04, 2006 9:22 am
by Roja
As scottayy mentions, for anything automated that fails, run it manually, watch the output, and you'll have your answers.

Turn on error reporting, run it manually a few dozen times, and you'll know where the failures are.

Most likely it's due to networking issues, or the (un)reliability of public proxies.
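
Before blaming the network, it may also be worth tracing the guard at the top of the posted fetch(). As far as can be told from reading it, fsockopen() only runs inside that if block, so any call that skips the block has no $socket and returns the empty string, which a loose check such as `if (!$contents)` treats the same as false. The sketch below is a simulation only: it copies the counter logic, assumes every connection attempt succeeds, and counts how many calls would actually open a socket. The function name is hypothetical.

```php
<?php
// Simulation only: reproduces the counter guard from the posted fetch()
// without any network access. Assumes every connection attempt succeeds.
function count_calls_that_open_socket($total_calls)
{
    $current_proxy_fetches = null; // an uninitialised static starts as null
    $opened = 0;

    for ($i = 0; $i < $total_calls; $i++) {
        // Same condition as the posted code (the @ is not needed here).
        if (!$current_proxy_fetches || $current_proxy_fetches++ > 10) {
            $current_proxy_fetches = 0;
            $opened++;                  // fsockopen() would be called here
            $current_proxy_fetches++;   // mirrors the increment after a successful connect
        }
        // Calls that skip the block never open a socket.
    }

    return $opened;
}

echo count_calls_that_open_socket(22), "\n"; // prints 2
```

Under those assumptions only about one call in eleven opens a socket, which would be consistent with a small fraction of pages being fetched, though real proxy failures could still be contributing.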

Posted: Thu May 04, 2006 11:24 am
by Chris Corbyn
scottayy wrote: hrm, fetching a remote page through a proxy? Seems kinda suspicious.
Not really... depending upon how your network is configured the only gateway to the internet may be via a proxy ;)
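
If a proxy really is the only way out, the request line sent to it matters too. A minimal sketch, assuming $url in the posted code holds only a path: an HTTP proxy expects the full target URL in the request line (absolute-form), since it has to know which host to contact. The function name is hypothetical, and HTTP/1.0 is used here because the simple fgets() loop in the posted code would not decode a chunked response.

```php
<?php
// Hypothetical helper: builds a GET request meant to be written to an HTTP
// proxy socket. Assumes plain http:// targets and that $path starts with "/".
function build_proxy_request($host, $path)
{
    // Absolute-form request target: the proxy needs the full URL.
    $request  = "GET http://$host$path HTTP/1.0\r\n";
    $request .= "Host: $host\r\n";
    $request .= "Connection: close\r\n";
    $request .= "\r\n";
    return $request;
}

echo build_proxy_request('example.com', '/index.html');
```

If $url is already a full URL, the posted request line is fine as it stands and this does not apply.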