help with php cURL to get data from the site

parsek
Forum Newbie
Posts: 4
Joined: Mon Jul 03, 2006 1:14 pm

help with php cURL to get data from the site

Post by parsek »

I was using PHP cURL successfully to gather data from websites, but one site is returning garbled output like this: ‰‹dÈ/åLó²²ó{‡iÚºfzû;Tg`²ÿ ŠsäƒFB¢Ž‘à8Uù–¤e¦„Ò ý¦ðR¢|¬ä2¡1%
jmut
Forum Regular
Posts: 945
Joined: Tue Jul 05, 2005 3:54 am
Location: Sofia, Bulgaria

Re: help with php cURL to get data from the site

Post by jmut »

[quote="parsek"]I was using php cURL successfully to gather data from the website. But one site is returning ‰‹dÈ/åLó²²ó{‡iÚºfzû;Tg`²ÿ ŠsäƒFB¢Ž‘à8Uù–¤e¦„Ò ý¦ðR¢|¬ä2¡1%
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Post by onion2k »

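By the looks of things, the server you're getting the page from is using mod_gzip .. or some other compression system, so what you're seeing is the compressed response body. You'll need to stop advertising compression in the Accept-Encoding request header (leave it out, or send "Accept-Encoding: identity"), or let cURL decompress the response for you by setting CURLOPT_ENCODING.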
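If the garbled output really is a gzip-compressed body, it can be checked for and decoded after the fact. The following is a minimal sketch under that assumption: `decode_if_gzipped()` is a hypothetical helper name, it needs the zlib extension for `gzdecode()`, and it only handles gzip, not raw deflate.

```php
<?php
// Sketch only: decode_if_gzipped() is a hypothetical helper, not part of cURL.
// Requires the zlib extension for gzdecode().

/**
 * Return the decoded body if it starts with the gzip magic bytes,
 * otherwise return the body unchanged.
 */
function decode_if_gzipped($body) {
	// Every gzip stream begins with the two bytes 0x1f 0x8b.
	if (strncmp($body, "\x1f\x8b", 2) !== 0) {
		return $body;
	}
	// Suppress the warning gzdecode() raises on corrupt input.
	$decoded = @gzdecode($body);
	// Fall back to the raw body if the data was not valid gzip after all.
	return ($decoded === false) ? $body : $decoded;
}

// Alternative, assuming the cURL extension is loaded: let cURL negotiate and
// decompress by itself. An empty string means "all encodings cURL supports".
// curl_setopt($curl, CURLOPT_ENCODING, "");
```

The commented-out `CURLOPT_ENCODING` line is probably the simpler route when you control the cURL handle; the helper is mainly useful when the body has already been fetched by other code.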

This won't be a lot of help .. but what the hell! A while ago I started writing a cURL abstraction class .. it's not finished, but it works in its current state:

Code:

class CurlAbs {
	
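		var $version;          // class version, reported in the default user agent
		var $baseURL;          // prefix prepended to every requested URL
		var $cookie;           // path of the cookie jar file
		var $cookieexpirytime; // intended cookie lifetime in seconds (not used yet)
		var $agent;            // User-Agent string sent with each request
		var $debug;            // when true, _debug() prints messages
		var $curl;             // cURL handle, created on the first request
	
		// PHP 5+ constructor; a method named after the class is no longer
		// treated as a constructor in PHP 8.
		function __construct($baseURL,$ua="",$debug=0) {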
			$this->version = 0.1;
			$this->baseURL = $baseURL;
			$this->cookie = getcwd()."/cookie.txt";
			$this->cookieexpirytime = 3600; //1 hour
			$this->agent = $this->set_agent($ua);
			$this->debug = $debug;
			$this->_debug("CurlAbs initiated");
		}

		function curl($url,$post="") {
			if (!$this->curl) { $this->curl = curl_init(); }
			$curl = $this->curl;
			curl_setopt ($curl, CURLOPT_URL, $this->baseURL.$url);
			if (is_array($post)) {
				curl_setopt ($curl, CURLOPT_POST, TRUE);
				curl_setopt ($curl, CURLOPT_POSTFIELDS, $post);
			} else {
				curl_setopt ($curl, CURLOPT_POST, FALSE);
			}
			curl_setopt ($curl, CURLOPT_RETURNTRANSFER, 1);
			curl_setopt ($curl, CURLOPT_CONNECTTIMEOUT, 1);
			curl_setopt ($curl, CURLOPT_USERAGENT, $this->agent); 
			curl_setopt ($curl, CURLOPT_FOLLOWLOCATION, 1);
			curl_setopt ($curl, CURLOPT_COOKIEFILE, $this->cookie);
			curl_setopt ($curl, CURLOPT_COOKIEJAR, $this->cookie);
			$page = curl_exec($curl);
			if (curl_errno($curl)) { echo curl_error($curl); }
			return($page);
		}
		
		function close() {
			if ($this->curl) {
				curl_close($this->curl);
				$this->curl = null; // lets curl() open a fresh handle if called again
			}
		}
		
		function set_agent($ua="") {
			switch ($ua) {
				case "NS6": return "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.2) Gecko/20020508 Netscape6/6.1"; break; //NS6
				case "NS7": return "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)"; break; //NS7
				case "IE6": return "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"; break; //IE6
				case "IE7": return "Mozilla/4.0 (compatible; MSIE 7.0b; Win32)"; break; //IE7
				case "FF1": return "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6"; break; //FF1
				case "FF15": return "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8) Gecko/20051107 Firefox/1.5"; break; //FF15
				case "OP8": return "Opera/8.00 (Windows NT 5.1; U; en)"; break; //OP8
				case "OP9": return "Opera/9.00 (Macintosh; PPC Mac OS X; U; en)"; break; //OP9
				case "SA2": return "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/412 (KHTML, like Gecko) Safari/412"; break; //SA2
				case "CA1": return "Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.0.1) Gecko/20060214 Camino/1.0"; break; //CA1
				case "LYNX": return "Lynx/2.8.4rel.1 libwww-FM/2.14"; break; //LYNX
				default: return "CurlAbs/".$this->version." (PHP/cURL)"; break; //default
			}
		}
		
		function get_agent() {
			return $this->agent;
		}
		
		function _debug($message="") {
			if ($this->debug) {
				echo $message."<br>";
				flush();
			}
		}

	}
Example usage:

Code:

$c = new CurlAbs("http://www.ooer.com/","FF15",1);
$page = $c->curl("index.php");
$c->close();
echo $page;
It's purely an abstraction class .. the idea is to extend it with another class on top. For example, I've written a class to interact with IPB forum boards on top of it. Unfortunately there's quite a lot missing, like the ability to control more headers (encoding and language specifically..).
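The missing header control could be sketched roughly as follows. This is an illustration rather than part of the class above: `build_header_lines()` is a hypothetical helper, and the header values are example choices. It turns a name => value array into the "Name: value" strings that `CURLOPT_HTTPHEADER` expects.

```php
<?php
// Sketch only: build_header_lines() is a hypothetical helper, not part of CurlAbs.

/**
 * Convert array("Name" => "value") into array("Name: value", ...),
 * the list format that CURLOPT_HTTPHEADER takes.
 */
function build_header_lines(array $headers) {
	$lines = array();
	foreach ($headers as $name => $value) {
		$lines[] = $name . ": " . $value;
	}
	return $lines;
}

$lines = build_header_lines(array(
	"Accept-Encoding" => "identity", // ask for an uncompressed response
	"Accept-Language" => "en-GB,en", // preferred languages, example value
));

// Inside CurlAbs::curl() the list could then be applied with:
// curl_setopt($curl, CURLOPT_HTTPHEADER, $lines);
```

Keeping the headers as a name => value array until the last moment makes it easy for a subclass to override a single header before the request is sent.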
parsek
Forum Newbie
Posts: 4
Joined: Mon Jul 03, 2006 1:14 pm

solved one problem, getting another

Post by parsek »

I found where the problem was.
I commented out the line
$header_array[..] = "Accept-Encoding: gzip,deflate";
and now the data comes back readable.
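The fix works because a server should only compress the body when the request advertises support for it. As a sketch of the same fix done in code instead of by commenting a line out (the contents of `$header_array` below are made up for illustration, and `without_header()` is a hypothetical helper):

```php
<?php
// Sketch only: without_header() and the sample header list are hypothetical.

/**
 * Return the header lines with every "$name: ..." entry removed.
 */
function without_header(array $lines, $name) {
	$prefix = strtolower($name) . ":";
	$kept = array();
	foreach ($lines as $line) {
		// Header names are case-insensitive, so compare in lower case.
		if (strpos(strtolower(ltrim($line)), $prefix) !== 0) {
			$kept[] = $line;
		}
	}
	return $kept;
}

$header_array = array(
	"Accept: text/html",
	"Accept-Encoding: gzip,deflate",
	"Accept-Language: en-us,en;q=0.5",
);

// Drop the header that invites the server to send a compressed body.
$header_array = without_header($header_array, "Accept-Encoding");
```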

But here is another thing:
In the browser's status bar the page reports a load error and appears to still be loading, even though the page has actually loaded.
In Firefox the status flashes between "Done" and "Waiting for http://www.mysite.com".
In IE it says the page loaded with errors and the hourglass cursor appears.

Do you know what might be causing this?