
Need help fetching content.

Posted: Sun Feb 20, 2005 6:55 am
by pichi
Hi.

I'm using the following code to fetch content from a website.
Unfortunately, I can't make it work.
The website I'm trying to read from seems to redirect me twice; I can get to the desired page, but I get "junk" data instead of the actual content.
Could anyone please help me?

Thank you.

Code: Select all

<?php

class GetWebObject {
var $host = "";
var $port = "";
var $path = "";
var $header = array();
var $content = "";
var $redirect = "";
var $status = "";

function GetWebObject($host, $port, $path)
{
$this->host = $host;
$this->port = $port;
$this->path = $path;
$this->content = "";

$this->status = "x";
while (strpos($this->status,"HTTP/1.0 200 OK") === false and $this->status != '') {
$this->fetch();
}
}

function fetch()
{
$this->redirect = "";
$this->content = "";

$fp = fsockopen ($this->host, $this->port);

if(!$fp) {
die("No puede conectarse con Mercado Libre.");}

$header_done=false;

$request = "GET ".$this->path." HTTP/1.0\r\n";
$request .= "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)\r\n";
$request .= "Host: ".$this->host."\r\n";
$request .= "Connection: close\r\n\r\n";

fputs ($fp, $request);

$line = fgets ($fp, 128);
$this->status = $this->header["status"] = $line;
while (!feof($fp)) {
if($header_done) {
$linea = fread ( $fp, 1024 );
if (strpos($linea,'The URL has moved <a href="') == 0) {
$this->header["Location"] = substr($linea,strpos($linea,'The URL has moved <a href="')+strlen('The URL has moved <a href="'),(strpos($linea,'">')-strpos($linea,'The URL has moved <a href="')));
$this->header["Location"] = substr($this->header["Location"], 0, strpos($this->header["Location"], '">'));
break;
} else {
$this->content .= $linea;
}

} else {
$line = fgets ($fp, 128);
if($line == "\r\n") {
$header_done=true;}
else {
$data = explode(": ",$line);
$this->header[$data[0]] = $data[1];
}
}
}

fclose ($fp);

$this->path = $this->header["Location"];
}

}

$url = $_GET['url']; // Trailing slash when not using filename
$url = 'http://www.mercadolibre.com.ar/jm/pms?site=304970&id=2021&as_opt=http://www.mercadolibre.com.ar/jm/item?site=MLA$$id=15502430';

//'http://www.mercadolibre.com.ar/jm/item?site=MLA&id=15502430';
//'http://www.mercadolibre.com.ar/jm/pms?site=304970&id=2021&as_opt=http://www.mercadolibre.com.ar/jm/item?site=MLA$$id=15502430';

$url = substr($url,7);
$host = trim(substr($url,0,strpos($url,'/')));
$path = trim(substr($url,strpos($url,'/')));

$cont = 0;
$file = new GetWebObject($host, 80, $path);

print $file->status;
$lineas = split("\n",$file->content);

for ($i=0; $i<=count($lineas); $i++) {
$buffer = $lineas[$i];

if (strpos($buffer,'comprar y vender en MercadoLibre')) {
$buffer = str_replace('comprar y vender en MercadoLibre','comprar en www.ringtones.com.ar',$buffer);
}

$texto = 'El uso de este sitio';
if (strpos($buffer,$texto)) {
$buffer = str_replace($texto,'<a href=http://www.ringtones.com.ar target=_blank >Ringtones</a><br><br>'.$texto,$buffer);
}

$texto = '/jm/reg';
if (strpos($buffer,$texto)) {
$buffer = str_replace($texto,'http://www.mercadolibre.com.ar/argentina/ml/pms?site=304970&id=2021&go=RG',$buffer);
}

$texto = 'href=/';
if (strpos($buffer,$texto)) {
$buffer = str_replace($texto,'href=http://www.ringtones.com.ar/links.php?url='.urlencode('http://www.mercadolibre.com.ar/'),$buffer);
}

print $buffer;

}

?>
PHENOM | PLEASE USE CODE TAGS.

Posted: Sun Feb 20, 2005 9:13 am
by feyd
I hope you have permission from the site to use their data. :roll: Have you tried doing this with curl? It may help a lot, as it can perform a lot of the work you are doing for you.

Posted: Sun Feb 20, 2005 9:28 am
by smpdawg
Change this line

Code: Select all

if (strpos($linea,'The URL has moved <a href="') == 0) {
to this

Code: Select all

if (strpos($linea,'The URL has moved <a href="') === 0) {
== treats 0 and false as equal, but here they mean different things: strpos() returns 0 when the text is found at the very start of the string and false when it is not found at all. Use === (identical) to tell the two apart.
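
To make the difference concrete, here is a small sketch. The sample strings are made up for illustration and are not taken from the real response:

```php
<?php
// Hypothetical sample data, only to illustrate == versus === with strpos().
$needle       = 'The URL has moved <a href="';
$redirectBody = 'The URL has moved <a href="http://www.example.com/next">here</a>';
$normalBody   = '<html><body>Real page content</body></html>';

// strpos() returns int 0 when the needle starts the string,
// and boolean false when the needle is not found at all.
$found    = strpos($redirectBody, $needle);
$notFound = strpos($normalBody, $needle);

// Loose comparison treats both results as "equal to 0".
$looseFound    = ($found == 0);
$looseNotFound = ($notFound == 0);    // true, which is the bug

// Strict comparison only matches a real position 0.
$strictFound    = ($found === 0);
$strictNotFound = ($notFound === 0);  // false, as intended

var_dump($looseFound, $looseNotFound, $strictFound, $strictNotFound);
```

With ==, a normal page is mistaken for the redirect notice, so its content never reaches $this->content.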

and later change this

Code: Select all

for ($i=0; $i<=count($lineas); $i++) {
to this

Code: Select all

for ($i=0; $i<count($lineas); $i++) {
This one was simply because the loop ran up to index n (n+1 iterations), while the last valid index of an n-element array is n-1.
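
A quick sketch of the loop bounds, using a made-up three-line input rather than the fetched page:

```php
<?php
// Hypothetical three-line input, just to show the loop bounds.
$lineas = explode("\n", "uno\ndos\ntres");

// Valid indexes are 0 .. count($lineas) - 1, so the condition must be "<".
$visited = array();
for ($i = 0; $i < count($lineas); $i++) {
    $visited[] = $lineas[$i];
}

// foreach avoids the index arithmetic altogether.
$viaForeach = array();
foreach ($lineas as $linea) {
    $viaForeach[] = $linea;
}

var_dump($visited === $viaForeach);
```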

BTW - Here it is on my site. http://test.php-toolkit.com/fetch/fetch.php

Almost worked...

Posted: Sun Feb 20, 2005 12:56 pm
by pichi
When I use $_GET['url'] to pass the URL, instead of the hard-coded $url value I'm using now, it simply doesn't seem to work.

Could you please help me again??

Thank you!!

Posted: Sun Feb 20, 2005 12:58 pm
by feyd
Have you checked to make sure the URL comes across correctly? It's fairly easy to have it come across wonky.

Posted: Sun Feb 20, 2005 1:16 pm
by smpdawg
You need to pass any URL that you send to the script through urlencode() first.

Without encoding, PHP sees the & buried in the URL you supplied and treats whatever follows it as a new parameter to your script. So any link to the script must encode the URL; PHP decodes it automatically by the time it reaches your script.

So here is an example:

This WON'T work.
http://test.php-toolkit.com/fetch/fetch ... d=15502430

But this will work because the URL is encoded.
http://test.php-toolkit.com/fetch/fetch ... 3D15502430

And all I had to do was this when I made the link.

Code: Select all

$encodedURL = urlencode($URL);
Did that make sense? BTW - Clicking those links will demonstrate the problem and solution.
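
If you want to see the effect without clicking through, here is a rough sketch. It uses parse_str(), which parses a string as if it were a query string, to stand in for how $_GET gets filled; the example.com URL is made up for the demo:

```php
<?php
// Hypothetical target URL containing its own "&"-separated parameters.
$target = 'http://www.example.com/item?site=MLA&id=123';

// parse_str() splits a query string into an array of parameters.
parse_str('url=' . $target, $raw);
parse_str('url=' . urlencode($target), $encoded);

// Unencoded: the "&" ends the url parameter early, and "id" shows up
// as a separate parameter of the fetch script.
var_dump($raw['url'], $raw['id']);

// Encoded: the whole URL arrives intact and already decoded.
var_dump($encoded['url']);
```

In the unencoded case the script only ever sees the part of the URL before the first &.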

Still not working

Posted: Sun Feb 20, 2005 5:29 pm
by pichi
Thank you for your time.

I've done what you said, but I still get no page.
In fact, I copied the url= part into your site and got the following error:

Notice: Undefined offset: 1 in /var/www/html/test/fetch/fetch.php on line 62

repeated until the script timed out.

The url used was ?url=http%3A%2F%2Fwww.mercadolibre.com.ar%2Fargentina%2Fml%2Fpms%3Fsite%3D304970%26id%3D2021%26as_opt%3Dhttp%3A%2F%2Fwww.mercadolibre.com.ar%2Fjm%2Fitem%3Fsite%3DMLA%24%24id%3D15488168

Thank you very much for your time.

Posted: Sun Feb 20, 2005 9:02 pm
by smpdawg
Here is the problem that I located. Your code looped through the header and eventually found the end of the header. Unfortunately the file stream hit EOF and caused your code to dump out of the while and hit the fclose. Then it went back into your main object loop and did another fetch but this time it resumed the HTTP transfer where it left off and started reading the HTML as if it were header data. Does that make sense?

Here is the code that starts the chain reaction.

Code: Select all

if($line == "\r\n") {
$header_done=true;}
else {
$data = explode(": ",$line);
$this->header[$data[0]] = $data[1];
}
Perhaps you can do this. After you set header_done, check to see if eof is true. If it is, just do an fclose, open it back up and let the processing continue.
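
Another option is to read the whole response first and split it afterwards, so header and body reads can't get mixed up. The sketch below is only an assumption about how that could look, not the code running on my test site; the function name parseHttpResponse and the sample response are hypothetical. It also skips header lines that have no ": " separator, which is the situation that triggers the "Undefined offset: 1" notice.

```php
<?php
// Hypothetical helper: split a complete raw HTTP response into its parts.
function parseHttpResponse($raw)
{
    // Headers end at the first blank line (CRLF CRLF).
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $body = isset($parts[1]) ? $parts[1] : '';

    // The first line is the status line, e.g. "HTTP/1.0 302 Found".
    $status = array_shift($headerLines);

    $headers = array();
    foreach ($headerLines as $line) {
        $pair = explode(': ', $line, 2);
        // Only accept "Name: value" lines instead of reading $pair[1] blindly.
        if (count($pair) === 2) {
            $headers[$pair[0]] = $pair[1];
        }
    }

    return array('status' => $status, 'headers' => $headers, 'body' => $body);
}

// Made-up response for illustration.
$raw = "HTTP/1.0 302 Found\r\nLocation: http://www.example.com/next\r\nConnection: close\r\n\r\nThe URL has moved";
$response = parseHttpResponse($raw);
var_dump($response);
```

In the socket version, $raw would be everything read from the connection until feof() returns true.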

Thank you again

Posted: Mon Feb 21, 2005 6:34 am
by pichi
I've tried what you suggested, but couldn't make it work.

I would really appreciate it if you could please help me solve this!

Thank you.

Posted: Mon Feb 21, 2005 5:33 pm
by smpdawg
I haven't forgotten about you. I'll take a look at this and get back with you.

Posted: Mon Feb 21, 2005 11:53 pm
by smpdawg
Out of frustration I threw out most of your object and just wrapped it around CURL. Try this out.

Code: Select all

<?php

class GetWebObject {
    var $content = "";
    var $status = "";
    var $location = "";

    // __construct() is the PHP 5+ constructor name.
    // Fetches $url with cURL and lets it follow the redirects.
    function __construct($url, $user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)', $proxy = '') {
        $ch = curl_init();
        curl_setopt ($ch, CURLOPT_PROXY, $proxy);
        curl_setopt ($ch, CURLOPT_URL, $url);
        curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
        curl_setopt ($ch, CURLOPT_HEADER, 0);          // keep headers out of the returned content
        curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);  // return the page instead of printing it
        curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);  // follow Location: redirects
        curl_setopt ($ch, CURLOPT_TIMEOUT, 120);       // give up after 120 seconds
        $this->content = curl_exec($ch);
        $info = curl_getinfo($ch);
        $this->status = $info['http_code'];
        $this->location = $info['url'];                // final URL after redirects
        curl_close($ch);
    }
}

if (isset($_GET['url'])) {
    $url = $_GET['url']; // Trailing slash when not using filename
} else {
    $url = 'http://www.mercadolibre.com.ar/argentina/ml/pms?site=304970&id=2021&as_opt=http://www.mercadolibre.com.ar/jm/item?site=MLA$$id=15488168';
}

$cont = 0;
$file = new GetWebObject($url);

echo "{$file->location}<br>";
echo "{$file->status}<br>";

// explode() replaces split(), which was removed in PHP 7.
$lineas = explode("\n",$file->content);

for ($i=0; $i<count($lineas); $i++) {
    $buffer = $lineas[$i];

    // Compare against false so a match at position 0 still counts.
    if (strpos($buffer,'comprar y vender en MercadoLibre') !== false) {
        $buffer = str_replace('comprar y vender en MercadoLibre','comprar en www.ringtones.com.ar',$buffer);
    }

    $texto = 'El uso de este sitio';
    if (strpos($buffer,$texto) !== false) {
        $buffer = str_replace($texto,'<a href=http://www.ringtones.com.ar target=_blank >Ringtones</a><br><br>'.$texto,$buffer);
    }

    $texto = '/jm/reg';
    if (strpos($buffer,$texto) !== false) {
        $buffer = str_replace($texto,'http://www.mercadolibre.com.ar/argentina/ml/pms?site=304970&id=2021&go=RG',$buffer);
    }

    $texto = 'href=/';
    if (strpos($buffer,$texto) !== false) {
        $buffer = str_replace($texto,'href=http://www.ringtones.com.ar/links.php?url='.urlencode('http://www.mercadolibre.com.ar/'),$buffer);
    }

    print $buffer;
}

?>