Grab the source of an HTML page and display it

MinDFreeZ
Forum Commoner
Posts: 58
Joined: Tue Feb 14, 2006 12:28 pm
Location: Lake Mary, FL

Post by MinDFreeZ »

Alright, let me explain the situation.

There are a lot of websites I can't access from work, and the only important part is the text on the page. So I'm wondering if it's possible to create something that can connect to the website and grab the HTML source, or just the text of the page, and echo it or create a file with it. I found this on some site, but I don't know if it can help me get started. It's supposed to grab the links on a page:

Code: Select all

<?php

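// Fetch a page and list every HREF attribute found in it.
// Opening a URL with fopen() needs allow_url_fopen enabled in php.ini.
$URL = "http://www.thewebsite.com/thepage";
$page = @fopen($URL, "r");
if ($page === false) {
    die("Could not open $URL");
}
print("Links at $URL<BR>\n");
print("<UL>\n");
while (($line = fgets($page)) !== false) {
    // eregi() and ereg_replace() were removed in PHP 7, so use preg_match_all()
    // with the i (case-insensitive) flag to collect every match on the line.
    if (preg_match_all('/HREF="[^"]*"/i', $line, $matches)) {
        foreach ($matches[0] as $match) {
            print("<LI>");
            print(htmlspecialchars($match));
            print("<BR>\n");
        }
    }
}
print("</UL>\n");
fclose($page);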

?>
Unless there's some way to set something up on my server that I can use to get to other websites...

Does anyone know of a way to do this?
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

You would have to write a script that lives on a server that is acceptable at work. That script would have to be able to generate HTML output of the page(s) you are looking for. Then you would have to break down the HTML to a point that you know exists around the segment of text you want. Trash what's before and what's after, and you have your text.

You may want to look into the legal aspects of this. This process, while acceptable to some sites, may get you into legal trouble with others. Be careful when using this type of process.
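
A minimal sketch of the "trash what's before and what's after" step, assuming you already have the page's HTML in a string and know two markers that surround the text you want. The function name `extract_between` and the comment markers in the example are made up for illustration; they are not from any site or library.

```php
<?php
// Hypothetical helper (not from this thread): return the plain text found
// between two known markers in a page, or null if either marker is missing.
function extract_between(string $html, string $startMarker, string $endMarker): ?string
{
    $start = strpos($html, $startMarker);
    if ($start === false) {
        return null;
    }
    $start += strlen($startMarker);           // skip past the opening marker
    $end = strpos($html, $endMarker, $start); // look for the closing marker after it
    if ($end === false) {
        return null;
    }
    $segment = substr($html, $start, $end - $start);
    // Drop any remaining tags and surrounding whitespace to leave plain text.
    return trim(strip_tags($segment));
}

// Example with an inline string standing in for a fetched page.
$html = '<html><body><div>menu</div><!-- begin --><p>Hello <b>world</b></p><!-- end --></body></html>';
echo extract_between($html, '<!-- begin -->', '<!-- end -->');
```

Real pages rarely have such tidy markers, so in practice you would pick whatever stable strings sit just before and after the content on the site in question.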

Post by MinDFreeZ »

Yeah, the server I would use is acceptable at work. If possible, the whole page would be good to have: I would need the structure (most of the sites use tables) and the text inside them. I don't care much about the CSS or formatting of the page itself.

It's the script I have a problem with. I'm not sure how to connect out to another site and grab the HTML source from my server.

EDIT: GOT IT!
Haha, it was easy (to find)...

http://us3.php.net/manual/en/function.fopen.php#58099

LOL, I just used this idea and it works perfectly!
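
For anyone reading later, here is a rough sketch of the general idea of pulling a remote page's source from your own server. This is not the code from the linked manual note; `fetch_source` is a made-up name, and it assumes `allow_url_fopen` is enabled so that `file_get_contents()` can open `http://` URLs.

```php
<?php
// Hypothetical helper (not the code from the manual note): read a URL or a
// local path and return its contents escaped for display as page source.
function fetch_source(string $location): ?string
{
    // file_get_contents() can open http:// URLs only when allow_url_fopen is on.
    $raw = @file_get_contents($location);
    if ($raw === false) {
        return null;
    }
    // Escape the markup so the browser shows the source instead of rendering it.
    return htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE);
}

// Example usage; the URL is the placeholder used earlier in this thread.
// $source = fetch_source('http://www.thewebsite.com/thepage');
// echo $source === null ? 'Could not fetch the page' : '<pre>' . $source . '</pre>';
```

If you want the page rendered rather than shown as source, skip the `htmlspecialchars()` call and echo the raw string instead.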

Post by MinDFreeZ »

Now being able to log in to, say, forums or MySpace that I can't get into from work... that would be sweet.

But check it out:

http://enhancedworks.com/test2.php?url= ... yahoo.com/

It works! But I'm assuming that could be dangerous, since someone else could use it with any URL.
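
One way to limit that risk, sketched under the assumption that you only need a handful of known sites: check the `url` parameter against an allowlist before fetching anything. The function name `is_allowed_url` and the host list are hypothetical, so swap in your own.

```php
<?php
// Hypothetical allowlist; replace with the hosts you actually need.
$allowedHosts = array('www.example.com', 'www.thewebsite.com');

// Hypothetical helper: accept only http/https URLs whose host is on the list.
function is_allowed_url(string $url, array $allowedHosts): bool
{
    $bits = parse_url($url);
    if ($bits === false || !isset($bits['scheme'], $bits['host'])) {
        return false;
    }
    // Only plain web requests; this rejects file://, ftp://, php:// and so on.
    if (!in_array(strtolower($bits['scheme']), array('http', 'https'), true)) {
        return false;
    }
    return in_array(strtolower($bits['host']), $allowedHosts, true);
}

// Example usage inside the script:
// if (!isset($_GET['url']) || !is_allowed_url($_GET['url'], $allowedHosts)) {
//     die('That URL is not allowed.');
// }
```

Without a check like this, anyone who finds the script can use your server to request arbitrary addresses, and with the URL wrappers enabled even local files.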
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

If you're setting this up for work to bypass some filters or something, I'd suggest looking into setting up a proxy. There's a module for doing just that available for Apache, along with most web server installs.

Post by MinDFreeZ »

Yeah, I figured. It's just that here at work they have some crazy stuff going on. When I change the proxy settings in IE (the only browser I can use from here), it just doesn't work. They already use some kind of proxy for accounts that are allowed to have internet access.

I've tried setting up an SSL server at home and tunneling through that. It did not work: the connection was refused through PuTTY from work.

But I think I just figured out a way to alter the script so that each link I click on "my" page will also use the script.

For example, if I used the script to view this page and clicked on a thread, it would load that into the script and display it. So that's cool. I just need to figure out how to log in to something now =P
(more specifically, forums and MySpace)

Post by feyd »

You might want to look at the script I posted for Heavy, which sort of acts like a proxy, but for slightly different purposes.

viewtopic.php?t=29312

Post by MinDFreeZ »

http://enhancedworks.com/test3.php

It doesn't work for me. Maybe it's my server, or maybe I'm doing something wrong.
I can't look at line 88 from here at work; I don't have an editor other than WordPad or Notepad.
So I can't even figure out what my problem is =P

Post by feyd »

Sorry, there was a bug introduced by the unwinding of previous versions of the PHP highlighter. I think this fixes it:

Code: Select all

<?php

    ini_set( 'display_errors', '1' );
    error_reporting( E_ALL );
    
    if( !defined( '_DEBUG_' ) ) define( '_DEBUG_', 0 );
    
    if(!empty($_GET['url']))
    {
        $get = trim($_GET['url']);
        if(empty($get))
            $_GET['url'] = '';
        elseif( !preg_match('#^[a-zA-Z]{3,}://#',$_GET['url']))
            $_GET['url'] = 'http://' . $_GET['url'];
    }


    // Escape the user-supplied URL before echoing it back into the form.
    $output = "<html>\n\t<head>\n\t\t<title>{PAGE_TITLE}</title>\n\t</head>\n\t<body>\n\t\t<div align=\"center\"><form><input type=\"text\" name=\"url\" value=\"".(isset($_GET['url'])?htmlspecialchars($_GET['url'], ENT_QUOTES):'')."\" size=\"50\"><input type=\"submit\" value=\" get \"></form><div>";
    
    if( !empty($_GET['inline']) && !empty( $_GET['url'] ) && ( $data = @getimagesize( $_GET[ 'url' ] ) ) !== false )
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_NOBODY, 1);
        curl_setopt($ch, CURLOPT_URL, $_GET['url']);
        curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
        
        $header = '<fieldset><legend style="font-family:sans-serif">&nbsp;Headers&nbsp;</legend><pre style="text-align:left">%s</pre></fieldset>';
        $arr = preg_split("#\n#sS",$raw = curl_exec($ch));
        for($x = 0, $y = sizeof($arr); $x < $y; $x++)
        {
            $arr[$x] = rtrim($arr[$x]);
            if(!empty($arr[$x]) && !isset($found))
                $headers[] = $arr[$x];
            elseif(!empty($arr[$x]))
                $data[] = $arr[$x];
            else
                $found = $x;
        }
        $headers = sprintf($header,htmlentities(implode("\n",$headers),ENT_QUOTES));
        $output .= $headers.'<fieldset><legend style="font-family:sans-serif">&nbsp;Image&nbsp;</legend><img src="' . $_GET['url'] . '" /></fieldset>' . "\n\t\t";
        $output = str_replace('{PAGE_TITLE}', $data['mime'] . ' :: ' . $_GET['url'], $output);
    }
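    elseif(!empty($_GET['inline']))
    {
        // Inline mode (used by the iframe further down): the posted code set up
        // the request but never ran it, so presumably the intent was to fetch
        // the page and send it back as-is.
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_URL, $_GET['url']);
        curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
        $raw = curl_exec($ch);
        curl_close($ch);
        echo ($raw === false) ? '' : $raw;
        exit;
    }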
    elseif(!empty($_GET['url']))
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_URL, $_GET['url']);
        curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);

        //$data = file_get_contents( $_GET[ 'url' ] );
        $header = '<fieldset><legend style="font-family:sans-serif">&nbsp;Headers&nbsp;</legend><pre style="text-align:left">%s</pre></fieldset>';
        $arr = preg_split("#\n#sS",$raw = curl_exec($ch));
        for($x = 0, $y = sizeof($arr); $x < $y; $x++)
        {
            $arr[$x] = rtrim($arr[$x]);
            if(!empty($arr[$x]) && !isset($found))
                $headers[] = $arr[$x];
            elseif(!empty($arr[$x]))
                $data[] = $arr[$x];
            else
                $found = $x;
        }
        $headers = sprintf($header,htmlentities(implode("\n",$headers),ENT_QUOTES));
        $data = implode("\n",$data);
        curl_close($ch);
        if(preg_match('#<\s*title.*?>(.*?)<\s*/\s*title.*?>#is',$data,$title))
        {
            $output = str_replace('{PAGE_TITLE}', $_GET['url'] . ' :: ' . $title[1], $output);
        }
        else
        {
            $output = str_replace('{PAGE_TITLE}', 'No page title', $output);
        }
        $urls = array( 'href', 'src', 'action', 'background' );    //    resolve these attributes from the text
        
        $urls = implode( '|', $urls );
        preg_match_all( '#\s+?(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $data, $matches );
        
        $data = htmlentities( $data, ENT_QUOTES );
        
        $site = $_GET[ 'url' ];
        $bits = parse_url( $site );
        $root = $bits[ 'scheme' ] . '://' .
            ( isset( $bits[ 'user' ] ) ? $bits[ 'user' ] : '' ) .
            ( isset( $bits[ 'pass' ] ) ? ':' . $bits[ 'pass' ] : '' ) .    //    parse_url() names this key 'pass'
            ( isset( $bits[ 'user' ] ) ? '@' : '' ) .
            $bits[ 'host' ] .
            ( isset( $bits[ 'port' ] ) ? ':' . $bits[ 'port' ] : '' );
        $path = (isset($bits[ 'path' ]) ? explode( '/', $bits[ 'path' ] ) : array());
        array_pop( $path );
        $path = $root . implode( '/', $path ) . '/';
        
        if( _DEBUG_ )
        $output .= '<div align="left"><pre>' . var_export($matches, true) . '</pre></div>';
            
        
        $pos = 0;
        foreach($matches[0] as $match)
        {
            $url = $matches[ 3 ][ $pos ];
            if(!empty($url))
            {
                list( $left, $right ) = explode( $url, $match );

                $left = htmlentities( $left, ENT_QUOTES );
                $right = htmlentities( $right, ENT_QUOTES );

                //echo $left . $right . "<br />";

                if( preg_match( '#://#', $url ) )
                {    //    full url
                    $newurl = $url;
                }
                elseif( $url[0] == '/' )    //    $url{0} syntax was removed in PHP 8
                {    //    literal
                    $newurl = $root . $url;
                }
                elseif( preg_match( '#^[a-z0-9_-]+:#i', $url ) )
                {
                    $pos++;
                    continue;
                }
                else
                {
                    $newurl = $path . $url;
                }

                $data = preg_replace( '^' . preg_quote( htmlentities( $match, ENT_QUOTES ), '^' ) .'^', $left . '<a href="' . $_SERVER['SCRIPT_NAME'] . '?url=' . urlencode( $newurl ) .'">' . $url . '</a>' . $right, $data, 1 );
            }
            
            $pos++;
        }
        
        $output .= $headers.'<fieldset><legend style="font-family:sans-serif">&nbsp;HTML&nbsp;</legend><pre style="text-align:left">' . $data . '</pre></fieldset>' . "\n\t\t";
        //    the iframe should load the page being viewed ($site), not the last link matched in the loop
        $output .= '<fieldset><legend style="font-family:sans-serif">&nbsp;Page&nbsp;</legend><iframe src="' . $_SERVER['SCRIPT_NAME'] . '?inline=1&url=' . urlencode($site) . '" width="100%" height="100%">iframe required</iframe></fieldset>';
    }
    else
        $output = str_replace('{PAGE_TITLE}' , 'Welcome to feyd\'s Page Source Browser', $output);

    $output .= "</div>\n\t</body>\n</html>";
    
    echo $output;

?>

Post by MinDFreeZ »

That's really cool, man. Good stuff.

Code: Select all

<?php
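$siteUrl = "http://enhancedworks.com/test4.php?url="; // (or whatever the url to this script is)
// register_globals is gone in modern PHP, so read the target from $_GET explicitly.
$url = isset($_GET['url']) ? $_GET['url'] : '';
$data = ($url !== '') ? @file($url) : false;
if ($data === false) {
    die("Could not read the requested page");
}
foreach ($data as $tempString) {
    $point1 = strpos($tempString, "href=");
    // strpos() returns 0 for a match at the start of the line, so compare with !== false.
    if ($point1 !== false) {
        // +6 skips past href= and the opening quote; only the first href on each line is rewritten.
        $part1 = substr($tempString, 0, $point1 + 6);
        $part2 = substr($tempString, $point1 + 6);
        echo $part1 . $siteUrl . $part2;
    } else {
        echo $tempString;
    }
}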
?>
^ That actually worked for me. It also placed the link to the script before each link on the page, so when I clicked a link that I wasn't able to get to, it would load it into the script too. So all is fine and dandy :P
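
A possible refinement of the same link-prefixing idea, as a rough sketch: rewrite every quoted `href` (not just the first one on each line), resolve relative links against the page's own address, and URL-encode the target so its query string survives. The name `rewrite_links` and its `$baseUrl` parameter are made up for illustration, and it does not handle `../` paths or protocol-relative `//host/...` links.

```php
<?php
// Hypothetical helper: point every quoted href in $html back at the script.
// $scriptUrl is the script address ending in "?url=", and $baseUrl is the
// absolute address of the page that $html was fetched from.
function rewrite_links(string $html, string $scriptUrl, string $baseUrl): string
{
    $bits = parse_url($baseUrl);
    $root = $bits['scheme'] . '://' . $bits['host']
          . (isset($bits['port']) ? ':' . $bits['port'] : '');
    $path = isset($bits['path']) ? $bits['path'] : '/';
    // Directory of the current page, always ending in a slash.
    $dir = $root . substr($path, 0, strrpos($path, '/') + 1);

    $result = preg_replace_callback(
        '/href=(["\'])(.*?)\1/i',
        function (array $m) use ($scriptUrl, $root, $dir): string {
            $target = html_entity_decode($m[2], ENT_QUOTES);
            if ($target === '' || $target[0] === '#') {
                return $m[0];                    // empty link or same-page anchor
            }
            if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $target)) {
                if (!preg_match('#^https?://#i', $target)) {
                    return $m[0];                // mailto:, javascript: and similar
                }
                $absolute = $target;             // already a full URL
            } elseif ($target[0] === '/') {
                $absolute = $root . $target;     // relative to the site root
            } else {
                $absolute = $dir . $target;      // relative to the current directory
            }
            return 'href=' . $m[1] . $scriptUrl . urlencode($absolute) . $m[1];
        },
        $html
    );
    return $result ?? $html;                     // fall back to the input on a regex error
}

// Example with an inline string standing in for a fetched page.
$siteUrl = "http://enhancedworks.com/test4.php?url=";
$html = '<a href="/news">News</a> <a href="about.html">About</a> <a href="mailto:me@example.com">Mail</a>';
echo rewrite_links($html, $siteUrl, 'http://www.example.com/dir/page.html');
```

Encoding the target matters because an unencoded link such as `page.php?a=1&b=2` would otherwise have its `&b=2` read as a parameter of the script itself.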
phpkar
Forum Newbie
Posts: 4
Joined: Sat Feb 18, 2006 6:53 am

Grab an HTML page and send by e-mail or display it

Post by phpkar »

This class fetches the page at a given URL and sends it by e-mail as an HTML message. It provides functions to set the From:, To:, Cc:, Bcc: and Subject: message headers. Alternatively, the class can display the fetched HTML page by outputting it from the current script.

http://www.php45.com/class.php?id=494

Post by MinDFreeZ »

Sweet stuff, man. Good post.