
Grab the source of an HTML page and display it

Posted: Tue Feb 14, 2006 1:13 pm
by MinDFreeZ
Alright.. let me explain the situation....

there are a lot of websites I can't access from work.. and the only important part is the text that's on the page.. so I'm wondering if it's possible to create something that can connect to the website and grab the html source, or just the text of the page, and echo it or create a file with it...... I found this on some site, but don't know if it can help me get started... it's supposed to grab the links on a page...

Code:

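<?php

// Print every HREF attribute found on a remote page.
$URL = "http://www.thewebsite.com/thepage";
$page = @fopen($URL, "r"); // needs allow_url_fopen enabled in php.ini
if ($page === false) {
    die("Could not open $URL");
}
print("Links at $URL<BR>\n");
print("<UL>\n");
// Read whole lines so a link is not cut in half by a length limit
while (($line = fgets($page)) !== false) {
    // preg_match_all() takes the place of the eregi()/ereg_replace() loop,
    // since the ereg functions no longer exist in current PHP
    if (preg_match_all('/HREF="[^"]*"/i', $line, $matches)) {
        foreach ($matches[0] as $match) {
            print("<LI>" . htmlspecialchars($match) . "</LI>\n");
        }
    }
}
print("</UL>\n");
fclose($page);

?>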
unless there's some way to set something up on my server that I can be able to use to get to other websites..

anyone know of a way to do this?

Posted: Tue Feb 14, 2006 1:28 pm
by RobertGonzalez
You would have to write a script that lives on a server that is acceptable at work. That script would have to generate HTML output of the page(s) you are looking for. Then you would have to break the HTML down to a point that you know exists around the segment of text you want. Trash what's before and what's after and you have your text.
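A minimal sketch of that trimming step, assuming the page's HTML is already in a string and you know two markers that surround the text you want. The function name extract_between and the markers in the usage comment are made up for illustration, not taken from any real site.

```php
<?php
// Return the part of $html that sits between $startMarker and $endMarker.
// Both markers are excluded from the result; null means a marker was not found.
function extract_between($html, $startMarker, $endMarker)
{
    $start = strpos($html, $startMarker);
    if ($start === false) {
        return null;                      // nothing to anchor on
    }
    $start += strlen($startMarker);       // skip past the opening marker

    $end = strpos($html, $endMarker, $start);
    if ($end === false) {
        return null;                      // opening marker found, closing one missing
    }
    return substr($html, $start, $end - $start);
}

// Usage (hypothetical markers):
// $story = extract_between($html, '<td class="story">', '</td>');
```

This breaks as soon as the site changes its markup, so it is worth checking for null and falling back to the whole page.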

You may want to look into the legal aspects of this. This process, while acceptable to some sites, may get you into legal trouble with others. Be careful when using this type of process.

Posted: Tue Feb 14, 2006 1:33 pm
by MinDFreeZ
yea, the server I would use is acceptable at work... and if possible, the whole page would be good to have.... I would need the structure (most of the sites use tables) and the text inside them... don't care much about the CSS or formatting of the page itself....

it's the script I have a problem with.. I'm not sure how to connect out to another site and grab the html source, from my server..

EDIT: GOT IT!
haha.. it was easy (to find)...

http://us3.php.net/manual/en/function.fopen.php#58099


lol - just used this idea and it works perfect!
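For anyone reading later, a rough sketch of the general idea (not necessarily what the linked manual comment does): fetch the page with file_get_contents(), then either show the escaped source or boil it down to plain text. The three function names are made up for this example, and it assumes allow_url_fopen is enabled on the server.

```php
<?php
// Fetch the raw HTML of a page, or return false on failure.
// Assumes allow_url_fopen is enabled in php.ini.
function fetch_source($url)
{
    $context = stream_context_create(array(
        'http' => array('timeout' => 10),   // give up after 10 seconds
    ));
    return @file_get_contents($url, false, $context);
}

// Escape the markup so the browser shows the source instead of rendering it.
function source_for_display($html)
{
    return '<pre>' . htmlspecialchars($html, ENT_QUOTES) . '</pre>';
}

// Reduce the page to its text: drop script/style blocks, then all tags,
// then collapse runs of whitespace into single spaces.
function page_text($html)
{
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES);
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Usage:
// $html = fetch_source('http://www.thewebsite.com/thepage');
// if ($html !== false) {
//     echo source_for_display($html);   // or: echo page_text($html);
// }
```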

Posted: Tue Feb 14, 2006 8:12 pm
by MinDFreeZ
now being able to login to say.... forums.. or myspace... that I can't get into from work... that would be sweet...

but check it out..

http://enhancedworks.com/test2.php?url= ... yahoo.com/

it works! .... but I'm assuming that could be dangerous.. since someone else could use it with any URL ..
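One way to limit that risk, sketched under the assumption that you only ever need a handful of sites: check the requested URL against a list of hosts before fetching anything. The function name is_allowed_url and the host list are hypothetical.

```php
<?php
// Accept only http/https URLs whose host is on the allow list.
// This also rejects schemes such as file://, which would otherwise let a
// visitor read files from the server the script runs on.
function is_allowed_url($url, array $allowedHosts)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;                                   // malformed or host-less URL
    }
    if (!in_array(strtolower($parts['scheme']), array('http', 'https'), true)) {
        return false;                                   // no file://, ftp://, ...
    }
    return in_array(strtolower($parts['host']), $allowedHosts, true);
}

// Usage (hypothetical list):
// $allowed = array('www.yahoo.com');
// if (!isset($_GET['url']) || !is_allowed_url($_GET['url'], $allowed)) {
//     die('That URL is not allowed');
// }
```

Putting the script behind a password (for example HTTP authentication on the server) would be another option; the allow list just keeps strangers from using it as an open relay.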

Posted: Tue Feb 14, 2006 8:36 pm
by feyd
if you're setting this up for work to bypass some filters or something, I'd suggest looking into setting up a proxy. There's a module for doing just that available for Apache, along with most web server installs.

Posted: Tue Feb 14, 2006 9:06 pm
by MinDFreeZ
yea I figured... it's just that, here at work... they have some crazy stuff going on.... when I change the proxy settings in IE (the only browser I can use from here) it just doesn't work.. they use some kind of proxy already, for accounts that are allowed to have internet access.....

I've tried setting up an SSL server at home and tunneling through that.. did not work, connection refused through putty from work...

but I think I just figured out a way to alter the script so that each link I click on "my" page... will also use the script...

like if I used the script to view this page, and I clicked on a thread, it would load that into the script and display it.. so that's cool.. just need to figure out how to login to something now =P
(more specifically; forums and myspace)

Posted: Tue Feb 14, 2006 9:18 pm
by feyd
you might want to look at the script I posted for Heavy, which sort of acts like a proxy.. but for slightly different purposes.

viewtopic.php?t=29312

Posted: Tue Feb 14, 2006 9:49 pm
by MinDFreeZ
http://enhancedworks.com/test3.php

doesn't work for me.. maybe it's my server.. or maybe I'm doing something wrong.
-I can't look at line 88 from here at work.. I don't have an editor other than wordpad or notepad.
So I can't even figure out what my problem is =P

Posted: Tue Feb 14, 2006 11:13 pm
by feyd
Sorry.. there was a bug introduced by the unwinding of previous versions of the php highlighter. I think this fixes it

Code:

<?php

    ini_set( 'display_errors', '1' );
    error_reporting( E_ALL );
    
    if( !defined( '_DEBUG_' ) ) define( '_DEBUG_', 0 );
    
    if(!empty($_GET['url']))
    {
        $_GET['url'] = trim($_GET['url']);    //    keep the trimmed value for everything below
        if( $_GET['url'] !== '' && !preg_match('#^[a-zA-Z]{3,}://#',$_GET['url']))
            $_GET['url'] = 'http://' . $_GET['url'];
    }


    $output = "<html>\n\t<head>\n\t\t<title>{PAGE_TITLE}</title>\n\t</head>\n\t<body>\n\t\t<div align=\"center\"><form><input type=\"text\" name=\"url\" value=\"".(isset($_GET['url'])?htmlspecialchars($_GET['url'], ENT_QUOTES):'')."\" size=\"50\"><input type=\"submit\" value=\" get \"></form><div>";
    
    if( !empty($_GET['inline']) && !empty( $_GET['url'] ) && ( $data = @getimagesize( $_GET[ 'url' ] ) ) !== false )
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_NOBODY, 1);
        curl_setopt($ch, CURLOPT_URL, $_GET['url']);
        curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
        
        $header = '<fieldset><legend style="font-family:sans-serif">&nbsp;Headers&nbsp;</legend><pre style="text-align:left">%s</pre></fieldset>';
        $arr = preg_split("#\n#sS",$raw = curl_exec($ch));
        for($x = 0, $y = sizeof($arr); $x < $y; $x++)
        {
            $arr[$x] = rtrim($arr[$x]);
            if(!empty($arr[$x]) && !isset($found))
                $headers[] = $arr[$x];
            elseif(!empty($arr[$x]))
                $data[] = $arr[$x];
            else
                $found = $x;
        }
        $headers = sprintf($header,htmlentities(implode("\n",$headers),ENT_QUOTES));
        $output .= $headers.'<fieldset><legend style="font-family:sans-serif">&nbsp;Image&nbsp;</legend><img src="' . htmlspecialchars($_GET['url'], ENT_QUOTES) . '" /></fieldset>' . "\n\t\t";
        $output = str_replace('{PAGE_TITLE}', $data['mime'] . ' :: ' . $_GET['url'], $output);
    }
    elseif(!empty($_GET['inline']))
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_URL, $_GET['url']);
        curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
        //    send the fetched page straight to the iframe and skip the wrapper markup
        echo curl_exec($ch);
        curl_close($ch);
        exit;
    }
    elseif(!empty($_GET['url']))
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_URL, $_GET['url']);
        curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);

        //$data = file_get_contents( $_GET[ 'url' ] );
        $header = '<fieldset><legend style="font-family:sans-serif">&nbsp;Headers&nbsp;</legend><pre style="text-align:left">%s</pre></fieldset>';
        $arr = preg_split("#\n#sS",$raw = curl_exec($ch));
        for($x = 0, $y = sizeof($arr); $x < $y; $x++)
        {
            $arr[$x] = rtrim($arr[$x]);
            if(!empty($arr[$x]) && !isset($found))
                $headers[] = $arr[$x];
            elseif(!empty($arr[$x]))
                $data[] = $arr[$x];
            else
                $found = $x;
        }
        $headers = sprintf($header,htmlentities(implode("\n",$headers),ENT_QUOTES));
        $data = implode("\n",$data);
        curl_close($ch);
        if(preg_match('#<\s*title.*?>(.*?)<\s*/\s*title.*?>#is',$data,$title))
        {
            $output = str_replace('{PAGE_TITLE}', $_GET['url'] . ' :: ' . $title[1], $output);
        }
        else
        {
            $output = str_replace('{PAGE_TITLE}', 'No page title', $output);
        }
        $urls = array( 'href', 'src', 'action', 'background' );    //    resolve these attributes from the text
        
        $urls = implode( '|', $urls );
        preg_match_all( '#\s+?(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $data, $matches );
        
        $data = htmlentities( $data, ENT_QUOTES );
        
        $site = $_GET[ 'url' ];
        $bits = parse_url( $site );
        $root = $bits[ 'scheme' ] . '://' .
            ( isset( $bits[ 'user' ] ) ? $bits[ 'user' ] : '' ) .
            ( isset( $bits[ 'pass' ] ) ? ':' . $bits[ 'pass' ] : '' ) .    //    parse_url() names this key 'pass'
            ( isset( $bits[ 'user' ] ) ? '@' : '' ) .
            $bits[ 'host' ] .
            ( isset( $bits[ 'port' ] ) ? ':' . $bits[ 'port' ] : '' );
        $path = (isset($bits[ 'path' ]) ? explode( '/', $bits[ 'path' ] ) : array());
        array_pop( $path );
        $path = $root . implode( '/', $path ) . '/';
        
        if( _DEBUG_ )
        $output .= '<div align="left"><pre>' . var_export($matches, true) . '</pre></div>';
            
        
        $pos = 0;
        foreach($matches[0] as $match)
        {
            $url = $matches[ 3 ][ $pos ];
            if(!empty($url))
            {
                list( $left, $right ) = explode( $url, $match );

                $left = htmlentities( $left, ENT_QUOTES );
                $right = htmlentities( $right, ENT_QUOTES );

                //echo $left . $right . "<br />";

                if( preg_match( '#://#', $url ) )
                {    //    full url
                    $newurl = $url;
                }
                elseif( $url[0] == '/' )
                {    //    literal
                    $newurl = $root . $url;
                }
                elseif( preg_match( '#^[a-z0-9_-]+:#i', $url ) )
                {
                    $pos++;
                    continue;
                }
                else
                {
                    $newurl = $path . $url;
                }

                $data = preg_replace( '^' . preg_quote( htmlentities( $match, ENT_QUOTES ), '^' ) .'^', $left . '<a href="' . $_SERVER['SCRIPT_NAME'] . '?url=' . urlencode( $newurl ) .'">' . $url . '</a>' . $right, $data, 1 );
            }
            
            $pos++;
        }
        
        $output .= $headers.'<fieldset><legend style="font-family:sans-serif">&nbsp;HTML&nbsp;</legend><pre style="text-align:left">' . $data . '</pre></fieldset>' . "\n\t\t";
        $output .= '<fieldset><legend style="font-family:sans-serif">&nbsp;Page&nbsp;</legend><iframe src="' . $_SERVER['SCRIPT_NAME'] . '?inline=1&url=' . urlencode($site) . '" width="100%" height="100%">iframe required</iframe></fieldset>';
    }
    else
        $output = str_replace('{PAGE_TITLE}' , 'Welcome to feyd\'s Page Source Browser', $output);

    $output .= "</div>\n\t</body>\n</html>";
    
    echo $output;

?>

Posted: Fri Feb 17, 2006 9:22 pm
by MinDFreeZ
that's really cool man... good stuff..

Code:

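<?php
$siteUrl = "http://enhancedworks.com/test4.php?url="; // (or whatever the url to this script is)
$url = isset($_GET['url']) ? $_GET['url'] : ''; // read the parameter directly, register_globals is gone
// file() returns the page as an array with one element per line
if ($url === '' || ($data = @file($url)) === false) {
    die("Could not fetch the page");
}
foreach ($data as $tempString) {
    $point1 = strpos($tempString, "href=");
    if ($point1 !== false) { // strict check: strpos() returns 0 when the line starts with href=
        // split right after href=" and put the script url in front of the link
        $part1 = substr($tempString, 0, $point1 + 6);
        $part2 = substr($tempString, $point1 + 6);
        echo $part1 . $siteUrl . $part2;
    } else {
        echo $tempString;
    }
}
?>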
^ that actually worked for me... it also placed the link to the script before each link on the page, so when I clicked a link that I wasn't able to get to, it would load it into the script too.... so all is fine and dandy :P
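A variation on the same idea, as a sketch only: the version above rewrites just the first href on each line and leaves relative links pointing at nothing useful. The hypothetical function below (proxy_links is a made-up name) uses preg_replace_callback() to catch every quoted href, turns relative links into absolute ones, and urlencodes them before putting the script URL in front. It assumes the hrefs are quoted and ignores things like "../" segments and <base> tags.

```php
<?php
// Rewrite every quoted href in $html so it points back through the script.
// $pageUrl is the page the HTML came from; $scriptUrl ends in "?url=".
function proxy_links($html, $pageUrl, $scriptUrl)
{
    $base   = parse_url($pageUrl);
    $scheme = $base['scheme'];
    $root   = $scheme . '://' . $base['host']
            . (isset($base['port']) ? ':' . $base['port'] : '');
    $path   = isset($base['path']) ? $base['path'] : '/';
    $dir    = substr($path, 0, strrpos($path, '/') + 1);   // directory of the page

    return preg_replace_callback(
        '#\bhref\s*=\s*(["\'])(.*?)\1#is',
        function ($m) use ($scheme, $root, $dir, $scriptUrl) {
            $href = html_entity_decode($m[2], ENT_QUOTES);
            if ($href === '' || $href[0] === '#') {
                return $m[0];                               // empty or in-page anchor
            }
            if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
                if (!preg_match('#^https?://#i', $href)) {
                    return $m[0];                           // mailto:, javascript:, ...
                }
                $abs = $href;                               // already absolute
            } elseif (substr($href, 0, 2) === '//') {
                $abs = $scheme . ':' . $href;               // protocol-relative
            } elseif ($href[0] === '/') {
                $abs = $root . $href;                       // relative to the site root
            } else {
                $abs = $root . $dir . $href;                // relative to the page
            }
            $target = htmlspecialchars($scriptUrl . urlencode($abs), ENT_QUOTES);
            return 'href=' . $m[1] . $target . $m[1];
        },
        $html
    );
}

// Usage:
// echo proxy_links($html, $url, 'http://enhancedworks.com/test4.php?url=');
```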

Grab an HTML page and send by e-mail or display it

Posted: Sat Feb 18, 2006 8:26 am
by phpkar
This class fetches the page at a given URL and sends it by e-mail as an HTML message. It provides functions to set the From:, To:, Cc:, Bcc: and Subject: message headers. Alternatively, the class can display the fetched HTML page by outputting it from the current script.

http://www.php45.com/class.php?id=494

Posted: Sat Feb 18, 2006 9:39 am
by MinDFreeZ
sweet stuff man.. good post.