Explode but not really - Looking to pull out tags

waradmin
Forum Contributor
Posts: 240
Joined: Fri Nov 04, 2005 2:57 pm

Post by waradmin »

I am trying to write a little script that goes through the HTML source of some of my pages, pulls out the images, and displays just the images. Essentially, I just want to capture the src value inside each <img /> tag, but I am having problems with this.

The dummy junk code I came up with

Code: Select all

$partA = explode("<img src=", $result);
$count = count($partA);
for($i=0;$i<$count;$i++)
{
	$tmp = $partA[$i];
	$partB = explode("/>", $tmp);
	$img = $partB[0];
	$img = str_replace("'", "", $img);
	echo "<img src=$img>"; 
}
It displays the images, but it also ends up echoing text. Is there a better way to do this, i.e. have PHP search a string for, say, <img and capture everything until it reaches the end of an extension like .gif, and then store just that in an array?

Sorry if it's confusing, thanks for any help provided!

-Steve
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

You're better off using a regular expression to capture the contents of each img src attribute. Untested.

Code: Select all

preg_match_all('/<img.*?src="([^"]+)/im', $htmlsource, $matches);

echo '<pre>';
print_r($matches);
echo '</pre>';
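
For reference, a minimal sketch of how the result is laid out, assuming the src attributes are double-quoted: $matches[0] holds the text of each full match and $matches[1] holds just the captured src values. The sample markup in $htmlsource is made up for illustration.

```php
<?php
// Hypothetical sample markup; any HTML string would do.
$htmlsource = '<p>Logo: <img src="images/logo.png" alt="logo"> and '
            . '<img border="0" src="images/icon.gif"></p>';

// $matches[0] = text of each full match, $matches[1] = first capture group.
preg_match_all('/<img.*?src="([^"]+)/i', $htmlsource, $matches);

$sources = $matches[1];

foreach ($sources as $src) {
    // Escape before echoing so the value is shown as text, not parsed as HTML.
    echo htmlspecialchars($src), "\n";
}
```

Working from $matches[1] rather than $matches[0] avoids echoing partial tags back into the page.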
waradmin
Forum Contributor
Posts: 240
Joined: Fri Nov 04, 2005 2:57 pm

Post by waradmin »

Alright, cool, that makes a lot of sense. Now it is semi-working, however it is displaying this after the image tags:

Code: Select all

%20%20%20%20%20%20%20%20%20%20%20%20[5]%20=%3E%20%3Cimg%20style='vertical-align:bottom'%20alt=
So it is like:

Code: Select all

http://www.someserver.com/images/image.png%20%20%20%20%20%20%20%20%20%20%20%20[5]%20=%3E%20%3Cimg%20style='vertical-align:bottom'%20alt=
Any idea why it's doing this?
Jonah Bron
DevNet Master
Posts: 2764
Joined: Thu Mar 15, 2007 6:28 pm
Location: Redding, California

Post by Jonah Bron »

Hmmm. I've been thinking about a similar process for getting <a> tags, for an index. Maybe this?

Code: Select all

<?php
echo "<html><head><title>Only pictures</title></head><body>\n";
$document = file_get_contents($_SERVER['SCRIPT_FILENAME']);//get current file contents
$document = explode('<img src="', $document);//explode at <img src="
$new_doc = array();//create var for new doc output
for ($i=1;$i<count($document);$i++){//skip first part of $document, because it is before the first <img src="
  $new_doc[$i-1] = $document[$i];//add a new part to new document, with exploded img, without the first part
}
foreach ($new_doc as $sub_doc){//above was so we could just use foreach now.  loop through each separate part of new document
  $sub_doc = explode('"', $sub_doc);//explode sub document, so it is now a nested array.
}
/*now, each part of the new document is $new_doc[eachpart][0]. below, output all pictures:*/
foreach ($new_doc as $sub_doc){//loop through each doc
  echo $sub_doc[0] ."<br />\n";//output each one
}
die("</body></html>");//don't show the rest of the page.
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

waradmin wrote:Alright, cool, that makes a lot of sense. Now it is semi-working, however it is displaying this after the image tags:

Code: Select all

%20%20%20%20%20%20%20%20%20%20%20%20[5]%20=%3E%20%3Cimg%20style='vertical-align:bottom'%20alt=
So it is like:

Code: Select all

http://www.someserver.com/images/image.png%20%20%20%20%20%20%20%20%20%20%20%20[5]%20=%3E%20%3Cimg%20style='vertical-align:bottom'%20alt=
Any idea why it's doing this?
I'm not exactly sure what you mean. Can you post an example of the source?
PHPyoungster wrote:Hmmm. I've been thinking about a similar process for getting <a> tags, for an index. Maybe this?

Code: Select all

<?php
echo "<html><head><title>Only pictures</title></head><body>\n";
$document = file_get_contents($_SERVER['SCRIPT_FILENAME']);//get current file contents
$document = explode('<img src="', $document);//explode at <img src="
$new_doc = array();//create var for new doc output
for ($i=1;$i<count($document);$i++){//skip first part of $document, because it is before the first <img src="
  $new_doc[$i-1] = $document[$i];//add a new part to new document, with exploded img, without the first part
}
foreach ($new_doc as $sub_doc){//above was so we could just use foreach now.  loop through each separate part of new document
  $sub_doc = explode('"', $sub_doc);//explode sub document, so it is now a nested array.
}
/*now, each part of the new document is $new_doc[eachpart][0]. below, output all pictures:*/
foreach ($new_doc as $sub_doc){//loop through each doc
  echo $sub_doc[0] ."<br />\n";//output each one
}
die("</body></html>");//don't show the rest of the page.
Any particular reason not to use a regular expression? It makes processing a lot simpler :)
waradmin
Forum Contributor
Posts: 240
Joined: Fri Nov 04, 2005 2:57 pm

Post by waradmin »

Alright, the code posted second just displays a bunch of "h" characters on the screen.
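
A likely explanation for those stray characters, sketched under the assumption that the page being read uses absolute URLs starting with http: the first foreach in that script assigns the exploded array to a loop copy, so $new_doc still holds plain strings, and $sub_doc[0] on a string is just its first character. Keeping the piece before the closing double quote avoids that. The sample $document string and the variable names below are made up for illustration.

```php
<?php
// Hypothetical sample page; the real script would load its own HTML.
$document = '<img src="http://example.com/a.png"> text '
          . '<img src="http://example.com/b.gif" alt="b">';

$parts = explode('<img src="', $document);
array_shift($parts); // drop everything before the first <img src="

$sources = array();
foreach ($parts as $part) {
    // Keep only what precedes the closing quote of the src attribute.
    $pieces = explode('"', $part);
    $sources[] = $pieces[0];
}

// For comparison: indexing a string with [0] yields its first character.
$firstChar = $parts[0][0];

print_r($sources);
```

This only works when every tag is written exactly as <img src="...", which is one reason the regular expression approach is more forgiving.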

As for the code that was buggy:

Code: Select all

<?php
function getRemoteFile($url)
{
   // get the host name and url path
   $parsedUrl = parse_url($url);
   $host = $parsedUrl['host'];
   if (isset($parsedUrl['path'])) {
      $path = $parsedUrl['path'];
   } else {
      $path = '/';
   }

   if (isset($parsedUrl['query'])) {
      $path .= '?' . $parsedUrl['query'];
   } 

   if (isset($parsedUrl['port'])) {
      $port = $parsedUrl['port'];
   } else {
      // most sites use port 80
      $port = '80';
   }

   $timeout = 10;
   $response = '';
   // connect to the remote server 
   $fp = @fsockopen($host, $port, $errno, $errstr, $timeout );

   if( !$fp ) { 
      echo "Cannot retrieve $url";
   } else {
      // send the necessary headers to get the file 
      // ask the server to close the connection so the read loop below ends at EOF
      fputs($fp, "GET $path HTTP/1.0\r\n" .
                 "Host: $host\r\n" .
                 "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3\r\n" .
                 "Accept: */*\r\n" .
                 "Accept-Language: en-us,en;q=0.5\r\n" .
                 "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n" .
                 "Connection: close\r\n" .
                 "Referer: http://$host\r\n\r\n");

      // retrieve the response from the remote server 
      while ( $line = fread( $fp, 4096 ) ) { 
         $response .= $line;
      }

      fclose( $fp );

      // strip the headers
      $pos      = strpos($response, "\r\n\r\n");
      // if no header/body separator was found, treat the body as empty
      $response = ($pos === false) ? '' : substr($response, $pos + 4);
   }

   // return the file content 
   return $response;
}
$result = getRemoteFile($_GET['path']);

preg_match_all("/<img.*?src=\"([^\"]+)/im", $result, $matches); 

echo '<pre>'; 
print_r($matches); 
echo '</pre>';
?>
Outputs:

Code: Select all

Array
(
    [0] => Array
        (
            [0] =>           Array
        (
            [0] => images/logo_top.png
            [1] => images/buttons/home_on.png
            [2] => images/buttons/about.png
            [3] => images/buttons/members.png
            [4] => images/buttons/register.png
            [5] => images/buttons/search.png
            [6] => images/buttons/statistics.png
            [7] => images/buttons/contact.png
            [8] => images/buttons/tutorial.png
            [9] => images/buttons/forums.png
            [10] => images/mysql_powered.png
            [11] => images/pwrd_apache.gif
            [12] => images/icon_mini-xml.png
            [13] => images/php-power-micro2.png
            [14] => images/linux_powered.gif
            [15] => images/loadtime.gif
            [16] => images/queries.gif
            [17] => images/gzip.gif
            [18] => images/icon_user.gif
            [19] => images/load.gif
        )

)
Note: here is the real output I am getting (a screenshot, because I can't copy and paste dead images):
Image
(The blue question-mark boxes are the images that are not displaying; that's how Safari displays dead images.)

And the "image source" for the images that are not displaying looks like:

Code: Select all

http://localhost/~steve/images/logo_top.png%20%20%20%20%20%20%20%20%20%20%20%20[1]%20=%3E%20%3Cimg%20src=
Where it should be:

Code: Select all

images/logo_top.png
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

It is capturing the image sources just fine... I don't get what you are trying to say. The images are not showing in your output and appear broken because they are relative URLs. If you want to actually display the images, you'll need to prepend the domain name to the source.
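
A rough sketch of that prepending step, assuming the URL of the scraped page is known and the src values are plain relative or root-relative paths (../ segments and protocol-relative // URLs are not handled). The function name absolutize() and the example URLs are invented for illustration.

```php
<?php
/**
 * Turn a relative image path into an absolute URL based on the page URL.
 * Hypothetical helper; only handles plain relative and root-relative paths.
 */
function absolutize($src, $pageUrl)
{
    // Already absolute (http:// or https://): leave untouched.
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }

    $parts = parse_url($pageUrl);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    // Root-relative path: attach directly to scheme and host.
    if (substr($src, 0, 1) === '/') {
        return $root . $src;
    }

    // Otherwise resolve against the directory of the page.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $root . $dir . $src;
}

echo absolutize('images/logo_top.png', 'http://www.example.com/v2/'), "\n";
```

Each captured src could be passed through such a helper before being echoed inside an img tag.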
waradmin
Forum Contributor
Posts: 240
Joined: Fri Nov 04, 2005 2:57 pm

Post by waradmin »

What I don't get is why there are 10 images like "http://localhost/~steve/images/logo_top ... img%20src=" while the remainder do not look like that.
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

Can you post a link to, or the source code of, the site you are scraping (not your own code)?
waradmin
Forum Contributor
Posts: 240
Joined: Fri Nov 04, 2005 2:57 pm

Post by waradmin »

Just to be clear, I am doing this on my own sites. Here is the code:

Code: Select all

<?php
include('top.php');
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>RapidHash - v2 [-UNFINISHED BETA-]</title>
<link rel="stylesheet" href="style.css" type="text/css" />
<link rel="stylesheet" href="includes/tooltips.css" type="text/css" media="screen" />
</head>
<body topmargin="0" leftmargin="0" rightmargin="0">
<table width="100%" border="0" cellpadding="0" cellspacing="0" align="center">
<tr><td><div align="center"><img src="images/logo_top.png"></div></td></tr>
</table>
 <table width="750" border="0" cellspacing="0" cellpadding="0" background="images/buttons/bg.png" align="center">
    <tr><td>
    <div align="center">
    <a href="?page=home"><? if(($_GET['page'] == "") || ($_GET['page'] == "home")) { echo "<img src=\"images/buttons/home_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/home.png\" border=\"0\">"; } ?></a>
    <a href="?page=about"><? if($_GET['page'] == "about") { echo "<img src=\"images/buttons/about_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/about.png\" border=\"0\">"; } ?></a>
    <a href="?page=members"><? if($_GET['page'] == "members") { echo "<img src=\"images/buttons/members_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/members.png\" border=\"0\">"; } ?></a>
    <?php if($_SESSION['uname'] == "") { ?><a href="?page=register"><? if($_GET['page'] == "register") { echo "<img src=\"images/buttons/register_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/register.png\" border=\"0\">"; } ?></a><?php } ?>
    <?php if($_SESSION['uname'] != "") { ?><a href="?page=logout"><? if($_GET['page'] == "logout") { echo "<img src=\"images/buttons/logout_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/logout.png\" border=\"0\">"; } ?></a><?php } ?>
    <a href="?page=search"><? if($_GET['page'] == "search") { echo "<img src=\"images/buttons/search_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/search.png\" border=\"0\">"; } ?></a>
    <a href="?page=stats"><? if($_GET['page'] == "stats") { echo "<img src=\"images/buttons/statistics_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/statistics.png\" border=\"0\">"; } ?></a>
    <a href="?page=contact"><? if($_GET['page'] == "contact") { echo "<img src=\"images/buttons/contact_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/contact.png\" border=\"0\">"; } ?></a>
    <a href="?page=tutorial"><? if($_GET['page'] == "tutorial") { echo "<img src=\"images/buttons/tutorial_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/tutorial.png\" border=\"0\">"; } ?></a>
    <a href="?page=forums"><? if($_GET['page'] == "forums") { echo "<img src=\"images/buttons/forums_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/forums.png\" border=\"0\">"; } ?></a>
    </div>
    </td></tr>
    </table>
<table width="750" border="0" align="center" cellpadding="10" cellspacing="0">
  <?php
  if($_SESSION['uname'] == "")
  {
  ?>
  <form method="post" action="?page=login" name="login" />
  <tr>
  	<td class="bod-login">
  		<p class="style2"><div align="right">Username: <input type="text" name="uname" /> Password: <input type="password" name="pass" /> <input type="submit" value="Login" /></div></p>
  	</td>
  </tr>
  </form>
  <?php
  } // end login area
  ?>
  <tr>
    <td bgcolor="#FFFFFF" class="main">
    <p><center>
    <?php
    if($_SESSION['ads'] != "1")
    	displayAd("center-rectangle");
    ?></center></p>
      <?php
      switch($_GET['page'])
      {
      	case "home":
      		include("pages/home.php");
      		break;
      	case "":
      		include("pages/home.php");
      		break;
      	case "about":
      		include("pages/about.php");
      		break;
      	case "members":
      		include("pages/members.php");
      		break;
      	case "register":
      		include("pages/register.php");
      		break;
      	case "login":
      		include("pages/login.php");
      		break;
      	case "logout":
      		include("pages/logout.php");
      		break;
      	case "search":
      		include("pages/search.php");
      		break;
      	case "stats":
      		include("pages/stats.php");
      		break;
      	case "contact":
      		include("pages/contact.php");
      		break;
      	case "404":
      		include("pages/404.php");
      		break;
      	case "error":
      		include("pages/error.php");
      		break;
      	case "get":
      		include("pages/get.php");
      		break;
      	case "flag":
      		include("pages/flag.php");
      		break;
      	case "tutorial":
      		include("pages/include.php");
      		break;
      	case "powered":
      		include("pages/powered.php");
      		break;
      }
      ?>
     </td>
  </tr>
</table>
<?php
	$endtime = microtime();
	$endarray = explode(" ", $endtime);
	$endtime = $endarray[1] + $endarray[0];
	$totaltime = $endtime - $starttime; 
	$totaltime = round($totaltime,5);
	$totalqueries = $_SESSION['totalqueries'];
	$sload = loadTest();
?>
<table width="750" border="0" align="center" cellpadding="5" cellspacing="0">
<tr>
	<td class="timebar-left"><div align="left"><a href="?page=powered"><img src="images/mysql_powered.png" border="0"></a> <a href="?page=powered"><img src="images/pwrd_apache.gif" border="0"></a> <a href="?page=powered"><img src="images/icon_mini-xml.png" border="0"></a> <a href="?page=powered"><img src="images/php-power-micro2.png" border="0"></a> <a href="?page=powered"><img src="images/linux_powered.gif" border="0"></a></div></td>
	<td class="timebar-mid" valign="middle"><div align="right" class="style5"><img src="images/loadtime.gif" alt="Load time">: <?php echo "$totaltime"; ?></div></td>
	<td class="timebar-mid"><div align="right" class="style5"><img src="images/queries.gif" alt="Database Queries Made" border="0">: <?php echo "$totalqueries"; ?></div></td>
	<td class="timebar-mid"><div align="right" class="style5"><img src="images/gzip.gif" alt="GZip Compression" align="" border="0">: <?php echo "$gzip_msg"; ?></div></td>
	<td class="timebar-mid"><div aling="right" class="style5"><img src="images/icon_user.gif" border="0">: <?php $online = getOnlineUsers(); echo "$online"; ?></div></td>
	<td class="timebar-right"><div align="right" class="style5"><img src="images/load.gif" alt="Server Load" border="0">: <?php echo "$sload"; ?></div></td>
</tr>
</table>
<?php include('bottom.php'); ?>
</body>
</html>
The URL is http://www.rapidhash.com/v2/
Jonah Bron
DevNet Master
Posts: 2764
Joined: Thu Mar 15, 2007 6:28 pm
Location: Redding, California

Post by Jonah Bron »

Jcart wrote:Any particular reason not to use regular expression? It makes processing a lot simpler
Oh. Regular expressions? Sounds interesting. I'll have to look into that (I was wondering what the Regex forum was for).

Thanks.
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

Okay, I have absolutely no idea what you are on about now. The regular expression matched the images fine when I tried it (see below for my output).

Apologies if I'm missing something obvious here.

Code: Select all

<pre>Array
(
    [0] => Array
        (
            [0] => <img src="images/logo_top.png
            [1] => <img src="images/buttons/home_on.png
            [2] => <img src="images/buttons/about.png
            [3] => <img src="images/buttons/members.png
            [4] => <img src="images/buttons/register.png
            [5] => <img src="images/buttons/search.png
            [6] => <img src="images/buttons/statistics.png
            [7] => <img src="images/buttons/contact.png
            [8] => <img src="images/buttons/tutorial.png
            [9] => <img src="images/buttons/forums.png
            [10] => <img src="images/mysql_powered.png
            [11] => <img src="images/pwrd_apache.gif
            [12] => <img src="images/icon_mini-xml.png
            [13] => <img src="images/php-power-micro2.png
            [14] => <img src="images/linux_powered.gif
            [15] => <img src="images/loadtime.gif
            [16] => <img src="images/queries.gif
            [17] => <img src="images/gzip.gif
            [18] => <img src="images/icon_user.gif
            [19] => <img src="images/load.gif
        )

    [1] => Array
        (
            [0] => images/logo_top.png
            [1] => images/buttons/home_on.png
            [2] => images/buttons/about.png
            [3] => images/buttons/members.png
            [4] => images/buttons/register.png
            [5] => images/buttons/search.png
            [6] => images/buttons/statistics.png
            [7] => images/buttons/contact.png
            [8] => images/buttons/tutorial.png
            [9] => images/buttons/forums.png
            [10] => images/mysql_powered.png
            [11] => images/pwrd_apache.gif
            [12] => images/icon_mini-xml.png
            [13] => images/php-power-micro2.png
            [14] => images/linux_powered.gif
            [15] => images/loadtime.gif
            [16] => images/queries.gif
            [17] => images/gzip.gif
            [18] => images/icon_user.gif
            [19] => images/load.gif
        )

)
</pre>
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

The image tags are malformed. <img src="images/logo_top.png is not valid HTML for an image: you need the closing quote and angle bracket.
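
This also seems to explain the odd URLs seen earlier: print_r() was echoing the raw matches, so the browser parsed each unterminated <img src="... fragment and pulled the following text, up to the next double quote, into the src value. Below is a small sketch, assuming the dump is viewed in a browser, of escaping it so the matches are shown as text. The $matches array here is hand-built sample data.

```php
<?php
// Hand-built sample shaped like preg_match_all() output.
$matches = array(
    array('<img src="images/logo_top.png', '<img src="images/load.gif'),
    array('images/logo_top.png', 'images/load.gif'),
);

// Capture print_r() as a string (second argument true), then escape it
// so "<" and quotes are displayed literally instead of parsed as markup.
$dump = htmlspecialchars(print_r($matches, true));

echo '<pre>', $dump, '</pre>';
```

Viewing the page source instead of the rendered page would show the same unmangled array.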
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto
Contact:

Post by John Cartwright »

Ohhh, I see what you were doing now. The regular expression was not designed to capture the entire tag, only the src attribute. This one will match the entire tag and capture the src as well.

Code: Select all

/<img.*?src="([^"]+)".*?>/im
If all you want to do is grab the entire tag, though, you can simplify the expression to

Code: Select all

/<img[^>]*>/im
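
To compare the two approaches, here is a small sketch run against made-up markup. The shorter pattern is written here with a quantifier after the character class, [^>]*, so that it matches a whole tag. The variable names are arbitrary.

```php
<?php
// Made-up sample markup for illustration.
$html = '<td><img src="images/a.png" border="0"></td>'
      . '<td><img alt="b" src="images/b.gif" /></td>';

// Whole tag in $m[0], src value in $m[1].
preg_match_all('/<img.*?src="([^"]+)".*?>/i', $html, $m);

// Whole tags only, no capture group.
preg_match_all('/<img[^>]*>/i', $html, $tags);

print_r($m[1]);
print_r($tags[0]);
```

Both patterns assume a tag contains no ">" inside an attribute value, which holds for typical markup but is not guaranteed.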
waradmin
Forum Contributor
Posts: 240
Joined: Fri Nov 04, 2005 2:57 pm

Post by waradmin »

Ahhh, very clear now, thank you much! I learn a lot here!