Explode but not really - Looking to pull out tags
Posted: Wed Dec 19, 2007 3:11 pm
by waradmin
I am trying to code a little script that goes through the HTML source of some of my pages, pulls out the images, and displays just the images. Essentially I just want to capture the contents of each <img /> tag, but I am having problems with this.
The dummy junk code I came up with:
Code: Select all
$partA = explode("<img src=", $result);
$count = count($partA);
for($i=0;$i<$count;$i++)
{
$tmp = $partA[$i];
$partB = explode("/>", $tmp);
$img = $partB[0];
$img = str_replace("'", "", $img);
echo "<img src=$img>";
}
It displays the images, however it also ends up echoing text. Is there a better way to do this, i.e. have PHP search a string for, say, <img and capture the value until it reaches the end of an extension like .gif, and then store just that into an array?
Sorry if it's confusing, thanks for any help provided!
-Steve
Posted: Wed Dec 19, 2007 3:15 pm
by John Cartwright
You're better off using a regular expression to capture all the img src contents. Untested.
Code: Select all
preg_match_all('/<img.*?src="([^"]+)/im', $htmlsource, $matches);
echo '<pre>';
print_r($matches);
echo '</pre>';
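For reference, preg_match_all() with its default ordering fills $matches with one array per group: $matches[0] holds the text each match consumed (here a partial tag with no closing quote), and $matches[1] holds the first capture group, which is the src value. A small sketch on a made-up HTML string ($html and its image paths are placeholders, not taken from the thread):

```php
<?php
// Hypothetical sample markup; the paths are placeholders.
$html = '<p><img src="images/a.png" alt="A"> text <img border="0" src="images/b.gif"></p>';

// Same pattern as above: lazily skip to src=" and capture up to the next quote.
preg_match_all('/<img.*?src="([^"]+)/im', $html, $matches);

// $matches[0] is the text each match consumed (a partial tag).
// $matches[1] is the captured src value alone.
foreach ($matches[1] as $src) {
    echo htmlspecialchars($src), "\n";
}
```

Looping over $matches[1] rather than $matches[0] gives just the sources, which can then be placed into freshly written tags.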
Posted: Wed Dec 19, 2007 3:25 pm
by waradmin
Alright cool that makes a lot of sense. Now it is semi working, however it is displaying this after the image tags:
Code: Select all
%20%20%20%20%20%20%20%20%20%20%20%20[5]%20=%3E%20%3Cimg%20style='vertical-align:bottom'%20alt=
So it is like:
Code: Select all
http://www.someserver.com/images/image.png%20%20%20%20%20%20%20%20%20%20%20%20[5]%20=%3E%20%3Cimg%20style='vertical-align:bottom'%20alt=
Any idea why it's doing this?
Posted: Wed Dec 19, 2007 3:31 pm
by Jonah Bron
Hmmm. I've been thinking about a similar process for getting <a> tags, for an index. Maybe this?
Code: Select all
<?php
echo "<html><head><title>Only pictures</title></head><body>\n";
$document = file_get_contents($_SERVER['SCRIPT_FILENAME']);//get current file contents (SCRIPT_FILENAME is the path of the running script)
$document = explode('<img src="', $document);//explode at <img src="
$new_doc = array();//create var for new doc output
for ($i=1;$i<count($document);$i++){//skip first part of $document, because it is before the first <img src="
    $new_doc[$i-1] = $document[$i];//add a new part to new document, with exploded img, without the first part
}
foreach ($new_doc as $sub_doc){//above was so we could just use foreach now. loop through each separate part of new document
    $sub_doc = explode('"', $sub_doc);//explode sub document; note this only changes the loop copy, $new_doc keeps the strings
}
/*now, each part of the new document is meant to be $new_doc[eachpart][0]. below, output all pictures:*/
foreach ($new_doc as $sub_doc){//loop through each doc
    echo $sub_doc[0] ."<br />\n";//output each one (as written, this is the first character of each string)
}
die("</body></html>");//don't show the rest of the page.
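The explode() idea can work if the split results are stored rather than assigned to the foreach loop variable, which is only a copy of each element. A minimal sketch, assuming every tag is written exactly as <img src="..."> with src first and double quotes (the function name and sample string are made up for illustration):

```php
<?php
// Hypothetical helper: collect src values using explode() only.
// Assumes every tag is written exactly as <img src="..."> (src first, double quotes).
function extractSrcWithExplode(string $html): array
{
    $parts = explode('<img src="', $html);
    array_shift($parts); // drop the text before the first <img src="

    $sources = [];
    foreach ($parts as $part) {
        // Everything up to the next double quote is the src value.
        $pieces = explode('"', $part, 2);
        $sources[] = $pieces[0];
    }
    return $sources;
}

$sample = '<div><img src="images/a.png"> and <img src="images/b.gif" border="0"></div>';
print_r(extractSrcWithExplode($sample));
```

Tags with other attributes before src, or with single quotes, would be missed by this approach.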
Posted: Wed Dec 19, 2007 3:48 pm
by John Cartwright
waradmin wrote:Alright cool that makes a lot of sense. Now it is semi working, however it is displaying this after the image tags:
Code: Select all
%20%20%20%20%20%20%20%20%20%20%20%20[5]%20=%3E%20%3Cimg%20style='vertical-align:bottom'%20alt=
So it is like:
Code: Select all
http://www.someserver.com/images/image.png%20%20%20%20%20%20%20%20%20%20%20%20[5]%20=%3E%20%3Cimg%20style='vertical-align:bottom'%20alt=
Any idea why it's doing this?
I'm not exactly sure what you mean. Can you post an example source?
PHPyoungster wrote: Hmmm. I've been thinking about a similar process for getting <a> tags, for an index. Maybe this?
Any particular reason not to use regular expressions? They make processing a lot simpler.

Posted: Wed Dec 19, 2007 3:55 pm
by waradmin
Alright, the second piece of code posted just displays a bunch of h characters on the screen.
As for the code that was being buggy:
Code: Select all
<?php
function getRemoteFile($url)
{
    // get the host name and url path
    $parsedUrl = parse_url($url);
    $host = $parsedUrl['host'];
    if (isset($parsedUrl['path'])) {
        $path = $parsedUrl['path'];
    } else {
        $path = '/';
    }
    if (isset($parsedUrl['query'])) {
        $path .= '?' . $parsedUrl['query'];
    }
    if (isset($parsedUrl['port'])) {
        $port = $parsedUrl['port'];
    } else {
        // most sites use port 80
        $port = 80;
    }
    $timeout = 10;
    $response = '';
    // connect to the remote server (use the parsed port instead of a hard-coded 80)
    $fp = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if (!$fp) {
        echo "Cannot retrieve $url";
    } else {
        // send the necessary headers to get the file
        // "Connection: close" asks the server to end the connection so the read loop below can finish
        fputs($fp, "GET $path HTTP/1.0\r\n" .
            "Host: $host\r\n" .
            "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3\r\n" .
            "Accept: */*\r\n" .
            "Accept-Language: en-us,en;q=0.5\r\n" .
            "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n" .
            "Connection: close\r\n" .
            "Referer: http://$host\r\n\r\n");
        // retrieve the response from the remote server
        while ($line = fread($fp, 4096)) {
            $response .= $line;
        }
        fclose($fp);
        // strip the headers (only if the blank line separating them from the body was found)
        $pos = strpos($response, "\r\n\r\n");
        if ($pos !== false) {
            $response = substr($response, $pos + 4);
        }
    }
    // return the file content
    return $response;
}
$result = getRemoteFile($_GET['path']);
// $matches[0] holds the matched text, $matches[1] the captured src values
preg_match_all("/<img.*?src=\"([^\"]+)/im", $result, $matches);
echo '<pre>';
print_r($matches);
echo '</pre>';
?>
Outputs:
Code: Select all
Array
(
[0] => Array
(
[0] => Array
(
[0] => images/logo_top.png
[1] => images/buttons/home_on.png
[2] => images/buttons/about.png
[3] => images/buttons/members.png
[4] => images/buttons/register.png
[5] => images/buttons/search.png
[6] => images/buttons/statistics.png
[7] => images/buttons/contact.png
[8] => images/buttons/tutorial.png
[9] => images/buttons/forums.png
[10] => images/mysql_powered.png
[11] => images/pwrd_apache.gif
[12] => images/icon_mini-xml.png
[13] => images/php-power-micro2.png
[14] => images/linux_powered.gif
[15] => images/loadtime.gif
[16] => images/queries.gif
[17] => images/gzip.gif
[18] => images/icon_user.gif
[19] => images/load.gif
)
)
Note here is the real output I am getting (screenshot because I can't copy and paste dead images):

(Note the blue question mark boxes are the images that are not displaying; that's how Safari displays dead images.)
And the "image source" for the images that are not displaying is like:
Code: Select all
http://localhost/~steve/images/logo_top.png%20%20%20%20%20%20%20%20%20%20%20%20[1]%20=%3E%20%3Cimg%20src=
Where it should be:
Posted: Wed Dec 19, 2007 4:01 pm
by John Cartwright
It is capturing the image sources just fine. I don't get what you are trying to say. The images are not showing in your output and appear broken because they are relative URLs. If you want to actually display the images you'll need to prepend the domain name to the source.
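A rough sketch of prepending the site address to relative sources. The base URL and the helper name are assumptions for illustration; it only handles plain relative paths and root-relative paths, and ignores ../ segments, ports, and protocol-relative URLs.

```php
<?php
// Hypothetical helper: turn a relative src into an absolute URL.
// $base is assumed to be the directory URL of the scraped page.
function toAbsoluteUrl(string $src, string $base): string
{
    // Already absolute (http:// or https://): leave untouched.
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    // Root-relative path: attach it to scheme and host only.
    if (isset($src[0]) && $src[0] === '/') {
        $parts = parse_url($base);
        return $parts['scheme'] . '://' . $parts['host'] . $src;
    }
    // Plain relative path: append to the base directory.
    return rtrim($base, '/') . '/' . $src;
}

$base = 'http://www.example.com/v2/'; // placeholder address
echo toAbsoluteUrl('images/logo_top.png', $base), "\n";
```

Each captured src could be passed through such a helper before being written into a new img tag.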
Posted: Wed Dec 19, 2007 4:08 pm
by waradmin
What I don't get is why there are 10 images like "http://localhost/~steve/images/logo_top ... img%20src=" while the remainder do not look like that.
Posted: Wed Dec 19, 2007 4:16 pm
by John Cartwright
Can you post a link to, or the source code of, the site you are scraping (not your own code)?
Posted: Wed Dec 19, 2007 4:17 pm
by waradmin
Just to be clear I am doing this on my own sites, here is the code:
Code: Select all
<?php
include('top.php');
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>RapidHash - v2 [-UNFINISHED BETA-]</title>
<link rel="stylesheet" href="style.css" type="text/css" />
<link rel="stylesheet" href="includes/tooltips.css" type="text/css" media="screen" />
</head>
<body topmargin="0" leftmargin="0" rightmargin="0">
<table width="100%" border="0" cellpadding="0" cellspacing="0" align="center">
<tr><td><div align="center"><img src="images/logo_top.png"></div></td></tr>
</table>
<table width="750" border="0" cellspacing="0" cellpadding="0" background="images/buttons/bg.png" align="center">
<tr><td>
<div align="center">
<a href="?page=home"><? if(($_GET['page'] == "") || ($_GET['page'] == "home")) { echo "<img src=\"images/buttons/home_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/home.png\" border=\"0\">"; } ?></a>
<a href="?page=about"><? if($_GET['page'] == "about") { echo "<img src=\"images/buttons/about_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/about.png\" border=\"0\">"; } ?></a>
<a href="?page=members"><? if($_GET['page'] == "members") { echo "<img src=\"images/buttons/members_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/members.png\" border=\"0\">"; } ?></a>
<?php if($_SESSION['uname'] == "") { ?><a href="?page=register"><? if($_GET['page'] == "register") { echo "<img src=\"images/buttons/register_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/register.png\" border=\"0\">"; } ?></a><?php } ?>
<?php if($_SESSION['uname'] != "") { ?><a href="?page=logout"><? if($_GET['page'] == "logout") { echo "<img src=\"images/buttons/logout_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/logout.png\" border=\"0\">"; } ?></a><?php } ?>
<a href="?page=search"><? if($_GET['page'] == "search") { echo "<img src=\"images/buttons/search_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/search.png\" border=\"0\">"; } ?></a>
<a href="?page=stats"><? if($_GET['page'] == "stats") { echo "<img src=\"images/buttons/statistics_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/statistics.png\" border=\"0\">"; } ?></a>
<a href="?page=contact"><? if($_GET['page'] == "contact") { echo "<img src=\"images/buttons/contact_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/contact.png\" border=\"0\">"; } ?></a>
<a href="?page=tutorial"><? if($_GET['page'] == "tutorial") { echo "<img src=\"images/buttons/tutorial_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/tutorial.png\" border=\"0\">"; } ?></a>
<a href="?page=forums"><? if($_GET['page'] == "forums") { echo "<img src=\"images/buttons/forums_on.png\" border=\"0\">"; } else { echo "<img src=\"images/buttons/forums.png\" border=\"0\">"; } ?></a>
</div>
</td></tr>
</table>
<table width="750" border="0" align="center" cellpadding="10" cellspacing="0">
<?php
if($_SESSION['uname'] == "")
{
?>
<form method="post" action="?page=login" name="login" />
<tr>
<td class="bod-login">
<p class="style2"><div align="right">Username: <input type="text" name="uname" /> Password: <input type="password" name="pass" /> <input type="submit" value="Login" /></div></p>
</td>
</tr>
</form>
<?php
} // end login area
?>
<tr>
<td bgcolor="#FFFFFF" class="main">
<p><center>
<?php
if($_SESSION['ads'] != "1")
displayAd("center-rectangle");
?></center></p>
<?php
switch($_GET['page'])
{
case "home":
include("pages/home.php");
break;
case "":
include("pages/home.php");
break;
case "about":
include("pages/about.php");
break;
case "members":
include("pages/members.php");
break;
case "register":
include("pages/register.php");
break;
case "login":
include("pages/login.php");
break;
case "logout":
include("pages/logout.php");
break;
case "search":
include("pages/search.php");
break;
case "stats":
include("pages/stats.php");
break;
case "contact":
include("pages/contact.php");
break;
case "404":
include("pages/404.php");
break;
case "error":
include("pages/error.php");
break;
case "get":
include("pages/get.php");
break;
case "flag":
include("pages/flag.php");
break;
case "tutorial":
include("pages/include.php");
break;
case "powered":
include("pages/powered.php");
break;
}
?>
</td>
</tr>
</table>
<?php
$endtime = microtime();
$endarray = explode(" ", $endtime);
$endtime = $endarray[1] + $endarray[0];
$totaltime = $endtime - $starttime;
$totaltime = round($totaltime,5);
$totalqueries = $_SESSION['totalqueries'];
$sload = loadTest();
?>
<table width="750" border="0" align="center" cellpadding="5" cellspacing="0">
<tr>
<td class="timebar-left"><div align="left"><a href="?page=powered"><img src="images/mysql_powered.png" border="0"></a> <a href="?page=powered"><img src="images/pwrd_apache.gif" border="0"></a> <a href="?page=powered"><img src="images/icon_mini-xml.png" border="0"></a> <a href="?page=powered"><img src="images/php-power-micro2.png" border="0"></a> <a href="?page=powered"><img src="images/linux_powered.gif" border="0"></a></div></td>
<td class="timebar-mid" valign="middle"><div align="right" class="style5"><img src="images/loadtime.gif" alt="Load time">: <?php echo "$totaltime"; ?></div></td>
<td class="timebar-mid"><div align="right" class="style5"><img src="images/queries.gif" alt="Database Queries Made" border="0">: <?php echo "$totalqueries"; ?></div></td>
<td class="timebar-mid"><div align="right" class="style5"><img src="images/gzip.gif" alt="GZip Compression" align="" border="0">: <?php echo "$gzip_msg"; ?></div></td>
<td class="timebar-mid"><div aling="right" class="style5"><img src="images/icon_user.gif" border="0">: <?php $online = getOnlineUsers(); echo "$online"; ?></div></td>
<td class="timebar-right"><div align="right" class="style5"><img src="images/load.gif" alt="Server Load" border="0">: <?php echo "$sload"; ?></div></td>
</tr>
</table>
<?php include('bottom.php'); ?>
</body>
</html>
The URL is
http://www.rapidhash.com/v2/
Posted: Wed Dec 19, 2007 8:01 pm
by Jonah Bron
Jcart wrote: Any particular reason not to use regular expression? It makes processing a lot simpler
Oh. Regular expressions? Sounds interesting. I'll have to look into that (I was wondering what the Regex forum was).
Thanks.
Posted: Wed Dec 19, 2007 8:12 pm
by John Cartwright
Okay, I have absolutely no idea what you are on about now. The regular expression matched the images fine when I tried it (see below for my output).
Apologies if I'm missing something obvious here.
Code: Select all
<pre>Array
(
[0] => Array
(
[0] => <img src="images/logo_top.png
[1] => <img src="images/buttons/home_on.png
[2] => <img src="images/buttons/about.png
[3] => <img src="images/buttons/members.png
[4] => <img src="images/buttons/register.png
[5] => <img src="images/buttons/search.png
[6] => <img src="images/buttons/statistics.png
[7] => <img src="images/buttons/contact.png
[8] => <img src="images/buttons/tutorial.png
[9] => <img src="images/buttons/forums.png
[10] => <img src="images/mysql_powered.png
[11] => <img src="images/pwrd_apache.gif
[12] => <img src="images/icon_mini-xml.png
[13] => <img src="images/php-power-micro2.png
[14] => <img src="images/linux_powered.gif
[15] => <img src="images/loadtime.gif
[16] => <img src="images/queries.gif
[17] => <img src="images/gzip.gif
[18] => <img src="images/icon_user.gif
[19] => <img src="images/load.gif
)
[1] => Array
(
[0] => images/logo_top.png
[1] => images/buttons/home_on.png
[2] => images/buttons/about.png
[3] => images/buttons/members.png
[4] => images/buttons/register.png
[5] => images/buttons/search.png
[6] => images/buttons/statistics.png
[7] => images/buttons/contact.png
[8] => images/buttons/tutorial.png
[9] => images/buttons/forums.png
[10] => images/mysql_powered.png
[11] => images/pwrd_apache.gif
[12] => images/icon_mini-xml.png
[13] => images/php-power-micro2.png
[14] => images/linux_powered.gif
[15] => images/loadtime.gif
[16] => images/queries.gif
[17] => images/gzip.gif
[18] => images/icon_user.gif
[19] => images/load.gif
)
)
</pre>
Posted: Wed Dec 19, 2007 8:19 pm
by Ambush Commander
The images are malformed. <img src="images/logo_top.png is not valid HTML for an image: you need the closing quote and angle bracket.
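This would also likely explain the odd URLs earlier in the thread: when print_r() output is echoed into a page unescaped, the browser treats each <img src="... fragment in $matches[0] as a real tag and reads on to the next double quote, so the following line of the dump ends up inside the src. A minimal sketch of escaping the dump before printing it ($matches here is a hand-written stand-in for the real result):

```php
<?php
// Stand-in for the preg_match_all() result; values are placeholders.
$matches = [
    ['<img src="images/a.png', '<img src="images/b.gif'],
    ['images/a.png', 'images/b.gif'],
];

// print_r() with true as the second argument returns the dump as a string.
$dump = print_r($matches, true);

// Escape <, > and quotes so the browser shows the text instead of parsing it.
$safe = htmlspecialchars($dump, ENT_QUOTES);

echo '<pre>', $safe, '</pre>';
```

With the dump escaped, the partial tags show up as text and no broken image boxes are rendered.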
Posted: Wed Dec 19, 2007 8:28 pm
by John Cartwright
Ohhh I see what you were doing now. The regular expression was not designed to capture the entire tag, only the src attribute. This will capture the entire tag and the src will be captured as well.
If all you want to do is grab the entire tag though, you can simplify the expression to
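The expressions from this post are missing from the page. As a guess at what such patterns could look like (these are not the original author's, and the sample string is made up): the first matches the whole tag while capturing the src, the second matches the whole tag only.

```php
<?php
// Hypothetical sample markup; the paths are placeholders.
$html = '<td><img src="images/a.png" border="0"> <img alt="B" src="images/b.gif"></td>';

// Whole tag in $withSrc[0], src value in $withSrc[1].
preg_match_all('/<img[^>]*?src="([^"]+)"[^>]*>/i', $html, $withSrc);

// Whole tag only, no capture group.
preg_match_all('/<img[^>]*>/i', $html, $tagsOnly);

print_r($withSrc[0]);
print_r($withSrc[1]);
print_r($tagsOnly[0]);
```

Both sketches assume double-quoted attributes and no > character inside attribute values.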
Posted: Wed Dec 19, 2007 9:51 pm
by waradmin
Ahhh, very clear now, thank you much! I learn a lot here!