Grabbing text between tags

Posted: Fri May 21, 2004 5:51 am
by JayBird
I have no idea how I would go about doing this.

Say I have some made-up tags like [stats]lots and lots of random stuff can appear here[/stats]

How can I grab everything between the [stats] and [/stats] tags so I end up with a variable containing "lots and lots of random stuff can appear here"?

Thanks

Mark

Posted: Fri May 21, 2004 5:57 am
by JayBird
This seems to work, but is it the best way?

Code: Select all

$text="[stats]text I want[/stats]"; 

function strip($f1,$f2,$text,&$pos){ 
	if(!is_integer($pos)){
		$pos=false;
		return false;
	} 

	$pos1=strpos($text,$f1,$pos); 
	if(!is_integer($pos1)) {
		$pos=false;
		return false;
	} 

	$pos1+=strlen($f1); 
	$pos2=strpos($text,$f2,$pos1); 
	if(!is_integer($pos2)) {
		$pos=false;
		return false;
	} 

	$res=substr($text,$pos1,$pos2-$pos1); 
	$pos=$pos2+strlen($f2); 

return $res; 
} 

$pos=0; // $pos is passed by reference, so it has to be a real variable
$textiwant=strip("[stats]","[/stats]",$text,$pos); 

echo $textiwant;
Mark

Posted: Fri May 21, 2004 6:10 am
by redmonkey
Not tested but....

Code: Select all

<?php

$string = '[stats]lots and lots of random stuff can appear here[/stats] with some stuff [stats]lots more random stuff can appear here[/stats]';

if (preg_match_all('/\[stats](.*?)\[\/stats]/', $string, $matches, PREG_SET_ORDER))
{
  foreach($matches as $match)
  {
    echo $match[1] . "\x0a";
  }
}

?>
results in....

Code: Select all

lots and lots of random stuff can appear here
lots more random stuff can appear here

Posted: Fri May 21, 2004 6:12 am
by vigge89
what about using RegExp?
Here's a simple one (taken from my site):
"#\[stats\](.*?)\[/stats\]#is"
matches everything between [stats] and [/stats]
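
To end up with a single variable, as the original question asked, a pattern like the one above can be handed to preg_match. This is only a minimal sketch, assuming a single [stats] block is of interest; the variable names $text, $match and $stats are made up for illustration.

```php
<?php
// Sample input; the tagged content spans two lines on purpose
$text = "intro [stats]lots and lots\nof random stuff[/stats] outro";

// 's' lets . match newlines, 'i' makes the tag names case-insensitive
if (preg_match('#\[stats\](.*?)\[/stats\]#is', $text, $match)) {
    $stats = $match[1];   // text captured by the parentheses
} else {
    $stats = null;        // no [stats] block found
}

echo $stats;
```

Using # as the delimiter means the / in [/stats] does not need escaping.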

Posted: Fri May 21, 2004 6:15 am
by JayBird
ah yes looks good! thanks

Okay, next problem: would it be at all feasible to grab the text in red in the code below?

The only thing that will be constant is the fact that it is the 7th table in the file.

I don't know much about this area of PHP, but is there some way I could count up to the 7th <table> tag, then grab everything from there to the </table> tag?
<html>
<head>
<title>General Statistics</title>
<link rel="stylesheet" href="Report.css" type="text/css">
</head>
<body bgcolor="#FFFFFF" text="#000000" leftmargin="0" topmargin="0" rightmargin="0" bottommargin="0" marginwidth="0" marginheight="0">
<center>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr valign="top">
<td rowspan="2" width="65"><img src="logo.gif" width="65" height="52"></td>
<td align="center"><table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td background="head_back.gif"><img src="TreeBlank.gif" width="45" height="45"></td>
<td width="100%" align="center" valign="middle" background="head_back.gif" nowrap><span class="ReportTitle">Report
for ipd-Glamorgan: </span> <span class="CategoryTitle">General
Statistics</span> </td>
<td width="112"><a href="http://www.weblogexpert.com/" target="_blank"><img src="powered.gif" width="112" height="45" border="0"></a></td>
</tr>
</table></td>
</tr>
<tr>
<td><table width="100%" border="0" cellspacing="0" cellpadding="0" height="7">
<tr>
<td background="top_line.gif"></td>
</tr>
</table></td>
</tr>
</table>
<table width="90%" border=0 cellpadding=1 cellspacing=1>
<tr>
<td valign="top" align="left">Time range: 13/05/2004 11:27:57 - 19/05/2004
15:24:24</td>
<td valign="top" align="right">Generated on Wed Apr 21, 2004 - 10:39:35</td>
</tr>
</table>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td height="7"></td>
</tr>
<tr>
<td height="7" background="top_line.gif"></td>
</tr>
</table>
<br>
<a name="Summary"></a>
<table cellpadding="0" border="0" cellspacing="0" width="90%">
<tr>
<td width="10"><img src="section_left.gif" width="10" height="20" border="0"></td>
<td class="SectionTitle" nowrap>Summary</td>
<td width="10"><img src="section_right.gif" width="10" height="20" border="0"></td>
</tr>
</table>
<p></p>
<span class="TableTitle">Summary</span><br>
<table border=0 cellspacing=0 cellpadding=0 height=6>
<tr>
<td></td>
</tr>
</table>
<table border=0 bgcolor="#000000" cellspacing=0 cellpadding=0 width="90%">
<tr>
<td><table border=0 cellspacing=1 cellpadding=2 width="100%">
<tr class="TableSolidRow">
<td colspan="2" class="TableCell">Hits</td>
</tr>
<tr class="TableRow1">
<td width="100%" class="TableCell">Total Hits</td>
<td width="0%" class="TableCell">344</td>
</tr>
<tr class="TableRow2">
<td class="TableCell">Average Hits per Day</td>
<td class="TableCell">49</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Average Hits per Visitor</td>
<td class="TableCell">43.00</td>
</tr>
<tr class="TableRow2">
<td class="TableCell">Cached Requests</td>
<td class="TableCell">14</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Failed Requests</td>
<td class="TableCell">0</td>
</tr>
<tr class="TableSolidRow">
<td colspan="2" class="TableCell">Page Views</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Total Page Views</td>
<td class="TableCell">13</td>
</tr>
<tr class="TableRow2">
<td class="TableCell">Average Page Views per Day</td>
<td class="TableCell">1</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Average Page Views per Visitor</td>
<td class="TableCell">1.63</td>
</tr>
<tr class="TableSolidRow">
<td colspan="2" class="TableCell">Visitors</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Total Visitors</td>
<td class="TableCell">8</td>
</tr>
<tr class="TableRow2">
<td class="TableCell">Average Visitors per Day</td>
<td class="TableCell">1</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Total Unique IPs</td>
<td class="TableCell">6</td>
</tr>
<tr class="TableSolidRow">
<td colspan="2" class="TableCell">Bandwidth</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Total Bandwidth</td>
<td class="TableCell">678.71&nbsp;KB</td>
</tr>
<tr class="TableRow2">
<td class="TableCell">Average Bandwidth per Day</td>
<td class="TableCell">96.96&nbsp;KB</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Average Bandwidth per Hit</td>
<td class="TableCell">1.97&nbsp;KB</td>
</tr>
<tr class="TableRow2">
<td class="TableCell">Average Bandwidth per Visitor</td>
<td class="TableCell">84.84&nbsp;KB</td>
</tr>
</table>
</td>
</tr>
</table>
<p> <br>
<p>&nbsp</p>
</center>
</body>
</html>
Mark
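
The "count up to the Nth <table> tag" idea in the post above can be sketched with plain string functions. This is a rough illustration, not a tested solution for the report file: nthTable() is a hypothetical helper, it counts opening tags in document order (nested tables included), and it stops at the first </table> it meets, so it only returns a complete table when the Nth table has no tables nested inside it.

```php
<?php
// Hypothetical helper: return the markup of the Nth <table> (1-based),
// counted by opening tag, or false if there are fewer than $n tables.
function nthTable($html, $n)
{
    $pos = 0;
    $start = false;
    for ($i = 0; $i < $n; $i++) {
        $start = stripos($html, '<table', $pos);
        if ($start === false) {
            return false;               // ran out of tables
        }
        $pos = $start + strlen('<table');
    }
    // First closing tag after the Nth opening tag
    $end = stripos($html, '</table>', $start);
    if ($end === false) {
        return false;
    }
    return substr($html, $start, $end - $start + strlen('</table>'));
}

// Tiny stand-in document with two tables
$html = '<table id="a"><tr><td>one</td></tr></table><p>x</p>'
      . '<table id="b"><tr><td>two</td></tr></table>';

$second = nthTable($html, 2);
$third  = nthTable($html, 3);
echo $second;
```

For markup with nested tables, a real HTML parser is likely a safer choice than string searching.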

Posted: Fri May 21, 2004 6:24 am
by leenoble_uk
Coincidentally I've been working on something similar while I'm procrastinating.
I'm extracting the cricket score from the Channel 4 website. There are some nasty preg expressions in there and I need to redo the whole thing from scratch, but it seems pertinent.

Code: Select all

<?php
$url = "http://www.channel4.com/sport/cricket/latest_score.html";
$teamNames = array(array("NZ","ENG"),array("New Zealand","England"));

$page = file_get_contents($url);
preg_match("/<table width=134 [^>]+>\W*<tr>\W*<td colspan=2><p align=\"center\">([^<>]+)<\/p><\/td>/",$page,$bits);
preg_match("/<tr>\W*<td><p><b>([^><]+)<\/b><\/p><\/td>\W*<td align=right>\W*<p>([^<>]+)\W<\/p>\W*<\/td>\W*<\/tr>\W*<tr>\W*<td><p><b>([^><]+)<\/b><\/p><\/td>\W*<td align=right>\W*<p>([^<>]+)\W<\/p>\W*<\/td>\W*<\/tr>/",$page,$morebits);

$match = $bits[1];
$match = explode(",",str_replace($teamNames[0], $teamNames[1], $match));
$ground = $match[2];
$match = $match[0];
$news = isset($bits[2])?"<br>".$bits[2]:"";
//echo $currentInningsScore;

// The digit class [0-9] used below is an assumed reconstruction:
// the forum software mangled the original character class.
$firstToBat = $morebits[1];
$firstToBat = str_replace($teamNames[0], $teamNames[1], $firstToBat);
$firstInningsScore = $morebits[2];
preg_match("/^([0-9]+)-([0-9]+)/",$morebits[2],$scoreText);
$firstInningsText = $scoreText[1]." - ".$scoreText[2];
$firstInningsScore = preg_replace("/[0-9]/","<img src=\"images/\\0.gif\" class=\"score\">",$firstInningsScore);
$firstInningsScore = preg_replace("/-/","<img src=\"images/hyphen.gif\" class=\"scoreboard\">",$firstInningsScore);

$secondToBat = $morebits[3];
$secondToBat = str_replace($teamNames[0], $teamNames[1], $secondToBat);
$secondInningsScore = $morebits[4];
preg_match("/^([0-9]+)-([0-9]+)/",$morebits[4],$scoreText2);
$secondInningsText = $scoreText2[1]." - ".$scoreText2[2];
$secondInningsScore = preg_match("/[0-9]+-[0-9]+/", $morebits[4])?$morebits[4]:"0-0";
$secondInningsScore = preg_replace("/[0-9]/","<img src=\"images/\\0.gif\" class=\"score\">",$secondInningsScore);
$secondInningsScore = preg_replace("/-/","<img src=\"images/hyphen.gif\" class=\"scoreboard\">",$secondInningsScore);
echo <<<PAGE
<html>
<head>
<title>Cricket</title>
<meta http-equiv=refresh content=30>
<meta http-equiv=expires content=0>
<meta http-equiv=pragma content=no-cache>
<script language="javascript">
<!--
function setTitle() {
	var s;
	
			s = '$firstToBat $firstInningsText / $secondToBat $secondInningsText';
				
	window.document.title = s;
}

//-->
</script>
<style>
body {
	BACKGROUND-COLOR: #222222;
	COLOR: white;
	font-family:'Lucida Grande', Arial, Helvetica, sans-serif;
	margin:0px;
}
.title {
	font-weight:bold;
}
.inbat {
	font-size: 40px;
	font-weight: bold;
	line-height: 45px;
	margin-bottom: 5px;
	margin-top: 10px;
}
.score {
	border-width: 1px;
	border-style: solid;
	border-color: #111111 #444444 #777777 #222222;
	margin: 5px 6px;
}
.scoreboard {
	margin: 5px 6px;
}
.currentInnings {
}
.top {
	padding-top:40px;
	background-color: #222222;
	padding-bottom:7px;
	border-bottom: 1px dotted white;
	margin: 0px;
}
.bottom {
	margin: 0px;
	font-weight: normal;
	font-size: 0.7em;
	padding:7px 10px 40px 10px;
	background-color: #222222;
	border-top: 1px dotted white;
}
.middle {
	padding-top: 10px;
	padding-bottom: 10px;
	background-color: black;
	margin: 0px;
}
</style>
</head>
<body onload="setTitle();">
<div align="center" class="title"><p class="top">{$match}$news</p><div class="middle"><p class="inbat">$firstToBat</p><div class="currentInnings">$firstInningsScore</div>
<p class="inbat">$secondToBat</p><div class="currentInnings">$secondInningsScore</div></div>
<p class="bottom">This page is put together by a bank of intelligent pigeons pecking information into a computer keyboard every 30 seconds as it is relayed by some of the fastest racing birds of their type straight from the $ground cricket ground purely for the benefit of James Webb</p>
</div>
<!-- CHANNEL 4 STRUCTURE

		<table width=134 cellspacing=0 cellpadding=0 border=0>
		<tr>
			<td colspan=2><p align="center">ENG v NZ, 1st Test, Lord's, 20 May 16:36</p></td>
		</tr>
		
			<tr>
				<td colspan=2><p align="center"><font color="red"><b>News Alert: drinks</b></font></p></td>
			</tr>
		
		<tr>
			<td><p><b>NZ</b></p></td>
			<td align=right>
			
				<p>205-4&nbsp;
				
				</p>
			
			</td>
		</tr>
		<tr>
			<td><p><b>ENG</b></p></td>
			<td align=right>
			
				<p>&nbsp;</p>
			
			</td>
		</tr>
		</table>
		<table width=134 cellspacing=0 cellpadding=0 border=0>
		<tr>
	 		<td nowrap><p>MH Richardson&nbsp;</p></td>
			<td align=right><p>68&nbsp;</p></td>
		</tr>
		<tr>
	 		<td nowrap><p>JDP Oram*</p></td>
			<td align=right><p>17&nbsp;</p></td>
		</tr>
		<tr>
			<td colspan=2><p>Bowler: SP  Jones</p></td>
		</tr>
		</table>
		
		$page
-->

</body>
</html>
PAGE;
?>
While I'm here, can someone above explain how the expression

Code: Select all

/\[stats](.*?)\[\/stats]/
works step by step? I was having the same issue extracting between tags, and I thought that the .*? would match everything it came across INCLUDING the following [/stats] tag.
This is why in my case I've had to strip it down to the last remaining pair and match [^<]+ instead.

Posted: Fri May 21, 2004 6:26 am
by JayBird
Jesus christ man!!!

8O 8O 8O 8O 8O 8O 8O 8O 8O 8O

Posted: Fri May 21, 2004 6:35 am
by redmonkey
Not tested...

Code: Select all

<?php

$string = <<<EOF
<html>
<head>
<title>General Statistics</title>
<link rel="stylesheet" href="Report.css" type="text/css">
</head>
<body bgcolor="#FFFFFF" text="#000000" leftmargin="0" topmargin="0" rightmargin="0" bottommargin="0" marginwidth="0" marginheight="0">
<center>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr valign="top">
<td rowspan="2" width="65"><img src="logo.gif" width="65" height="52"></td>
<td align="center"><table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td background="head_back.gif"><img src="TreeBlank.gif" width="45" height="45"></td>
<td width="100%" align="center" valign="middle" background="head_back.gif" nowrap><span class="ReportTitle">Report
for ipd-Glamorgan: </span> <span class="CategoryTitle">General
Statistics</span> </td>
<td width="112"><a href="http://www.weblogexpert.com/" target="_blank"><img src="powered.gif" width="112" height="45" border="0"></a></td>
</tr>
</table></td>
</tr>
<tr>
<td><table width="100%" border="0" cellspacing="0" cellpadding="0" height="7">
<tr>
<td background="top_line.gif"></td>
</tr>
</table></td>
</tr>
</table>
<table width="90%" border=0 cellpadding=1 cellspacing=1>
<tr>
<td valign="top" align="left">Time range: 13/05/2004 11:27:57 - 19/05/2004
15:24:24</td>
<td valign="top" align="right">Generated on Wed Apr 21, 2004 - 10:39:35</td>
</tr>
</table>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td height="7"></td>
</tr>
<tr>
<td height="7" background="top_line.gif"></td>
</tr>
</table>
<br>
<a name="Summary"></a>
<table cellpadding="0" border="0" cellspacing="0" width="90%">
<tr>
<td width="10"><img src="section_left.gif" width="10" height="20" border="0"></td>
<td class="SectionTitle" nowrap>Summary</td>
<td width="10"><img src="section_right.gif" width="10" height="20" border="0"></td>
</tr>
</table>
<p></p>
<span class="TableTitle">Summary</span><br>
<table border=0 cellspacing=0 cellpadding=0 height=6>
<tr>
<td></td>
</tr>
</table>
<table border=0 bgcolor="#000000" cellspacing=0 cellpadding=0 width="90%">
<tr>
<td><table border=0 cellspacing=1 cellpadding=2 width="100%">
<tr class="TableSolidRow">
<td colspan="2" class="TableCell">Hits</td>
</tr>
<tr class="TableRow1">
<td width="100%" class="TableCell">Total Hits</td>
<td width="0%" class="TableCell">344</td>
</tr>
<tr class="TableRow2">
<td class="TableCell">Average Hits per Day</td>
<td class="TableCell">49</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Average Hits per Visitor</td>
<td class="TableCell">43.00</td>
</tr>
<tr class="TableRow2">
<td class="TableCell">Cached Requests</td>
<td class="TableCell">14</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Failed Requests</td>
<td class="TableCell">0</td>
</tr>
<tr class="TableSolidRow">
<td colspan="2" class="TableCell">Page Views</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Total Page Views</td>
<td class="TableCell">13</td>
</tr>
<tr class="TableRow2">
<td class="TableCell">Average Page Views per Day</td>
<td class="TableCell">1</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Average Page Views per Visitor</td>
<td class="TableCell">1.63</td>
</tr>
<tr class="TableSolidRow">
<td colspan="2" class="TableCell">Visitors</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Total Visitors</td>
<td class="TableCell">8</td>
</tr>
<tr class="TableRow2">
<td class="TableCell">Average Visitors per Day</td>
<td class="TableCell">1</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Total Unique IPs</td>
<td class="TableCell">6</td>
</tr>
<tr class="TableSolidRow">
<td colspan="2" class="TableCell">Bandwidth</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Total Bandwidth</td>
<td class="TableCell">678.71 KB</td>
</tr>
<tr class="TableRow2">
<td class="TableCell">Average Bandwidth per Day</td>
<td class="TableCell">96.96 KB</td>
</tr>
<tr class="TableRow1">
<td class="TableCell">Average Bandwidth per Hit</td>
<td class="TableCell">1.97 KB</td>
</tr>
<tr class="TableRow2">
<td class="TableCell">Average Bandwidth per Visitor</td>
<td class="TableCell">84.84 KB</td>
</tr>
</table></td>
</tr>
</table>
<p> <br>
<p>&nbsp</p>
</center>
</body>
</html>
EOF;


if (preg_match('/^\s*?<td>(<table border=0 cellspacing=1.*?^\s*?<\/table>)<\/td>/ims', $string, $matches))
{
    echo $matches[1] . "\x0a";
}

?>

Posted: Fri May 21, 2004 6:49 am
by redmonkey
leenoble_uk,
The regex....

Code: Select all

/\[stats](.*?)\[\/stats]/
Can be broken down as follows....

The leading and trailing slash are the start and end delimiters.

The next part, \[stats], looks for the literal string [stats]. The leading backslash is required because the opening square bracket is a special character within regex.

The (.*?) will match everything, but as it is followed by \[\/stats] it will only match everything up to that point. It is worth noting that *? is non-greedy, i.e. it will only match everything up to the next occurrence of [/stats], whereas * is greedy and would match everything up to the last occurrence of [/stats].
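
The greedy versus non-greedy difference described above is easiest to see side by side. A small sketch, using a made-up sample string with two [stats] blocks:

```php
<?php
$string = '[stats]first[/stats] middle [stats]second[/stats]';

// Non-greedy: stops at the next [/stats]
preg_match('/\[stats](.*?)\[\/stats]/', $string, $lazy);

// Greedy: runs on to the last [/stats] in the string
preg_match('/\[stats](.*)\[\/stats]/', $string, $greedy);

echo $lazy[1] . "\n";    // first
echo $greedy[1] . "\n";  // first[/stats] middle [stats]second
```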

Posted: Fri May 21, 2004 6:55 am
by leenoble_uk
Cool, wish I knew that before I started.
Like I said I'll probably re-write it all at some point as it's only partially finished anyway.
I want to extract all the figures and use them to create other figures like target to win and last wicket and create a more full scorecard.
I should really try and find an xml doc with it in but couldn't in the few minutes I looked around yesterday.
Not helped at the moment by C4's page being buggered and not displaying England's score.

Posted: Fri May 21, 2004 8:19 am
by JayBird
Redmonkey, that worked a right treat. Can you explain how it worked!?

Thanks

Mark

Posted: Fri May 21, 2004 9:15 am
by redmonkey
Explanations have never been my strong point but I can try....

As I'm sure you are aware, regex is all about pattern matching so essentially all I did was have a quick look to try and find/spot the unique start and end patterns within the string. The line which contains the beginning of your text is/was....

Code: Select all

<td><table border=0 cellspacing=1 cellpadding=2 width="100%">
I had a quick glance through the entire code and noted only one other line which started with <td><table, which was...

Code: Select all

<td><table width="100%" border="0" cellspacing="0" cellpadding="0" height="7">
Although it starts the same, the rest is completely different, so this was my starting point....

Code: Select all

preg_match('/^\s*?<td>(<table border=0 cellspacing=1
Breaking this starting point down is as follows....

The ^ character anchors the match to the start of a line (this requires the 'm' (multiline) modifier).

\s means any whitespace character and *? means zero or more occurrences of it, matched non-greedily. I guessed/assumed that there was probably some indentation formatting of the code prior to posting on the forum, so the \s*? would take care of this.

<td>(<table border=0 cellspacing=1 is the unique start string (the opening parenthesis is the start of the subpattern). If I had looked closer I would have realized that the '<table border=0 cellspacing=1' portion could have been shortened to just '<table b', as this would still be unique, but as I only had a quick look I decided not to take the chance that I had missed some other similar line.

The next part....

Code: Select all

.*?
.... is the match all syntax (this has been extended to match everything including newlines by way of the 's' (dot all) modifier)

The last part....

Code: Select all

^\s*?<\/table>)<\/td>
Similar to the first part: again, the ^ signifies the start of a line, and I have again guessed/assumed about the indentation. The <\/table>)<\/td> was again unique (the closing parenthesis defines the end of the subpattern).

Finally the modifiers... ims...

i means that the pattern search will be case insensitive (probably not required in this case, but it means that if for some reason the code was reformatted with uppercase tags the regex would still work).

m (multiline mode) means that ^ and $ will match at the start and end of each line within the string respectively (the default behaviour is that ^ and $ match only at the start and end of the entire string).

s (dot all) by using this modifier the . character will match everything including newlines (without it newlines are excluded from the match).
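
The effect of each modifier can be checked in isolation. The following is a small sketch on a made-up three-line string, not the report HTML; the variable names are illustrative only.

```php
<?php
$text = "first line\n  <b>bold\ntext</b>\nlast line";

// m: without it ^ only matches at the very start of the string
$withoutM = preg_match('/^\s*<b>/', $text);    // 0
$withM    = preg_match('/^\s*<b>/m', $text);   // 1

// s: without it . stops at a newline, so the closing tag is never reached
$withoutS = preg_match('/<b>(.*)<\/b>/', $text);         // 0
$withS    = preg_match('/<b>(.*)<\/b>/s', $text, $m);    // 1

// i: uppercase tag in the pattern still matches the lowercase tag
$withI    = preg_match('/<B>/i', $text);       // 1

echo "$withoutM $withM $withoutS $withS $withI\n";
```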

Some points to note: as it was thrown together quickly, the regex was/is longer than it needs to be. As previously mentioned, the start pattern could have been shortened, as could the .*? to .*, since in this case the match would end in the same place whether greedy or non-greedy. So the regex could be....

Code: Select all

if (preg_match('/^\s*?<td>(<table b.*^\s*?<\/table>)<\/td>/ims', $string, $matches))
and still give the same results.

I hope some of that makes sense?

Posted: Fri May 21, 2004 9:18 am
by JayBird
Redmonkey, thanks a lot for that explanation. I will digest it when I have a minute.

Great work. My regex skills have never been brilliant

Image

Mark

Posted: Fri May 21, 2004 10:06 am
by redmonkey
LOL at that gif, thanks Bech100.

One thing that makes learning regex so difficult is that it is a lot harder to read regex patterns than it is to write them.

It does help if you can break the pattern down into smaller sections but still, that can be a job in itself at times, especially if you are still in the early stages of learning.

Posted: Fri May 21, 2004 10:37 am
by JayBird
Me again,

a little problem.

When I try to get the HTML by reading a file using file_get_contents, the script returns no matches!?

Code: Select all

// This finds and returns the General Statistics
function generalStatistics ()
{

	//global $_SESSION['company'];

	//$company = $_SESSION['company'];

	$company = "eurochem";

	$path = $_SERVER['DOCUMENT_ROOT']."/client_stats/$company/General_Statistics.htm";

	$string = file_get_contents($path);

	//echo html_entity_decode($string);

	if (preg_match('/^\s*?<td>(<table border=0 cellspacing=1.*?^\s*?<\/table>)<\/td>/ims', $string, $matches)) 
	{ 
	    echo $matches[1] . "\x0a"; 
	} else {
		echo "no matches";
	}
}

generalStatistics();
Mark
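
One thing worth checking here, offered as a guess rather than a diagnosis: in the HTML as first posted, the closing </table> and </td> of the statistics table sit on separate lines, whereas the pattern expects </td> to follow </table> immediately. A sketch of a slightly relaxed pattern that allows whitespace between the two tags, run against a small stand-in string (not the real file):

```php
<?php
// Stand-in for the report file, with </table> and </td> on separate lines
$string = "<tr>\n<td><table border=0 cellspacing=1 cellpadding=2>\n"
        . "<tr><td>Total Hits</td><td>344</td></tr>\n</table>\n</td>\n</tr>";

// Pattern from the thread: </td> must directly follow </table>
$strict  = '/^\s*?<td>(<table border=0 cellspacing=1.*?^\s*?<\/table>)<\/td>/ims';

// Assumed variation: \s* permits a line break or indentation in between
$relaxed = '/^\s*?<td>(<table border=0 cellspacing=1.*?^\s*?<\/table>)\s*<\/td>/ims';

$strictHit  = preg_match($strict, $string);
$relaxedHit = preg_match($relaxed, $string, $matches);

if ($relaxedHit) {
    echo $matches[1] . "\x0a";
}
```

If the relaxed pattern matches the real file where the original does not, the line break between the two closing tags is the likely culprit.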