
find in file_get_contents

Posted: Wed Feb 16, 2005 5:22 pm
by bimo
Does 'nobr' or 'o' have any special meaning in regular expressions? I'm trying to use preg_match_all to pull out chunks that look like

Code: Select all

<p class=g>(<a href=http://...</font></nobr>
But every time it gets to the 'o' in nobr, the match stops. (I'm building the pattern with a regular expression tool called Regex Coach, which shows me what a given pattern will match in a given string.)
Here's my pattern so far:

Code: Select all

<p class=g>(<a href=http://[a-z1-9./?_=()&]*) onmousedown=[a-z1-9"' (),]*(>[a-z1-9./?_=()& -á-ú]*)


When I use it on a page like:
ousedown="return clk(this,'res',29)">iPodGeneration : Le podcasting est partout</a><font size=-1> - [ <a href=http://translate.google.com/translate?h ... D%26sa%3DN class=fl>Translate this page</a> ]</font><br><font size=-1><b>...</b> 27] Bonjour, Il ya des podcast francophones (au delà de la simple diffusion de fichiers<br>
musicaux): http://blog.saint-elie.com http://www. Voir <b>...</b>
<br><font color=#008000>www.ipodgeneration.com/fr/actu/805/ - 29k - </font><nobr> <a class=fl href="http://64.233.161.104/search?q=cache:hU ... >Cached</a> - <a class=fl href="/search?hl=en&lr=&q=related:www.ipodgeneration.com/fr/actu/805/">Si ... obr></font> <p class=g><a href=http://www.mesblogs.com/syndication.php3?id_syndic=114 onmousedown="return clk(this,'res',28)">mesblogs.com - Articles de Le blog à Ollie</a><font size=-1> - [ <a href=http://translate.google.com/translate?h ... D%26sa%3DN class=fl>Translate this page</a> ]</font><br><font size=-1><b>...</b> 4 février • Voeux chinois 2005 - 4 février • Interview - 4 février • WP 1.5 Gamma -<br>
4 février • Gouranga - 4 février • <b>Podcasteur</b>#7 - 3 février <b>...</b>
<br><font color=#008000>www.mesblogs.com/syndication.php3?id_syndic=114 - 41k - </font><nobr> <a class=fl href="http://64.233.161.104/search?q=cache:sE ... >Cached</a> - <a class=fl href="/search?hl=en&lr=&q=related:www.mesblogs.com/syndication.php3%3Fid_ ... obr></font>

<p class=g><a href=http://www.ipodgeneration.com/fr/actu/805/ onmousedown="return clk(this,'res',29)">iPodGeneration : Le podcasting est

It matches only the red part, and I want it to go to the </nobr>.

Does anyone know what the problem is?

Thanks

Posted: Wed Feb 16, 2005 5:37 pm
by feyd
Can you tell us which modifiers you are using with it?

The pattern you posted will not match the example you want to match. Modifiers inside a pattern only work under very specific circumstances, and your expression does not meet any of them.
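
One concrete reason the posted pattern cannot match the sample, offered as an illustration rather than as what feyd necessarily had in mind: the character class uses `1-9`, which leaves out `0`, and the sample URL ends in `/805/`. The sketch below isolates just the URL part of the pattern; the variable names are made up for the example.

```php
<?php
// URL taken from the sample page in the first post.
$url = 'http://www.ipodgeneration.com/fr/actu/805/';

// Class copied from the posted pattern: digits are limited to 1-9.
preg_match('#^http://[a-z1-9./?_=()&]*#i', $url, $narrow);
// Same class with 0-9 instead (an assumed fix, not from the thread).
preg_match('#^http://[a-z0-9./?_=()&]*#i', $url, $wide);

echo $narrow[0], "\n"; // http://www.ipodgeneration.com/fr/actu/8
echo $wide[0], "\n";   // http://www.ipodgeneration.com/fr/actu/805/
```

With the narrow class, the URL group stops at the `0`, so the ` onmousedown=` that the pattern expects next can never line up for that link.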

Posted: Wed Feb 16, 2005 6:34 pm
by bimo
I don't think that I'm using any modifiers. If you mean things like PREG_PATTERN_ORDER then I'm not using any.

The expression that I showed above is not formatted for PHP yet. Had I put it into my PHP script, I would have done something like:

Code: Select all

$pattern = addslashes('#<p class=g>(<a href=http://[a-z1-9./?_=()&]*) onmousedown=[a-z1-9"' (),]*(>[a-z1-9./?_=()& -á-ú]*)#i');
(I want the expressions in the '()'s to be put into the array as two separate elements. I'm not positive that you can pull multiple sub-patterns out of a target and get a 2-D array, but I could have sworn that I read that you could.)
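
For reference, preg_match_all does return a two-dimensional array when the pattern contains capture groups. A minimal sketch, assuming a simplified pattern and a made-up two-link input (neither is the pattern or page from this thread):

```php
<?php
// Hypothetical sample input, not taken from the page above.
$html = '<p class=g><a href=http://example.com/a onmousedown="x">First</a>'
      . '<p class=g><a href=http://example.com/b onmousedown="y">Second</a>';

// Two capture groups: (1) the URL, (2) the link text.
$pattern = '#<p class=g><a href=(http://[^ >]+)[^>]*>([^<]*)</a>#i';

// PREG_SET_ORDER groups the results per match instead of per capture group.
$count = preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[0] is the whole match, $m[1] the URL, $m[2] the link text.
    echo $m[1], ' => ', $m[2], "\n";
}
```

Without the fourth argument the default is PREG_PATTERN_ORDER, where `$matches[1]` holds every URL and `$matches[2]` every link text.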

I am writing this pattern because right now I am using four different ones (going from including all links to selecting links by weeding out the ones that I don't want). Yesterday I did some reading and found out that the first pattern of the four was way too greedy, so now I'm going back to the beginning and writing it so that it is more "picky" from the start.
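
To illustrate what "too greedy" means here, a small sketch comparing `.*` with the lazy `.*?`. The input string is invented for the example, and the pattern is a simplified stand-in for the first of the four patterns:

```php
<?php
// Hypothetical one-line input with two links.
$html = '<a href="http://a.example/">one</a> text <a href="http://b.example/">two</a>';

// Greedy: .* runs to the last </a> on the line, so both links come back as one match.
preg_match_all('#<a .*</a>#i', $html, $greedy);
// Lazy: .*? stops at the first </a> it can reach, giving one match per link.
preg_match_all('#<a .*?</a>#i', $html, $lazy);

echo count($greedy[0]), "\n"; // 1
echo count($lazy[0]), "\n";   // 2
```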

Here's the code I'm changing:

Code: Select all

<form method="get" action="pod_search5.php" name="pod_search">
	<input type="text" name="terms" id="terms" />
	<input type="hidden" name="target" id="target" value="http://www.google.com/search" />
	<input type="submit" value="find mph-cast" />
</form>

<?php
$search_terms = $_GET['terms'];
$target_engine = $_GET['target'];
$search_terms = str_replace(" ", "+", $search_terms);

$guy = web_search($search_terms, $target_engine);

function web_search($terms, $target) 
{

	if($terms) 
	{
		$query = array();
		
		$query = "$target?hl=en&num=100&lr=&q=$terms";
		print($query . "<br>");
		
		$result = file_get_contents($query);
		//print($result);
		
		// gets all anchor tags on page
		// preg_match(pattern, string, container)
		$pattern = addslashes('#<a .*</a*>#i');
		$pattern2 = addslashes('#(<a .*href="http://.*</a>)#i');
		$pattern3 = addslashes('#(&nbsp;&nbsp;&nbsp;&nbsp;<a (0|[a-z1-9= -_\/"'':?.+&])*)>#i');
		print("pattern" . $pattern3 . "<br>");
		$pattern4 = addslashes('#<a(0|[a-z1-9= -_\/"'':?.+&])*Translate this page</a>#i');
		preg_match_all($pattern, $result, $links);
		
		$pagelinks = array();
		$num = 0;
		
		for($i=0;$i<count($links[0]);$i++)
		{
			//print($links[0][$i]);
			//if(preg_grep($pattern2 ,$links))
			if(!strpos($links[0][$i], "http://")) continue;
			else {
				$temp = preg_replace($pattern3, '', $links[0][$i]);
				$temp2 = preg_replace('/<a.*Similar&nbsp;pages<\/a>.*<\/nobr>.*<\/font\>/i', '', $temp);
				//$temp3 = preg_replace($pattern4, '', $temp2);
				$pagelinks[$num] = preg_replace('/<a.*Cached.*<\/a>[^<]/i', '', $temp2);
				print("<br />" . $num . " " . $pagelinks[$num]); //($pagelinks[$num] . "<br />");
				$num++;
			}
		}
		
		//print($links[1][1]);
	}
	else print("enter search term");
}
?>
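
Two details in the patterns above are easy to trip over, sketched here with made-up test strings. First, passing a pattern through addslashes doubles any backslash it contains, which changes what the regex matches. Second, a hyphen between two characters in a class, as in ` -_`, is read as a range. This is an illustration of the behavior, not a rewrite of the script:

```php
<?php
// Single-quoted PHP string: \/ stays as backslash + slash, i.e. an escaped "/".
$raw = '#<\/nobr>#i';
// addslashes() doubles the backslash, so the regex now demands a literal "\".
$slashed = addslashes($raw);

$subject = '</nobr>';
$rawHit = preg_match($raw, $subject);         // 1
$slashedHit = preg_match($slashed, $subject); // 0

// " -_" inside a class is the range 0x20..0x5F, which includes "<" and ">".
$rangeHit = preg_match('#^[a-z -_]+$#', 'a<b>');   // 1
// Putting the hyphen last makes it a literal character.
$literalHit = preg_match('#^[a-z _-]+$#', 'a<b>'); // 0

echo "$rawHit $slashedHit $rangeHit $literalHit\n"; // 1 0 1 0
```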