RegEX Code not working at all

Posted: Sat Dec 17, 2005 6:57 pm
by bbentp
Hello everyone,

I'm OK with the very simple regular expressions that I would normally need to design a site, but when it comes to intermediate or advanced regexes, I get totally lost!

I made a script that pulls an article or news story from my MySQL database. After pulling it, the PHP script compares the article/news body to a glossary of terms I've set up. It matches the terms in the glossary perfectly... The only issue is that it matches every occurrence, whether it's inside an HTML tag or not...

I need it so that it won't match any terms found that are in any HTML tags unless it's a '<span>' or '<strong>'...

I've listed the code below and appreciate any suggestions or corrections that can be provided...


/* Pull each glossary term from DB */

Code: Select all

$keywordsArray = array();
$queryA  = "SELECT * FROM glossary WHERE deviceGlossary != 'device' ORDER BY title DESC";
$resultA = mysql_query($queryA) or die("Died: ".mysql_error());

/* Run through each term */

while ($rows = mysql_fetch_object($resultA)) {

    $keywordsArray[$rows->title] = $rows->id;

    // Strip the space after each comma, then split the cleaned list
    $addKeyword  = preg_replace("/, /", ",", $rows->add_keywords);
    $addKeywords = explode(",", $addKeyword);
    for ($y = 0; $y < count($addKeywords); $y++) {
        $keywordsArray[$addKeywords[$y]] = $rows->id;
    }

}

/* Terms loaded into array */
 
	 $foundMatches = array_combine($keywordsArray, $idsArray); 
	 $contents = explode(" ", $artMainContents);


/* Foreach term compare to article/news.. and replace in the body (article/news story) */
	 
	 foreach($keywordsArray as $title=>$titid) {
	  if ($title != "") {
      $artMainContents  = preg_replace('#\b('.preg_quote($title,'#').')\b#i',"<a href=\"javascript: nullVoid();\" onClick=\"javascript: showTerm('".$titid."', 'default');\" class=\"glossaryTerm\" title=\"Lookup term in the wireless glossary.\">\\1</a>",$artMainContents);
	  }
	 }

Posted: Sat Dec 17, 2005 9:21 pm
by shoebappa
How about this:

Code: Select all

$artMainContents  = preg_replace('#(<span[^>]*>|<strong[^>]*>)([^<]*)\b('.preg_quote($title,'#').')\b#i',"\\1\\2<a href=\"javascript: nullVoid();\" onClick=\"javascript: showTerm('".$titid."', 'default');\" class=\"glossaryTerm\" title=\"Lookup term in the wireless glossary.\">\\3</a>",$artMainContents);
Don't ask me why it's changing colons to entities, so if you copy the code, make sure to look for the &#058 instead of the colons after Javascript...

I hope that's what you mean: it basically matches the title when it's within a span or strong. At least it should... It won't work if there are nested tags in between.

So it would match <span attributes>this is a title or <strong>this is a title, but not <span attributes>this is a <em>title</em>, or even <span attributes><em>this is</em> a title. I just have it saying "not a <", so as soon as there is one, it stops. I guess you could say "not an ending tag", or even "not an ending span or strong tag". Note that since it's matching that stuff, you need to put it back when it's replaced, so I added the \\1\\2 to the front of the anchor.
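
To make that behaviour concrete, here is a small sketch that runs the same pattern against the three kinds of input described above. The term `title`, the sample strings, and the shortened `<a>` replacement are made up for illustration only.

```php
<?php
// Hypothetical sample term and a shortened replacement, for illustration only.
$title   = 'title';
$pattern = '#(<span[^>]*>|<strong[^>]*>)([^<]*)\b(' . preg_quote($title, '#') . ')\b#i';
$replace = '\\1\\2<a class="glossaryTerm">\\3</a>';

$samples = array(
    '<span class="x">this is a title',          // plain text after the tag
    '<strong>this is a title',                  // same, with <strong>
    '<span class="x">this is a <em>title</em>', // nested tag before the term
);

$results = array();
foreach ($samples as $sample) {
    // \\1 and \\2 put the opening tag and the leading text back in place.
    $results[] = preg_replace($pattern, $replace, $sample);
}

print_r($results);
```

The first two strings get the anchor wrapped around the term; the third comes back unchanged, because `[^<]*` cannot cross the nested `<em>`.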

I think there's a way to say see if it's there but ignore it, but I haven't used that yet. Someone on here the other day brought up "look behinds" and I think they would be ideal here if I knew how to use them : )
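
For what it's worth, PCRE lookbehinds have to be fixed length, so they can't look back to an arbitrary `<`. A negative lookahead can do a similar "check it but don't consume it" job instead. The sketch below is only one possible approach, not a tested fix for the script: `linkTermOutsideTags` is a hypothetical helper name, the `href` is shortened, and it assumes the HTML has no stray `>` inside plain text. It only protects tag attributes; text inside an existing `<a>...</a>` would still be matched.

```php
<?php
// Hypothetical helper: links a term only when it is not inside an HTML tag.
function linkTermOutsideTags($html, $term, $id)
{
    // (?![^<>]*>) rejects a match when a '>' comes before the next '<',
    // which is what happens when the term sits inside a tag's attributes.
    $pattern = '#\b(' . preg_quote($term, '#') . ')\b(?![^<>]*>)#i';

    // Shortened anchor; showTerm() is the function used in the original script.
    $link = '<a href="#" onclick="showTerm(\'' . $id . '\', \'default\');" class="glossaryTerm">\\1</a>';

    return preg_replace($pattern, $link, $html);
}

// Made-up sample input: the term appears in src, in alt, and in plain text.
$html = '<img src="http://cingular.example/logo.gif" alt="Cingular logo" /> Cingular launches';
$out  = linkTermOutsideTags($html, 'Cingular', 7);
echo $out, "\n";
```

Only the last occurrence, the one in plain text, is wrapped in the anchor; the `src` and `alt` attributes are left alone.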

Side note, the PHP tags instead of just Code will highlight the php code enclosed, to make it easier to read.

Posted: Sat Dec 17, 2005 10:43 pm
by Burrito
Moved to Regex

Well..

Posted: Sat Dec 17, 2005 11:14 pm
by bbentp
That helped my understanding of regex and was a good lesson in matching tags, but the script does exactly the same thing as it did before... Basically, I want it to match the terms as long as they're not in an <a> or <img> tag.

I included a link to the page in question for a better idea of what I'm talking about. Look at the page and scroll down and you'll notice where the image should be, but since it matches 'Cingular' as a term, it changes the code, and this is what I'm trying to avoid.


http://www.umts-hsdpa.com/index.php/New ... r.3G.Users

If you look at the middle of the page you see this line:

Code: Select all

cingular.cingular.net/mycingular/SupportingFiles/GIF/hbo_mobile_logo_s.gif" border="0" alt="HBO Mobile" />
Which should be an image tag like this:

Code: Select all

<img src="http://cingular.cingular.net/mycingular/SupportingFiles/GIF/hbo_mobile_logo_s.gif" border="0" alt="HBO Mobile" />
That's the major problem that I'm running into. The whole article is in a <span>, so I need to match anything except what's in an <img> or <a>, so as not to break any images or links already in place that may contain those keywords.

Thank you again for your assistance!!!
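
One possible way to get the "anything but <img> or <a>" behaviour, sketched here under assumptions rather than taken from the thread: split the article into tags and text, leave every tag and every complete `<a>...</a>` element untouched, and run the term replacement only on the text pieces in between. `linkGlossaryTerms` is a hypothetical helper, the `href` is shortened, and `$keywords` is assumed to have the same term => id shape as `$keywordsArray` in the first post. All terms go into one pattern so a later term can't match inside an anchor that was just inserted.

```php
<?php
// Hypothetical helper; $keywords maps term => glossary id.
function linkGlossaryTerms($html, array $keywords)
{
    $ids = array();
    $alternatives = array();
    foreach ($keywords as $term => $id) {
        $term = (string) $term;
        if ($term === '') {
            continue;
        }
        $ids[strtolower($term)] = $id;          // case-insensitive id lookup
        $alternatives[] = preg_quote($term, '#');
    }
    if (!$alternatives) {
        return $html;
    }
    // Longer terms first, so they win over shorter terms they contain.
    usort($alternatives, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    $termPattern = '#\b(' . implode('|', $alternatives) . ')\b#i';

    // Pieces alternate: text, delimiter, text, ... A delimiter is either a
    // whole <a>...</a> element or any single tag such as <img ...>.
    $parts = preg_split('#(<a\b[^>]*>.*?</a>|<[^>]+>)#is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // odd indexes hold tags and existing links: leave as-is
        }
        $parts[$i] = preg_replace_callback($termPattern, function ($m) use ($ids) {
            $id = $ids[strtolower($m[1])];
            return '<a href="#" onclick="showTerm(\'' . $id . '\', \'default\');" class="glossaryTerm">'
                 . $m[1] . '</a>';
        }, $part);
    }
    return implode('', $parts);
}

// Made-up sample input with an image, an existing link, and plain text.
$html = '<span><img src="http://cingular.example/hbo.gif" alt="HBO Mobile" /> '
      . 'Cingular offers <a href="/hbo">HBO Mobile</a> and HBO clips.</span>';
$out  = linkGlossaryTerms($html, array('Cingular' => 1, 'HBO' => 2));
echo $out, "\n";
```

Regex-based splitting like this is only approximate for real-world HTML (comments, scripts, or a `>` inside an attribute value would confuse it), so a DOM parser may be the more robust choice if the articles are messy.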