
Modify all except links

Posted: Mon Nov 26, 2007 9:25 pm
by ntsf
I'd like to parse HTML and add links to given keywords, except where a keyword is already contained within a link (basically anything inside <a href=... </a>).

Example: replace Yahoo with <a href="http://www.yahoo.com">Yahoo!</a> but leave
<a href="http://www.yahoo.com">Yahoo!</a> or <a href="http://www.google.com/yahoo">Not Yahoo but Google</a> as is.

It also needs to account for cases such as (yahoo), Yahoo, or <yahoo>.

I tried the simple

Code: Select all

[^\.yahoo\.]yahoo
but it's not as accurate as it needs to be.
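
For context, square brackets form a character class, so the bracketed part of that pattern matches exactly one character that is not a dot, y, a, h or o, rather than "anything other than .yahoo.". A minimal sketch of the effect (the delimiters and the variable names are added here only for illustration):

```php
<?php
// The pattern from the post, wrapped in / delimiters so PHP can run it.
$pattern = '/[^\.yahoo\.]yahoo/';

// The class consumes one extra character in front of the keyword,
// so a replacement would swallow the "(" as well.
preg_match($pattern, '(yahoo)', $m);
$consumed = $m[0];

// With no character in front of the keyword the class has nothing to match.
$atStart = preg_match($pattern, 'yahoo');

// A dot in front is excluded by the class, so this does not match either.
$afterDot = preg_match($pattern, 'www.yahoo.com');

var_dump($consumed, $atStart, $afterDot);
```

So a keyword at the very start of the text is missed, and wherever the pattern does match, the preceding character becomes part of what gets replaced.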

Posted: Tue Nov 27, 2007 2:03 am
by GeertDD
Match keywords not followed by </a>:

Code: Select all

(keyword)(?!</a>)

Reply

Posted: Tue Nov 27, 2007 2:43 am
by ntsf
Thanks for the help. I don't think that's enough though.

Let's say I have:

Code: Select all

$a = '<pre>Dell (dell) wwwdell ....dell..  dell.  <a href="http://www.dell.com"> dell</a>';

$cat_rep = preg_replace('/\b(dell)(?!<\/a>)\b/im', '<a href="http://www.dell.com">dell</a>', $a);

echo $cat_rep;
The desired output would be:

Code: Select all

<pre><a href="http://www.dell.com">dell</a> (<a href="http://www.dell.com">dell</a>) wwwdell ....<a href="http://www.dell.com">dell</a>..  <a href="http://www.dell.com">dell</a>.  <a href="http://www.dell.com"> dell</a>
However, using the above code, I get:

Code: Select all

<pre><a href="http://www.dell.com">dell</a> (<a href="http://www.dell.com">dell</a>) wwwdell ....<a href="http://www.dell.com">dell</a>..  <a href="http://www.dell.com">dell</a>.  <a href="http://www.<a href="http://www.dell.com">dell</a>.com"> dell</a>
Thanks.

Posted: Tue Nov 27, 2007 4:53 am
by GeertDD
The regex is also replacing the word 'dell' inside URLs. An easy solution in your case would be to add another alternative to the negative lookahead.

Code: Select all

\b(keyword)(?!</a>|\.com)
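
As a rough sketch, this is how that pattern might be applied to the test string from the previous post. The trailing \b and the /i flag are carried over from the earlier attempt, and $pattern and $result are just illustrative names:

```php
<?php
// Test string from the previous post.
$a = '<pre>Dell (dell) wwwdell ....dell..  dell.  <a href="http://www.dell.com"> dell</a>';

// Skip "dell" when it is immediately followed by </a> or by .com.
$pattern = '/\b(dell)(?!<\/a>|\.com)\b/i';

$result = preg_replace($pattern, '<a href="http://www.dell.com">dell</a>', $a);

echo $result;
```

The lookahead only inspects the characters directly after the keyword, so it cannot tell whether the keyword sits somewhere else inside a link's URL or text.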

Continued...

Posted: Tue Nov 27, 2007 1:53 pm
by ntsf
Thanks, that works on that test string, but it's easy to break, for example on the string in my OP: <a href="http://www.google.com/yahoo">Not Yahoo but Google</a>

Since this will run on HTML I'm not creating myself, I need it to be more reliable.

Is it possible to use a pattern that says "replace unless it's within a link tag", e.g. ignore anything between every instance of

Code: Select all

<a href=...</a>
something like (not working :)):

Code: Select all

$a = '<pre>Dell (dell) wwwdell ....dell..  dell.  <a href="http://www.dell.com"> dell </a><a href="http://www.google.com/dell">Not Dell but Google</a> ';

$cat_rep = preg_replace('/\b(dell)(?!<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>|<\/a>|\.com)/is', '<a href="http://www.dell.com">dell</a>', $a);

echo $cat_rep;
(using a href matching from viewtopic.php?t=73347)

Posted: Tue Nov 27, 2007 2:16 pm
by GeertDD
Try something like this: match whole links first so they get skipped, and only do the replacement when the capturing parentheses ($1) actually matched.

Code: Select all

<a\s(?:.*?)</a>|\b(keyword)

How

Posted: Tue Nov 27, 2007 3:02 pm
by ntsf
How do I replace the capturing parentheses $1? When I try that regex it's replacing the code in the link instead of ignoring it. Thanks.

Posted: Wed Nov 28, 2007 10:26 am
by GeertDD

Code: Select all

// Input:
$str = '<p>dell, (dell) and <a href="http://www.dell.be/">dell</a></p>';

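$str = preg_replace_callback
(
	'#<a\s(?:.*?)</a>|\b(dell)\b#is',
	function ($matches)
	{
		// Group 1 is only set when the bare keyword matched.
		// Otherwise the match is an existing link, which is returned unchanged.
		return (isset($matches[1]))
			? '<a href="http://www.dell.com">'.$matches[1].'</a>'
			: $matches[0];
	},
	$str
);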

echo $str;

// Output:
// <p><a href="http://www.dell.com">dell</a>, (<a href="http://www.dell.com">dell</a>) and <a href="http://www.dell.be/">dell</a></p>
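
The same approach could probably be stretched to several keywords at once. The sketch below is only an illustration under a few assumptions: linkKeywords() and its $links map (keyword => URL) are hypothetical names that do not come from this thread, the keywords are plain ASCII words, and the HTML is regular enough for a regex to cope with.

```php
<?php
// Hypothetical helper: $links maps each keyword to the URL it should link to.
function linkKeywords(string $html, array $links): string
{
    if (!$links) {
        return $html; // nothing to link
    }

    // Lower-case the keys so the callback can look keywords up case-insensitively.
    $links = array_change_key_case($links, CASE_LOWER);

    // Escape each keyword for use inside a '#'-delimited pattern.
    $alternatives = [];
    foreach (array_keys($links) as $keyword) {
        $alternatives[] = preg_quote((string) $keyword, '#');
    }

    // Existing links are matched first and skipped; bare keywords land in group 1.
    $pattern = '#<a\s(?:.*?)</a>|\b(' . implode('|', $alternatives) . ')\b#is';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($links) {
            if (!isset($m[1])) {
                return $m[0]; // an existing link: leave it untouched
            }
            $url = $links[strtolower($m[1])];
            // Keep the keyword's original capitalisation in the link text.
            return '<a href="' . htmlspecialchars($url) . '">' . $m[1] . '</a>';
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

$linked = linkKeywords(
    '<p>Dell, (yahoo) and <a href="http://www.dell.be/">dell</a></p>',
    ['dell' => 'http://www.dell.com', 'yahoo' => 'http://www.yahoo.com']
);

echo $linked;
```

Like the single-keyword version, this only protects text between <a and </a>. A keyword inside another tag's attribute (an alt text, for instance) would still be wrapped in a link.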

Thanks!

Posted: Wed Nov 28, 2007 1:15 pm
by ntsf
Perfect! Thanks so much!