Modify all except links
Posted: Mon Nov 26, 2007 9:25 pm
by ntsf
I'd like to parse HTML and add links to given keywords, except for keywords already contained within a link (basically anything inside <a href=... </a>).
Example: replace Yahoo with <a href="http://www.yahoo.com">Yahoo!</a> but leave <a href="http://www.yahoo.com">Yahoo!</a> or <a href="http://www.google.com/yahoo">Not Yahoo but Google</a> as is.
It also needs to handle cases such as (yahoo), Yahoo or <yahoo>.
I tried the simple
but it's not as accurate as it needs to be.
Posted: Tue Nov 27, 2007 2:03 am
by GeertDD
Match keywords not followed by </a>:
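A minimal sketch of that idea, assuming a negative lookahead is used for the check (the helper name, sample string and URL below are illustrative only):

```php
<?php
// Hypothetical helper: link a keyword unless it is immediately followed by </a>.
function linkKeywordNotBeforeClosingTag(string $html, string $keyword, string $url): string
{
    // \b...\b matches the keyword as a whole word; (?!<\/a>) rejects any
    // match that is directly followed by a closing anchor tag.
    $pattern = '/\b(' . preg_quote($keyword, '/') . ')\b(?!<\/a>)/i';

    // $1 re-inserts the matched text, so the original capitalisation is kept.
    return preg_replace($pattern, '<a href="' . $url . '">$1</a>', $html);
}

echo linkKeywordNotBeforeClosingTag(
    'dell and <a href="#">dell</a>',
    'dell',
    'http://www.dell.com'
);
// Prints: <a href="http://www.dell.com">dell</a> and <a href="#">dell</a>
```

The lookahead only inspects the characters right after the keyword, so a keyword elsewhere inside a link (in the href, or followed by more link text) is still matched.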
Reply
Posted: Tue Nov 27, 2007 2:43 am
by ntsf
Thanks for the help. I don't think that's enough though.
Let's say I have:
Code: Select all
$a = '<pre>Dell (dell) wwwdell ....dell.. dell. <a href="http://www.dell.com"> dell</a>';
$cat_rep = preg_replace('/\b(dell)(?!<\/a>)\b/im', '<a href="http://www.dell.com">dell</a>', $a);
echo $cat_rep;
The desired output would be:
Code: Select all
<pre><a href="http://www.dell.com">dell</a> (<a href="http://www.dell.com">dell</a>) wwwdell ....<a href="http://www.dell.com">dell</a>.. <a href="http://www.dell.com">dell</a>. <a href="http://www.dell.com"> dell</a>
However, using the above code, I get:
Code: Select all
<pre><a href="http://www.dell.com">dell</a> (<a href="http://www.dell.com">dell</a>) wwwdell ....<a href="http://www.dell.com">dell</a>.. <a href="http://www.dell.com">dell</a>. <a href="http://www.<a href="http://www.dell.com">dell</a>.com"> dell</a>
Thanks.
Posted: Tue Nov 27, 2007 4:53 am
by GeertDD
The regex is also replacing the word 'dell' in URLs. An easy solution in your case would be to add another negative lookahead option.
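As a rough sketch of what a second lookahead option could look like, assuming the extra condition is "not followed by .com" (that choice is an assumption that fits this particular test string, not a general rule):

```php
<?php
// Test string from the previous post.
$a = '<pre>Dell (dell) wwwdell ....dell.. dell. <a href="http://www.dell.com"> dell</a>';

// Assumed pattern: skip "dell" when it is followed by </a> or by .com
$pattern = '/\bdell\b(?!<\/a>|\.com)/i';

$linked = preg_replace($pattern, '<a href="http://www.dell.com">dell</a>', $a);
echo $linked;
```

On this input the "dell" inside the href is followed by .com and the last "dell" is followed by </a>, so both are skipped and the result equals the desired output quoted above. The check still depends on what happens to follow the keyword, so it does not protect link text or URLs in general.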
Continued...
Posted: Tue Nov 27, 2007 1:53 pm
by ntsf
Thanks, that works on that test string, but it's easy to break, for example on the string in my OP: <a href="http://www.google.com/yahoo">Not Yahoo but Google</a>
Since it will run on HTML I'm not creating myself, I need it to be more reliable.
Is it possible to use a format that says "replace unless it's within a link tag" e.g. ignore anything between every instance of
something like (not working):
Code: Select all
$a = '<pre>Dell (dell) wwwdell ....dell.. dell. <a href="http://www.dell.com"> dell </a><a href="http://www.google.com/dell">Not Dell but Google</a> ';
$cat_rep = preg_replace('/\b(dell)(?!<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>|<\/a>|\.com)/is', '<a href="http://www.dell.com">dell</a>', $a);
echo $cat_rep;
(using the a href matching from viewtopic.php?t=73347)
Posted: Tue Nov 27, 2007 2:16 pm
by GeertDD
Try something like this and only replace capturing parentheses $1.
How
Posted: Tue Nov 27, 2007 3:02 pm
by ntsf
How do I replace the capturing parentheses $1? When I try that regex it's replacing the code in the link instead of ignoring it. Thanks.
Posted: Wed Nov 28, 2007 10:26 am
by GeertDD
Code: Select all
// Input:
$str = '<p>dell, (dell) and <a href="http://www.dell.be/">dell</a></p>';
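$str = preg_replace_callback(
    // Alternative 1 consumes an entire existing link; alternative 2 captures a bare keyword.
    '#<a\s(?:.*?)</a>|\b(dell)\b#is',
    // $matches[1] is only set when the keyword alternative matched.
    function ($matches) {
        return isset($matches[1])
            ? '<a href="http://www.dell.com">' . $matches[1] . '</a>'
            : $matches[0];
    },
    $str
);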
echo $str;
// Output:
// <p><a href="http://www.dell.com">dell</a>, (<a href="http://www.dell.com">dell</a>) and <a href="http://www.dell.be/">dell</a></p>
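The same approach can be stretched to several keywords at once. The following is only a sketch under assumptions: the function name linkKeywords and the keyword => URL array are hypothetical, and keys are assumed to be lowercase.

```php
<?php
// Hypothetical helper: $keywords maps a lowercase keyword to its URL.
function linkKeywords(string $html, array $keywords): string
{
    if ($keywords === []) {
        return $html; // nothing to link
    }

    // Escape each keyword so regex metacharacters are treated literally.
    $alternatives = implode('|', array_map(
        function ($word) {
            return preg_quote($word, '#');
        },
        array_keys($keywords)
    ));

    // First alternative swallows whole existing links; second captures a keyword.
    $pattern = '#<a\s.*?</a>|\b(' . $alternatives . ')\b#is';

    return preg_replace_callback(
        $pattern,
        function (array $m) use ($keywords) {
            if (!isset($m[1])) {
                return $m[0]; // existing link: return it unchanged
            }
            // Look up the URL case-insensitively, keep the matched spelling.
            $url = $keywords[strtolower($m[1])];
            return '<a href="' . $url . '">' . $m[1] . '</a>';
        },
        $html
    );
}

echo linkKeywords(
    '<p>Yahoo, (dell) and <a href="http://www.google.com/yahoo">Not Yahoo but Google</a></p>',
    ['yahoo' => 'http://www.yahoo.com', 'dell' => 'http://www.dell.com']
);
```

Because the existing link is consumed as a single match, neither the "yahoo" in its href nor the "Yahoo" in its text is touched. This is still regex-based, so malformed or nested markup may not be handled.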
Thanks!
Posted: Wed Nov 28, 2007 1:15 pm
by ntsf
Perfect! Thanks so much!