Modify all except links

ntsf
Forum Newbie
Posts: 5
Joined: Mon Nov 26, 2007 9:14 pm

Post by ntsf »

I'd like to parse HTML and add links to given keywords, except where a keyword is already contained within a link (basically anything inside <a href=... </a>).

Example: replace Yahoo with <a href="http://www.yahoo.com">Yahoo!</a> but leave
<a href="http://www.yahoo.com">Yahoo!</a> or <a href="http://www.google.com/yahoo">Not Yahoo but Google</a> as is.

It also needs to handle cases such as (yahoo), Yahoo, or <yahoo>.

I tried the simple

Code: Select all

[^\.yahoo\.]yahoo
but it's not as accurate as it needs to be.
Last edited by ntsf on Tue Nov 27, 2007 2:44 am, edited 1 time in total.
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Post by GeertDD »

Match keywords not followed by </a>:

Code: Select all

(keyword)(?!</a>)

Reply

Post by ntsf »

Thanks for the help. I don't think that's enough though.

Let's say I have:

Code: Select all

$a = '<pre>Dell (dell) wwwdell ....dell..  dell.  <a href="http://www.dell.com"> dell</a>';

$cat_rep = preg_replace('/\b(dell)(?!<\/a>)\b/im', '<a href="http://www.dell.com">dell</a>', $a);

echo $cat_rep;
The desired output would be:

Code: Select all

<pre><a href="http://www.dell.com">dell</a> (<a href="http://www.dell.com">dell</a>) wwwdell ....<a href="http://www.dell.com">dell</a>..  <a href="http://www.dell.com">dell</a>.  <a href="http://www.dell.com"> dell</a>
However, using the above code, I get:

Code: Select all

<pre><a href="http://www.dell.com">dell</a> (<a href="http://www.dell.com">dell</a>) wwwdell ....<a href="http://www.dell.com">dell</a>..  <a href="http://www.dell.com">dell</a>.  <a href="http://www.<a href="http://www.dell.com">dell</a>.com"> dell</a>
Thanks.

Post by GeertDD »

The regex is also replacing the word 'dell' inside URLs. An easy fix in your case is to add another alternative to the negative lookahead:

Code: Select all

\b(keyword)(?!</a>|\.com)

Continued...

Post by ntsf »

Thanks, that works on that test string, but it's easy to break, for example on the string in my OP: <a href="http://www.google.com/yahoo">Not Yahoo but Google</a>

Since this will run on HTML I'm not creating myself, I need it to be more reliable.
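
To make the breakage concrete, here is a minimal sketch that runs the lookahead pattern from the previous reply against the link from the original post. The variable names ($html, $link, $result) are only illustrative, and the keyword is swapped from dell to yahoo to match that example.

```php
<?php
// Sketch only: variable names are illustrative, not from the thread.
$html = '<a href="http://www.google.com/yahoo">Not Yahoo but Google</a>';
$link = '<a href="http://www.yahoo.com">Yahoo!</a>';

// Same idea as \b(keyword)(?!</a>|\.com), with "yahoo" as the keyword.
// Neither occurrence is directly followed by </a> or .com, so the
// lookahead protects neither of them.
$result = preg_replace('/\b(yahoo)(?!<\/a>|\.com)/i', $link, $html);

echo $result, "\n";
```

Both the "yahoo" in the href and the "Yahoo" in the link text get replaced, so the result contains anchors nested inside the original anchor. A lookahead can only inspect what comes right after the keyword, not whether the keyword sits somewhere inside a link.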

Is it possible to write a pattern that says "replace unless it's within a link tag", i.e. one that ignores anything between every instance of

Code: Select all

<a href=...</a>
something like (not working :)):

Code: Select all

$a = '<pre>Dell (dell) wwwdell ....dell..  dell.  <a href="http://www.dell.com"> dell </a><a href="http://www.google.com/dell">Not Dell but Google</a> ';

$cat_rep = preg_replace('/\b(dell)(?!<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>|<\/a>|\.com)/is', '<a href="http://www.dell.com">dell</a>', $a);

echo $cat_rep;
(using a href matching from viewtopic.php?t=73347)

Post by GeertDD »

Try something like this. The first alternative swallows a whole existing link, so only do the replacement when the capturing parentheses ($1) actually matched.

Code: Select all

<a\s(?:.*?)</a>|\b(keyword)
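
A rough sketch of why this alternation helps, assuming a small made-up input string ($html and $m are illustrative names). It only lists what each match looks like; it does not do the replacement.

```php
<?php
// Illustrative input; not taken from the thread.
$html = 'dell and <a href="http://www.dell.be/">dell</a>';

// Alternative 1 consumes a whole existing link.
// Alternative 2 captures a bare keyword in group 1.
preg_match_all('#<a\s(?:.*?)</a>|\b(dell)\b#is', $html, $m);

foreach ($m[0] as $i => $whole) {
    // With the default PREG_PATTERN_ORDER, group 1 is an empty string
    // for matches where alternative 1 (the link) was used.
    $kind = ($m[1][$i] !== '') ? 'keyword' : 'existing link';
    echo $kind, ': ', $whole, "\n";
}
```

Because the engine matches the entire link in one go, the "dell" inside it is never tried on its own. Group 1 is filled only for keywords outside links, which is what makes "replace only $1" possible.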

How

Post by ntsf »

How do I replace only the capturing parentheses $1? When I try that regex, it replaces the markup inside the link instead of ignoring it. Thanks.

Post by GeertDD »

Code: Select all

// Input:
$str = '<p>dell, (dell) and <a href="http://www.dell.be/">dell</a></p>';

// Existing links match the first alternative and are returned unchanged.
// A bare keyword matches the second alternative and fills $matches[1].
// (Anonymous function used because create_function() was removed in PHP 8.)
$str = preg_replace_callback(
	'#<a\s(?:.*?)</a>|\b(dell)\b#is',
	function ($matches) {
		return isset($matches[1])
			? '<a href="http://www.dell.com">' . $matches[1] . '</a>'
			: $matches[0];
	},
	$str
);

echo $str;

// Output:
// <p><a href="http://www.dell.com">dell</a>, (<a href="http://www.dell.com">dell</a>) and <a href="http://www.dell.be/">dell</a></p>
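
If several keywords are needed, the same idea can probably be wrapped in a small helper. This is only a sketch: linkKeywords() and the $links map are hypothetical names that do not appear in the thread, and it assumes each keyword maps to exactly one URL.

```php
<?php
// Hypothetical helper (not from the thread): link several keywords at once,
// leaving anything inside an existing <a ...>...</a> untouched.
function linkKeywords(string $html, array $links): string
{
    // Escape each keyword so characters like "." or "#" are taken literally.
    $words = array_map(function ($w) {
        return preg_quote($w, '#');
    }, array_keys($links));

    $pattern = '#<a\s.*?</a>|\b(' . implode('|', $words) . ')\b#is';

    // Lower-case the map keys so the lookup is case-insensitive.
    $lookup = array_change_key_case($links, CASE_LOWER);

    return preg_replace_callback($pattern, function ($m) use ($lookup) {
        if (!isset($m[1]) || $m[1] === '') {
            return $m[0]; // an existing link: hand it back unchanged
        }
        $url = $lookup[strtolower($m[1])];
        // Keep the keyword's original casing as the link text.
        return '<a href="' . htmlspecialchars($url) . '">' . $m[1] . '</a>';
    }, $html);
}

$links = [
    'dell'  => 'http://www.dell.com',
    'yahoo' => 'http://www.yahoo.com',
];
$out = linkKeywords(
    '<p>Dell, (yahoo) and <a href="http://www.google.com/yahoo">Not Yahoo but Google</a></p>',
    $links
);
echo $out, "\n";
```

One caveat: this only protects existing links. A keyword inside another tag's attribute (for example an img alt text) would still be replaced, so untrusted or messy HTML may need additional alternatives in the pattern.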

Thanks!

Post by ntsf »

Perfect! Thanks so much!