Any questions involving matching text strings to patterns - the pattern is called a "regular expression."
Moderator: General Moderators
ntsf
Forum Newbie
Posts: 5 Joined: Mon Nov 26, 2007 9:14 pm
Post
by ntsf » Mon Nov 26, 2007 9:25 pm
I'd like to parse HTML and add links to given keywords, except for keywords that are already inside a link (basically anything inside <a href=... </a>).
Example: replace Yahoo with <a href="http://www.yahoo.com">Yahoo!</a> but leave
<a href="http://www.yahoo.com">Yahoo!</a> or <a href="http://www.google.com/yahoo">Not Yahoo but Google</a> as is.
It also needs to handle variations, e.g. (yahoo) or Yahoo or <yahoo>.
I tried a simple replacement, but it's not as accurate as it needs to be.
Last edited by ntsf on Tue Nov 27, 2007 2:44 am, edited 1 time in total.
GeertDD
Forum Contributor
Posts: 274 Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium
Post
by GeertDD » Tue Nov 27, 2007 2:03 am
Match keywords not followed by </a>:
by ntsf » Tue Nov 27, 2007 2:43 am
Thanks for the help. I don't think that's enough though.
Let's say I have:
Code: Select all
$a = '<pre>Dell (dell) wwwdell ....dell.. dell. <a href="http://www.dell.com"> dell</a>';
$cat_rep = preg_replace('/\b(dell)(?!<\/a>)\b/im', '<a href="http://www.dell.com">dell</a>', $a);
echo $cat_rep;
The desired output would be:
Code: Select all
<pre><a href="http://www.dell.com">dell</a> (<a href="http://www.dell.com">dell</a>) wwwdell ....<a href="http://www.dell.com">dell</a>.. <a href="http://www.dell.com">dell</a>. <a href="http://www.dell.com"> dell</a>
However, using the above code, I get:
Code: Select all
<pre><a href="http://www.dell.com">dell</a> (<a href="http://www.dell.com">dell</a>) wwwdell ....<a href="http://www.dell.com">dell</a>.. <a href="http://www.dell.com">dell</a>. <a href="http://www.<a href="http://www.dell.com">dell</a>.com"> dell</a>
Thanks.
by GeertDD » Tue Nov 27, 2007 4:53 am
The regex is also replacing the word 'dell' in URLs. An easy solution in your case would be to add another negative lookahead option.
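The snippet that went with this reply is not preserved here. As a rough sketch of the idea only, and assuming the extra lookahead alternative is `\.com` (the follow-up post suggests this, but it is not confirmed), the pattern could be extended like this:

```php
<?php
// Sketch only: the '\.com' alternative is an assumption, not the original snippet.
$a = '<pre>Dell (dell) wwwdell ....dell.. dell. <a href="http://www.dell.com"> dell</a>';
$link = '<a href="http://www.dell.com">dell</a>';

// Skip "dell" when it is directly followed by </a> or by ".com".
$pattern = '/\b(dell)\b(?!<\/a>|\.com)/i';
$cat_rep = preg_replace($pattern, $link, $a);
echo $cat_rep, "\n";

// The lookahead only inspects what comes right after the keyword,
// so a keyword elsewhere in a URL or in link text is still replaced.
$b = '<a href="http://www.google.com/dell">Not Dell but Google</a>';
$broken = preg_replace($pattern, $link, $b);
echo $broken, "\n";
```

This handles the test string, but it stays fragile: the pattern knows nothing about whether the keyword sits inside a tag, only about the few characters that follow it.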
by ntsf » Tue Nov 27, 2007 1:53 pm
Thanks, that works on that test string, but it's easy to break, for example on the string in my OP: <a href="http://www.google.com/yahoo">Not Yahoo but Google</a>
Since it's going to run on HTML I'm not creating, I need it to be more reliable.
Is it possible to use a pattern that says "replace unless it's within a link tag", e.g. ignore anything between every <a ...> and its closing </a>?
Something like this (not working):
Code: Select all
$a = '<pre>Dell (dell) wwwdell ....dell.. dell. <a href="http://www.dell.com"> dell </a><a href="http://www.google.com/dell">Not Dell but Google</a> ';
$cat_rep = preg_replace('/\b(dell)(?!<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>|<\/a>|\.com)/is', '<a href="http://www.dell.com">dell</a>', $a);
echo $cat_rep;
(using the a href matching from viewtopic.php?t=73347)
by GeertDD » Tue Nov 27, 2007 2:16 pm
Try something like this and only replace capturing parentheses $1.
by ntsf » Tue Nov 27, 2007 3:02 pm
How do I replace the capturing parentheses $1? When I try that regex it's replacing the code in the link instead of ignoring it. Thanks.
by GeertDD » Wed Nov 28, 2007 10:26 am
Code: Select all
// Input:
$str = '<p>dell, (dell) and <a href="http://www.dell.be/">dell</a></p>';
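$str = preg_replace_callback(
    // First alternative swallows a whole <a ...>...</a> element;
    // second alternative captures the bare keyword in group 1.
    '#<a\s(?:.*?)</a>|\b(dell)\b#is',
    create_function(
        '$matches',
        // Group 1 is set only for a bare keyword; existing links are returned unchanged.
        'return (isset($matches[1])) ? \'<a href="http://www.dell.com">\'.$matches[1].\'</a>\' : $matches[0];'
    ),
    $str
);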
echo $str;
// Output:
// <p><a href="http://www.dell.com">dell</a>, (<a href="http://www.dell.com">dell</a>) and <a href="http://www.dell.be/">dell</a></p>
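Note that create_function() was deprecated in PHP 7.2 and removed in PHP 8.0, so the code above no longer runs on current PHP. A minimal sketch of the same technique with an anonymous function follows. The helper name link_keyword() and its parameters are hypothetical, introduced only for illustration; the pattern itself is the one from the post, with the keyword passed through preg_quote().

```php
<?php
// Hypothetical helper (not from the thread): links a keyword wherever it
// appears outside an existing <a ...>...</a> element.
function link_keyword(string $html, string $keyword, string $url): string
{
    // Alternative 1 consumes an entire <a ...>...</a> element.
    // Alternative 2 captures the bare keyword in group 1.
    $pattern = '#<a\s.*?</a>|\b(' . preg_quote($keyword, '#') . ')\b#is';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($url): string {
            // Group 1 exists only when the keyword alternative matched;
            // otherwise the match is an existing link and is returned as-is.
            return isset($m[1])
                ? '<a href="' . htmlspecialchars($url) . '">' . $m[1] . '</a>'
                : $m[0];
        },
        $html
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}

$str = '<p>dell, (dell) and <a href="http://www.dell.be/">dell</a></p>';
echo link_keyword($str, 'dell', 'http://www.dell.com'), "\n";
```

Because the link alternative comes first and consumes the whole element, keywords in the href or in the link text are never seen by the keyword alternative. Keywords inside other tags' attributes (an alt attribute, for instance) would still be linked by this sketch.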
by ntsf » Wed Nov 28, 2007 1:15 pm
Perfect! Thanks so much!