Regex SEO Links: Technical Error

Posted: Fri Mar 19, 2010 12:41 pm
by djstaz0ne
Hi,

I am working on an SEO keyword-linking script and I need help fixing a bug in my regex pattern.

Code:

$keywordz = array("api integration salesforce integration","salesforce integration","api integration","web services");
Here is my pattern ($url can be any URL):

Code:

for ($i = 0; $i < count($keywordz); $i++) {
    $pattern = '!(<[^a][^>]*>[^<]*)('.$keywordz[$i].')!i';
    $replacement = '$1<a href="/'.$url.'">$2</a>';
    $text = preg_replace($pattern, $replacement, $text);
}

Here is the text the pattern is having problems with:

Code:

<h1>Salesforce Integration and API Integration in New York</h1>
<p>Perpetual Technologies Unltd. specializes in api integration and Salesforce Integration in New York City. We have extensive experience in integrating CRM Systems with additional corporate data, utilizing web services APIs - working with SOAP and PHP. We can connect your "online work requests" to salesforce, automatically storing them in a database, and automatically generating and emailing "job tickets", listing all relevant project-related information. Salesforce Integration will help your company operate more smoothly.</p>
The Problem:
The first occurrence of "Salesforce Integration" does not get hyperlinked.

Can anyone help me out?

Thanks in advance,

Nik

Re: Regex SEO Links: Technical Error

Posted: Fri Mar 19, 2010 4:27 pm
by ridgerunner
djstaz0ne wrote: ... The Problem:
The first occurrence of "Salesforce Integration" does not get hyperlinked.
Actually, the first occurrence of "Salesforce Integration" does get hyperlinked (the one following the <h1> tag). After running your code, this is the result string I get (with the "missing link" highlighted in red):

Code:

<h1><a href="/http://example.com">Salesforce Integration</a> and <a href="/http://example.com">API Integration</a> in New York</h1>
<p>Perpetual Technologies Unltd. specializes in <a href="/http://example.com">api integration</a> and [color=#FF0000][b]Salesforce Integration[/b][/color] in New York City. We have extensive experience in integrating CRM Systems with additional corporate data, utilizing <a href="/http://example.com">web services</a> APIs - working with SOAP and PHP. We can connect your "online work requests" to salesforce, automatically storing them in a database, and automatically generating and emailing "job tickets", listing all relevant project-related information. <a href="/http://example.com">Salesforce Integration</a> will help your company operate more smoothly.</p>
Here is what your regex is saying: following any HTML opening or closing tag whose first character after the "<" is not "a" (which rules out opening anchor tags), find the last occurrence of the keyword before the next left angle bracket and capture it in group 2.
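
A minimal sketch of that greedy behaviour may help. The sample text and the "/example" href below are made up for illustration and are not from the original post:

```php
<?php
// Hypothetical sample: one <p> text node containing the keyword three times.
$text = '<p>web services here, web services there, web services everywhere</p>';
$pattern = '!(<[^a][^>]*>[^<]*)(web services)!i';

preg_match($pattern, $text, $m);
// Group 1 is the tag plus whatever the greedy [^<]* kept after backtracking,
// so group 2 ends up being the LAST keyword in the text node.
echo $m[1], "\n"; // <p>web services here, web services there, 
echo $m[2], "\n"; // web services

// The same pattern used for replacement links only that last occurrence.
$linked = preg_replace($pattern, '$1<a href="/example">$2</a>', $text);
echo $linked, "\n";
```

The two earlier occurrences end up inside group 1, so they are copied back unchanged.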

And it is doing exactly what you are asking it to do! To illustrate the problem, let's change the sub-expression right before the keyword to use a lazy rather than a greedy quantifier by appending a '?' to it, like so:

Code:

for ($i = 0; $i < count($keywordz); $i++) {
    $pattern = '!(<[^a][^>]*>[^<]*?)('.$keywordz[$i].')!i';
    $replacement = '$1<a href="/'.$url.'">$2</a>';
    $text = preg_replace($pattern, $replacement, $text);
}
When you run this regex on your test data, it now matches only the first occurrence of the keyword in each run of text, like so:

Code:

<h1><a href="/http://example.com">Salesforce Integration</a> and <a href="/http://example.com">API Integration</a> in New York</h1>
<p>Perpetual Technologies Unltd. specializes in <a href="/http://example.com">api integration</a> and <a href="/http://example.com">Salesforce Integration</a> in New York City. We have extensive experience in integrating CRM Systems with additional corporate data, utilizing <a href="/http://example.com">web services</a> APIs - working with SOAP and PHP. We can connect your "online work requests" to salesforce, automatically storing them in a database, and automatically generating and emailing "job tickets", listing all relevant project-related information. [color=#FF0000][b]Salesforce Integration[/b][/color] will help your company operate more smoothly.</p>
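
To compare the two quantifiers side by side, here is a small sketch on a made-up string (the sample text and the [..] marker are assumptions for illustration). It uses the optional fifth argument of preg_replace to count replacements:

```php
<?php
// Hypothetical sample: one text node with two occurrences of the keyword.
$text = '<p>foo bar, then foo bar again</p>';

$greedy = '!(<[^a][^>]*>[^<]*)(foo bar)!i';   // [^<]*  grabs as much as possible
$lazy   = '!(<[^a][^>]*>[^<]*?)(foo bar)!i';  // [^<]*? grabs as little as possible

// The fifth argument receives the number of replacements performed.
$outGreedy = preg_replace($greedy, '$1[$2]', $text, -1, $nGreedy);
$outLazy   = preg_replace($lazy,   '$1[$2]', $text, -1, $nLazy);

echo $outGreedy, " ($nGreedy)\n"; // <p>foo bar, then [foo bar] again</p> (1)
echo $outLazy,   " ($nLazy)\n";   // <p>[foo bar], then foo bar again</p> (1)
```

Either way only one replacement happens per text node: every match has to start at a tag, and once that tag has been consumed, the rest of the text node has no tag left in front of it.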
It appears that your intent is to add hyperlinks to all keywords that have not already been linkified, but that is not what this regex does: every match needs its own preceding tag, so at most one keyword gets linked per run of text between tags. To make sure that you do not add a hyperlink inside another hyperlink, you need to match whole hyperlinks and skip any replacement inside them. This can be accomplished with a modified regex and the preg_replace_callback function, like so:

Code:

<?php // test.php version 2010-03-19
$text = '<h1>Salesforce Integration and API Integration in New York</h1>
<p>Perpetual Technologies Unltd. specializes in api integration and Salesforce Integration in New York City. We have extensive experience in integrating CRM Systems with additional corporate data, utilizing web services APIs - working with SOAP and PHP. We can connect your "online work requests" to salesforce, automatically storing them in a database, and automatically generating and emailing "job tickets", listing all relevant project-related information. Salesforce Integration will help your company operate more smoothly.</p>';
 
$keywordz = array(
    "api integration salesforce integration", // order of this array is important
    "salesforce integration",
    "api integration",
    "web services");
 
$url = 'http://example.com';
 
foreach ($keywordz as $keyword) {
    // preg_quote() escapes regex metacharacters (and the "!" delimiter) in the keyword
    $pattern = '!(<a\b[^>]*>.*?</a>)|('.preg_quote($keyword, '!').')!i';
    $text = preg_replace_callback($pattern, 're_callback', $text);
}
function re_callback($matches) {
    global $url;
    if ($matches[1]) {             // Case 1: this is a <a..>...</a>
        return $matches[1];        // return it unmodified
    }
    elseif ($matches[2]) {          // Case 2: a non-linked keyword
        return '<a href="/'.$url.'">'.$matches[2].'</a>';
    }
    exit("Error!");                // never get here
}
file_put_contents('out.txt', $text);
?>
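
As a possible variation on the script above (a sketch only, not the code from this post), all keywords can be combined into one pattern so the text is scanned once, with a closure in place of the global. The helper name linkKeywords, the sample HTML and the "seo" path are hypothetical. The extra `<[^>]*>` alternative, which also skips keywords inside tag attributes, is an addition of this sketch. Arrow functions assume PHP 7.4 or later.

```php
<?php
// Hypothetical helper: link every keyword in a single pass.
function linkKeywords(string $text, array $keywords, string $url): string
{
    // Longest keywords first, so a longer phrase wins over a phrase it contains.
    usort($keywords, fn($a, $b) => strlen($b) - strlen($a));

    // Escape each keyword so metacharacters and the "!" delimiter match literally.
    $alternatives = implode('|', array_map(fn($k) => preg_quote($k, '!'), $keywords));

    // Group 1: a whole <a>...</a> element, or any other single tag (skipped as-is).
    // Group 2: a keyword found outside of those.
    $pattern = '!(<a\b[^>]*>.*?</a>|<[^>]*>)|(' . $alternatives . ')!is';

    $result = preg_replace_callback($pattern, function (array $m) use ($url): string {
        if ($m[1] !== '') {
            return $m[1];                                   // tag or existing link
        }
        return '<a href="/' . $url . '">' . $m[2] . '</a>'; // bare keyword
    }, $text);

    return $result ?? $text; // on a regex error, fall back to the original text
}

// Usage with made-up input:
$html = '<p>api integration and <a href="/x">web services</a> and Web Services</p>';
$out = linkKeywords($html, ['web services', 'api integration'], 'seo');
echo $out, "\n";
```

Because every keyword is handled in the same pass, a link inserted for one keyword is never rescanned for another.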
Hope this helps! :)

Re: Regex SEO Links: Technical Error

Posted: Fri Mar 19, 2010 4:58 pm
by djstaz0ne
Thank you for your help. :D

I didn't know about the uses for the preg_replace_callback function.
Now my regex endeavors should be a lot easier.

Thanks,

-Nik