Automatically linking 1-3 word phrases in article text
geekinchief
- Forum Newbie
- Posts: 2
- Joined: Fri Sep 11, 2009 2:31 pm
I'm writing a PHP script that will take a list of product names from a database (each name is probably 1-3 words long), go through the text of a blog post, and link any of the names on the list back to a page about that item. Additionally -- and this is key -- I need to make sure it DOES NOT turn the term into a link if it is already inside a link (inside <a> </a> tags).
The data table contains the name of the product (e.g. "HP Pavilion dv4t" or "iPhone") and the URL to link it to. There will be thousands of product names, though very few will be used in any given blog post (maybe 1 to 4), so I don't think it's practical to do a search and replace for each of them on each post, but I could be wrong. I suppose I could have it compare every 3-, 2-, and 1-word combination in each blog post to each product name.
What would you suggest?
peterjwest
- Forum Commoner
- Posts: 63
- Joined: Tue Aug 04, 2009 1:06 pm
Re: Automatically linking 1-3 word phrases in article text
I'm assuming the blog posts you are searching are plain text apart from formatting and link HTML tags, and that those tags are complete, e.g.:
Code: Select all
This is a blog post. Man I love <italic>iPhones</italic> they are so <bold value1=">" value2='>'>damn great</bold>. Here's a link to the <a href="http://www.iphone.com">iPhoneStore</a>. I'm going to go rub myself against my iPhone now, cya!
You'll actually be fine doing a find and replace; it's a very minor amount of string formatting, and if you use regexes it will be very fast.
A find and replace is easy. However, you need to avoid replacing the contents of existing links, you also need to avoid replacing the contents of single or double quotes (such as the attribute values in the example above), and you ALSO need to ignore links that appear within single or double quotes.
The best way to do this is probably a tokeniser. The idea of a tokeniser is to split the post up into an array of tokens, which can then be iterated through and processed individually. Split into the relevant tokens, the above post might be (with each line being a token):
Code: Select all
This is a blog post. Man I love <italic>iPhones</italic> they are so <bold value1=
"
>
"
value2=
'
>
'
>damn great</bold>. Here's a link to the
<a
href=
"
http://www.iphone.com
"
>
iPhoneStore
</a>
. I'm going to go rub myself against my iPhone now, cya!
As you iterate through the above list, you can toggle single and double quotes 'on' and 'off', and provided quotes are 'off' you can toggle <a> tags 'on' and 'off'. Whenever all of these are 'off' you can safely find and replace terms. At the end you can simply implode the list back into a string.
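A rough sketch of the toggling idea, to make it concrete. It is a variation, not exactly the token list above: rather than emitting separate quote tokens, it lets the regex swallow quoted attribute values as part of each tag, so a '>' inside quotes never ends a tag and apostrophes in ordinary text (Here's, I'm) don't flip any state. The function name transformOutsideLinks() and the callback argument are made up for illustration, and it assumes the tags in the post are complete.

```php
<?php
// Sketch only: the function name and callback signature are assumptions.
function transformOutsideLinks(string $html, callable $transform): string
{
    // A tag is '<', an optional '/', a letter, then any mix of quoted strings
    // and non-quote characters up to '>'. Quoted values are consumed whole,
    // so a '>' inside quotes does not end the tag early.
    $tagPattern = '/(<\/?[a-zA-Z](?:"[^"]*"|\'[^\']*\'|[^\'">])*>)/';

    // With DELIM_CAPTURE, even indexes hold text and odd indexes hold tags.
    $tokens = preg_split($tagPattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($tokens === false) {
        return $html; // regex failure: leave the post untouched
    }

    $insideLink = false;
    foreach ($tokens as $i => $token) {
        if ($i % 2 === 1) {
            // Tag token: toggle the link state on <a ...> and </a>.
            if (preg_match('/^<a[\s>]/i', $token)) {
                $insideLink = true;
            } elseif (preg_match('/^<\/a[\s>]/i', $token)) {
                $insideLink = false;
            }
        } elseif (!$insideLink && $token !== '') {
            // Text token outside any link: safe to find and replace.
            $tokens[$i] = $transform($token);
        }
    }

    // Glue the tokens back together into one string.
    return implode('', $tokens);
}
```

You would call it with whatever does the actual replacing, e.g. a closure that wraps product names in <a> tags. Tag tokens are never passed to the callback, so attribute values like the href above are left alone.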
geekinchief
- Forum Newbie
- Posts: 2
- Joined: Fri Sep 11, 2009 2:31 pm
Re: Automatically linking 1-3 word phrases in article text
I wonder, though, if each word should be a token rather than each line. However, I might then need to run through this three times: once for three words in a row, once for two words in a row, and once for individual words.
peterjwest
- Forum Commoner
- Posts: 63
- Joined: Tue Aug 04, 2009 1:06 pm
Re: Automatically linking 1-3 word phrases in article text
You can tokenise each word if you like, but then it would be more difficult to search for a phrase of several words. Searching through several times should be fairly easy, but make sure you don't match individual words if you've already matched the whole phrase. After each replacement you may need to re-run the tokeniser on each subsection of the post, because the replacement itself inserts a new <a> tag. A recursive function might be useful here.
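One possible alternative to several passes plus recursion, as a sketch only: build a single regex that lists every product name, longest first, and replace in one pass. Regex matches never overlap and the inserted links are not rescanned, so a short name can't match inside a longer phrase that has already been linked. The function name linkProducts() and the name => URL array are assumptions about how the table would be loaded, and matching is case-sensitive here. With thousands of names the pattern gets long, and I haven't measured how that performs. It only handles plain text, so it would be applied to the pieces of the post that are outside quotes and existing links.

```php
<?php
// Sketch only: linkProducts() and the name => URL array shape are assumptions.
function linkProducts(string $text, array $products): string
{
    if ($products === []) {
        return $text;
    }

    // Longest names first, so "HP Pavilion dv4t" is tried before "HP Pavilion".
    $names = array_map('strval', array_keys($products));
    usort($names, function ($a, $b) {
        return strlen($b) <=> strlen($a);
    });

    // Escape each name and join them into one alternation.
    $quoted = array_map(function ($name) {
        return preg_quote($name, '/');
    }, $names);

    // The lookarounds stop "iPhone" from matching inside "iPhones".
    $pattern = '/(?<![A-Za-z0-9])(?:' . implode('|', $quoted) . ')(?![A-Za-z0-9])/';

    $linked = preg_replace_callback($pattern, function ($match) use ($products) {
        $name = $match[0];
        return '<a href="' . htmlspecialchars($products[$name]) . '">' . $name . '</a>';
    }, $text);

    // preg_replace_callback() returns null on error; fall back to the input.
    return $linked ?? $text;
}
```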