Preventing content/url duplication?

Not for 'how-to' coding questions but for PHP theory instead; this forum is here for those of us who wish to learn about the design aspects of programming with PHP.

Moderator: General Moderators

JAB Creations
DevNet Resident
Posts: 2341
Joined: Thu Jan 13, 2005 6:44 pm
Location: Sarasota Florida
Contact:

Preventing content/url duplication?

Post by JAB Creations »

I found it a little surprising that I'm able to pull up the following content at both of these URLs...

viewtopic.php?f=6&t=94788&p=527853
viewtopic.php?f=6&p=527853&t=94788

...or perhaps I've applied the right idea to the wrong scenario? Do Google and other search engines consider that duplicated content or does this not apply to HTTP queries?
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: Preventing content/url duplication?

Post by Chris Corbyn »

They're the same URL? All you've done is re-ordered the parameters in the query portion of the URL.
JAB Creations

Re: Preventing content/url duplication?

Post by JAB Creations »

After I posted I came across this duplicate content page on Google. I don't think parameter order would count as duplicate content. It seems more like this: if you don't implement an Apache rewrite (via .htaccess) to choose either http://example.com/ or http://www.example.com/, you risk creating duplicated content, for example.
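For the www/non-www case specifically, a minimal .htaccess sketch of the kind of rule being described (assuming mod_rewrite is available; example.com stands in for the real domain):

```apache
RewriteEngine On
# Send bare-domain requests to the www host with a permanent redirect,
# so search engines only ever see one canonical hostname.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The 301 status tells crawlers the bare-domain form is permanently superseded, so only the www form gets indexed.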
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Re: Preventing content/url duplication?

Post by Ambush Commander »

I'm not sure if Google canonicalizes GET query strings, but an easy way to canonicalize URLs yourself is to check QUERY_STRING for the ordering of variables, and redirect the user to the "real" URL if necessary.
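A hypothetical PHP sketch of that check (the sorted-key ordering and the 301 redirect are my assumptions, not this forum's actual code):

```php
<?php
// Canonicalize the query string by sorting parameters into one fixed
// order, then 301-redirect when the request used a different order.
function canonical_query(string $queryString): string
{
    parse_str($queryString, $params); // "t=94788&f=6&p=527853" -> array
    ksort($params);                   // impose one fixed key order
    return http_build_query($params);
}

$incoming  = $_SERVER['QUERY_STRING'] ?? '';
$canonical = canonical_query($incoming);

if ($incoming !== '' && $incoming !== $canonical) {
    // Send visitors and crawlers to the single "real" URL.
    header('Location: ' . $_SERVER['SCRIPT_NAME'] . '?' . $canonical, true, 301);
    exit;
}
```

Sorting by key is just one arbitrary-but-stable choice; any fixed ordering works as long as every internal link uses the same one.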
allspiritseve
DevNet Resident
Posts: 1174
Joined: Thu Mar 06, 2008 8:23 am
Location: Ann Arbor, MI (USA)

Re: Preventing content/url duplication?

Post by allspiritseve »

There's also Google's new canonical tag that should help with these types of problems.
JAB Creations

Re: Preventing content/url duplication?

Post by JAB Creations »

Thanks allspiritseve, I don't think I had found that. Though I've been using the base element (I use PHP to determine the base href based on the domain name, such as my site or localhost), so every single link on my site uses www, for example.

Ambush Commander, yeah, I considered that, though it seems to be a low priority, if that. It's easy enough to explode the query string and ensure that each parameter sits at its required position to make the URL valid. I do think Google would trust a site's internal links over external links, at least to a reasonable extent.
Chris Corbyn

Re: Preventing content/url duplication?

Post by Chris Corbyn »

allspiritseve wrote:There's also Google's new canonical tag that should help with these types of problems.
We recently had an SEO meeting at work and I learned about this. Sounds like a useful concept.
allspiritseve

Re: Preventing content/url duplication?

Post by allspiritseve »

Chris Corbyn wrote:
allspiritseve wrote:There's also Google's new canonical tag that should help with these types of problems.
We recently had an SEO meeting at work and I learned about this. Sounds like a useful concept.
Yeah, I guess Yahoo and MSN agreed to support the tag as well (at some point, dunno when).
JAB Creations

Re: Preventing content/url duplication?

Post by JAB Creations »

There's no such thing as a "tag" in XHTML, they're called elements.

Yahoo has supported the class robots-nocontent. Now why the heck would I want to start using all sorts of non-standard XHTML elements that don't exist in any established standard? :|
JAB Creations

Re: Preventing content/url duplication?

Post by JAB Creations »

Oh it's a link element?

Code:

<link href="http://example.com/page.html" rel="canonical" />
allspiritseve

Re: Preventing content/url duplication?

Post by allspiritseve »

JAB Creations wrote:There's no such thing as a "tag" in XHTML, they're called elements.
Uh... ok?

Whatever you call it, you use

Code:

<link rel="canonical" href="http://www.google.com" />
which is a tag/element that already exists, so not necessarily "non-standard".
Chris Corbyn

Re: Preventing content/url duplication?

Post by Chris Corbyn »

Yeah it's perfectly valid and a <link> is exactly the right place for it.
JAB Creations

Re: Preventing content/url duplication?

Post by JAB Creations »

I'm very strict about standards. There was a poor guy in another thread who wasted hours upon hours because he was missing a quote. By using application/xhtml+xml, while I wouldn't be made aware of low-priority validation errors like duplicate IDs, a missing quote would break the page, give me an error message, and I'd have the problem solved in half a minute at most. So by holding ourselves to standards we ensure consistency...and we save a whole lot of time in the long term. That's why I make a huge effort to use the correct terminology. It also means I end up in a lot of unique situations asking questions, so I try to leave breadcrumbs for those who search using standards-compliant terminology.

Google
http://google.com/support/webmasters/bi ... wer=139394

MSDN
http://blogs.msdn.com/webmaster/archive ... ssues.aspx

Yahoo
http://ysearchblog.com/2009/02/12/fight ... ur-quiver/

I can see how the canonical link element will be useful for query-string-dependent pages such as my forums.

However, good practices can generally avoid the issue. A few Apache rewrite rules can make a world of difference, as can the XHTML base element, for example. All of my site's anchors and images use the base element as the first half of the URL and the anchor's or image's href attribute value as the second half. Having a consistent way to name files helps too.

This is definitely something I'll end up implementing in the 29th version of my site. Thanks for the heads up guys. :)
allspiritseve

Re: Preventing content/url duplication?

Post by allspiritseve »

JAB Creations wrote:I'm very strict about standards.
Being strict about standards is fine... colloquial language has its place too, though. Even the W3C uses the word tag:
W3C wrote:Essentially this means that all elements must either have closing tags or be written in a special form
JAB Creations wrote:I can see how the canonical link element will be useful for HTTP query dependent pages such as my forums.
Duplicate content happens in other situations as well, even with clean URLs (for instance, /blog/latest/post-name/ and /blog/archives/post-name/), to make up a trivial example.
JAB Creations wrote:However good practices can generally avoid the issue. A few Apache scripts can make a world of difference and using the base XHTML element in example. All my site's anchors and images all add the base element as the first half of the URL and then adds the anchor|image's href attribute's value as the second half. Having a consistent way to name files, etc.
I don't see how the base tag solves the issue. Can you elaborate?
JAB Creations

Re: Preventing content/url duplication?

Post by JAB Creations »

The base element actually rocks for a couple of reasons.

First off it makes running the same site both locally and live a snap! I have two PHP class variables (base1 and base2).

For example my current project has the following values for localhost...
base1 = http://localhost
base2 = /Version%202.9.A.3/

In a live environment it will end up being...
base1 = http://www.example.com
base2 = /

Now take an anchor's href or an image's src attribute value...
images/logo.gif

Well, ignoring PHP and looking directly at the XHTML output, the address simply adds up as...
base1 . base2 . src

So...
http://www.example.com/images/logo.gif

The only time I use absolute URLs is when I link externally.

But anyway, the base element is most useful to me for being able to run the same site in any environment regardless of the various file paths. By using these practices, my site's URLs are pretty clean.
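The two-part base idea above can be sketched in PHP like this (base1/base2 and the localhost/live values come from the post; the class wrapper and the example.com hostname are my assumptions):

```php
<?php
// Pick the base-href halves from the host the site is running on,
// so the same code works on localhost and on the live domain.
class SiteBase
{
    public $base1;
    public $base2;

    public function __construct(string $host)
    {
        if ($host === 'localhost') {
            $this->base1 = 'http://localhost';
            $this->base2 = '/Version%202.9.A.3/';
        } else {
            $this->base1 = 'http://www.example.com';
            $this->base2 = '/';
        }
    }

    // Emitted once in <head>; every relative href/src then resolves
    // against it, e.g. images/logo.gif -> base1.base2.'images/logo.gif'
    public function element(): string
    {
        return '<base href="' . $this->base1 . $this->base2 . '" />';
    }
}

$base = new SiteBase($_SERVER['HTTP_HOST'] ?? 'localhost');
echo $base->element();
```

With the base element emitted once per page, every relative link automatically points at the www host in production, which sidesteps the www/non-www duplication discussed earlier.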

I've been having a lot of fun building my site's new CMS and it's pretty nice timing to hear about this, as it'll be a snap to implement. My site has a new PHP CMS class that handles file paths, including the base element, and decides which HTTP status code I should send in the headers (which thankfully Apache now logs to the server access log). So if the page is a 304 or 200 I'll serve the canonical link element; however, if it's not a 304 or 200, I change the robots meta element to "NOINDEX, NOFOLLOW".
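That status-based rule can be sketched as follows (head_markup() and the URL are illustrative names, not the poster's actual CMS API):

```php
<?php
// Serve a canonical link for 200/304 responses; for anything else
// (404s, 500s, etc.) ask robots not to index or follow the page.
function head_markup(int $status, string $canonicalUrl): string
{
    if ($status === 200 || $status === 304) {
        return '<link rel="canonical" href="' . $canonicalUrl . '" />';
    }
    return '<meta name="robots" content="NOINDEX, NOFOLLOW" />';
}

echo head_markup(200, 'http://www.example.com/forum/');
```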