CMS page lookup (301 | 404)???

alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

So I'm busily working away on my own CMS and I reach the point where it's time to implement page lookup.

Obviously speed is a factor, so using PKID like I have in times past:

Code: Select all

index.php?pageid=12
Would work awesome. Unfortunately, we live in a day and age (and I read enough forums, blogs, etc.) where non-SEF URIs are seen as amateurish, so I need a way to construct (or rather, let my users construct) SEF/SEO URIs from the page being displayed.

Some useful page fields:

- Title
- Alias
- Section

There are other fields, like keywords, etc., but they are potentially too verbose to be used in URIs.

If I cannot rely on a PKID, then my system must intelligently discover the PKID of a page given any number of URI parameters like the above.

Alternatively, I could make the PKID the extension, or put it somewhere similar in the URI, so that it can be used for a direct lookup. The added bonus I see to using a PKID is that, without one, a cache miss means the system has to use the title or keywords and/or a combination of many fields to try and determine the PKID.

This is fine (although quite complex once I get above a few parameters -- I'm thinking a user plugin here would be best?) except that (unlike the PKID) the title and alias and other details can be changed by the user.

So what happens when the user updates their web pages and the URIs change from say:

Code: Select all

domain.com/articles/101-web-marketing-best-practices.html
To something a little more accurate as their content was updated to include more focus on SEO:

Code: Select all

domain.com/articles/internet-marketing-seo-best-practices.html
Now say someone does a Google search and finds the link to the above. They click the link, and my system parses the URI and tries to find ONE page matching those keywords. Unfortunately the keywords/title/description/content have changed, and the page is therefore not the only one returned.

Not strictly a bad thing, but had the URI included the PKID, then regardless of whether titles and meta data changed, clicking on an old link would have redirected the user to the intended page, sorta -- albeit an updated page.

So is it the same page?

Perhaps the best way to address this issue is to keep an internal log of pages which have had their URI/permalinks changed, store the translation in a lookup table, and use that to find pages which have moved...

I suppose this requires asking some SEO questions...like which affects rankings more:

1. Having old URI redirect to a new URI and send 301
2. Having old URI show as "Page not found" and send 404 with a listing of article links which are suitable matches
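
To make the two options concrete, here is a minimal sketch of a resolver that answers 200 for a current URI, 301 for a URI found in a history lookup, and 404 otherwise. The array contents, the page id 12, and the function name `resolveRequest` are all made up for illustration; a real system would load both maps from wherever the CMS stores them.

```php
<?php
// Hypothetical data: current permalinks and a history of retired ones.
$currentPages = [
    '/articles/internet-marketing-seo-best-practices.html' => 12,
];
$movedUris = [
    '/articles/101-web-marketing-best-practices.html'
        => '/articles/internet-marketing-seo-best-practices.html',
];

/**
 * Decide how to answer a request path.
 * Returns [statusCode, pageId or null, redirectTarget or null].
 */
function resolveRequest(string $path, array $currentPages, array $movedUris): array
{
    if (isset($currentPages[$path])) {
        return [200, $currentPages[$path], null];   // page lives here
    }
    if (isset($movedUris[$path])) {
        return [301, null, $movedUris[$path]];      // page moved, redirect
    }
    return [404, null, null];                       // never heard of it
}

[$status, $pageId, $target] = resolveRequest(
    '/articles/101-web-marketing-best-practices.html',
    $currentPages,
    $movedUris
);
if ($status === 301) {
    // In a real front controller: header('Location: ' . $target, true, 301); exit;
    echo "301 -> $target\n";
}
```

The 404 branch is where a listing of suggested articles could be rendered instead of a bare error page.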

What do you think?

Lastly, I suppose I could implement an admin feature where the user enters their permalink manually (like WordPress) and, once created, it cannot be changed; if it is changed, then record a redirect in an .htaccess file or something?

Cheers,
Alex
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: CMS page lookup (301 | 404)???

Post by Eran »

Obviously, a redirect is better than a 404, since for incoming links, if Google can't find the page that is linked, it can't give you any score for it.
There is a third option which you have not mentioned: simply loading the content for the old link as if it still exists, if you have found a matching URI in your history table.
The easiest way I can think of for this is to have a URI-to-page table, which can have any number of URIs for any given page, with one of those marked as the 'active' URI (the one currently registered via the CMS). A page would be retrieved if any of its URIs is found from an incoming link.
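
As a rough sketch of that table, here is an in-memory stand-in written as a small PHP class. The class and method names are hypothetical, and a real version would presumably be a database table with an index on the URI column; typed properties assume PHP 7.4 or later.

```php
<?php
/** In-memory stand-in for a URI-to-page table (hypothetical class name). */
final class UriTable
{
    /** @var array<string,int> every known URI => page id */
    private array $pageByUri = [];

    /** @var array<int,string> page id => currently active URI */
    private array $activeUri = [];

    /** Register a URI for a page and mark it as the active one. */
    public function setActiveUri(int $pageId, string $uri): void
    {
        $this->pageByUri[$uri] = $pageId;   // older URIs stay in the table
        $this->activeUri[$pageId] = $uri;
    }

    /** Return the page id for any known URI, old or active, or null. */
    public function findPage(string $uri): ?int
    {
        return $this->pageByUri[$uri] ?? null;
    }

    /** True only if the URI is the one currently registered for its page. */
    public function isActive(string $uri): bool
    {
        $pageId = $this->findPage($uri);
        return $pageId !== null && $this->activeUri[$pageId] === $uri;
    }
}

$table = new UriTable();
$table->setActiveUri(12, '/articles/101-web-marketing-best-practices.html');
$table->setActiveUri(12, '/articles/internet-marketing-seo-best-practices.html');
```

Both URIs now resolve to page 12, and `isActive()` tells the caller whether it is looking at the current permalink or a retired one.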

Entering the permalink manually (with an auto-generated one) is nice to have regardless.

Re: CMS page lookup (301 | 404)???

Post by alex.barylski »

Eran wrote:Obviously, a redirect is better than a 404, since for incoming links, if Google can't find the page that is linked, it can't give you any score for it.
Obviously not "obvious" as I didn't think of that. :P Makes sense though.
Eran wrote:There is a third option which you have not mentioned: simply loading the content for the old link as if it still exists, if you have found a matching URI in your history table.
I think that is the same principle as the 301, but without the redirect or possibly sending headers.

Basically the problem is, I would need to record changes made to pages in the admin (which is not a problem) in a history file/lookup of some sort. Nothing crazy complex, just a table to resolve old URIs to new ones. This is what you're saying, if I understand correctly?

So if a URI was requested that could not be resolved by the page delivery engine, it would be assumed the URI changed and a lookup would happen, which would resolve the old URI to a new URI?

The more I think about it, that is probably best for SEO and usability...I'll have to think about it some more.

See, in this regard I am copying WordPress where I think it excels...trying to give the user of the CMS the direct ability to manipulate the URI/permalink.

I just played with WP a bit and I see now how it handles dealing with permalinks...seems you can only update permalinks when you include %postname% in the URI.
Eran wrote:The easiest way I can think of for this is to have a URI-to-page table, which can have any number of URIs for any given page, with one of those marked as the 'active' URI (the one currently registered via the CMS). A page would be retrieved if any of its URIs is found from an incoming link.
I hear you...only that table could potentially explode in size...every time someone modifies the title or whatever attribute they choose to have as a composite in the URI. So if they change the title or alias or keyword, which they should be able to, the URI table would have one more entry. Each time that table is loaded, memory is consumed that otherwise wouldn't be required if I could programmatically figure it out.

To give a better idea as to what problems I face (I realize this post is vague and could be interpreted a hundred different ways), a page consists of several attributes:

1. Title (shown in <title>)
2. Alias (A short hand name for the page - similar to filename but not quite)
3. Keywords (meta -- useful for blogging tag clouds, etc)
4. Date Published, Expires, etc.
5. Description (meta)
6. Content

And maybe a few others.

Only a limited number of those fields are capable of being used in a URI (obviously looking up a page by its contents is impossible via a GET request). For that reason I assume I would only offer:

1. Title
2. Keywords
3. Alias
4. PKID

The problem is, WordPress supports having dates, etc. as part of the URI. These are probably not technically used by the lookup engine, but instead serve as part of the user interface and are purely aesthetic.

Code: Select all

domain.com/2009/10/23/some-title.html
Unless you rely on the fact that at most one post will be made per day, using a date as a lookup is not appropriate. You need something specific, like a keyword hash or something.
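
A minimal sketch of that idea, assuming the date segments are purely decorative and only the trailing slug is used for the lookup. The function name and the exact pattern are assumptions for illustration; this is not a claim about how WordPress itself resolves permalinks.

```php
<?php
/**
 * Extract the slug from a path such as "2009/10/23/some-title.html".
 * The leading yyyy/mm/dd/ part is optional and ignored for lookup purposes.
 * Returns null when the path does not look like a permalink at all.
 */
function slugFromPermalink(string $path): ?string
{
    $pattern = '#^(?:\d{4}/\d{2}/\d{2}/)?([a-z0-9-]+)\.html$#';
    if (preg_match($pattern, trim($path, '/'), $m) !== 1) {
        return null;
    }
    return $m[1];
}

$slug = slugFromPermalink('/2009/10/23/some-title.html');
```

The slug (or a hash of it) then becomes the one specific key the lookup depends on, so two posts on the same day are not a problem.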
Eran wrote:Entering the permalink manually (with an auto-generated one) is nice to have regardless.
I agree...I like how WP does it although it seems to just copy the title initially and then let you edit the permalink manually after that.
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Re: CMS page lookup (301 | 404)???

Post by matthijs »

Why try to compose URLs from a different set of fields? I would consider the URL as just another field, just like WordPress does (it's called the "slug" there). Of course, as a default you could start with the title (as WP does).

I don't think dates are necessary in WP. You can leave them out if you want.

I do believe URLs are very important in a CMS. When I can't have full control over them I don't want to use the CMS. I hate it when a system:
- limits me to section/page
- always has index.php/% in the first part
- always uses the title. What if you have a very long, weird title?

But it sure is a difficult part to design right. The more flexible you make it, the more logic you need to parse the URLs.

Re: CMS page lookup (301 | 404)???

Post by alex.barylski »

matthijs wrote:- limits me to section/page
That was just an example...WordPress has a flexible URI/permalink approach; I'm trying to make my own just a little more robust, while remaining user friendly.
matthijs wrote:But it sure is a difficult part to design right. The more flexible you make it, the more logic to parse the urls you need
In this way, the system is more like a framework. There is no nasty routing code for custom URIs; you simply describe the URI in a high-level, human-readable format and the system does the rest and feeds you the results.

As odd as it may sound, this flexibility is what is causing me issues in emulating WP. Their code is tightly coupled to the interface and vice versa, and many assumptions are made; hard to explain what I mean exactly without going into novel details.

For instance, WP can know for sure that a title field is a title field. In my system, because the URI is parsed autonomously (for lack of a better word), the system only knows there is a title field, but not exactly what the title field is or does. For that matter, it may not even be named title but something altogether different, like header or title-info. While not impossible, it requires additional steps in the admin area if I am to allow users to change their permalinks. It also probably means introducing explicit rules as to how and/or what you name URI segments.
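
To illustrate what "describing the URI in a human-readable format" might look like, here is a small sketch that compiles a template with named segments into a regular expression. The `{name}` placeholder syntax and both function names are invented for this example and are not taken from any actual system; the router only learns the segment names, not what they mean, which is exactly the limitation described above.

```php
<?php
/**
 * Compile a template such as "{section}/{alias}.html" into a regex.
 * The {name} placeholder syntax is an assumption made for this sketch.
 */
function compileTemplate(string $template): string
{
    // Split into literal text and {placeholder} tokens, keeping both.
    $parts = preg_split(
        '/(\{[a-z_]+\})/',
        $template,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    $regex = '';
    foreach ($parts as $part) {
        if (preg_match('/^\{([a-z_]+)\}$/', $part, $p) === 1) {
            // A segment matches anything up to the next slash or dot.
            $regex .= '(?P<' . $p[1] . '>[^/.]+)';
        } else {
            $regex .= preg_quote($part, '#');
        }
    }
    return '#^' . $regex . '$#';
}

/** Return the named segments of $path, or null if it does not match. */
function matchTemplate(string $template, string $path): ?array
{
    if (preg_match(compileTemplate($template), $path, $m) !== 1) {
        return null;
    }
    // preg_match returns numeric and named keys; keep only the named ones.
    return array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}

$segments = matchTemplate(
    '{section}/{alias}.html',
    'articles/101-web-marketing-best-practices.html'
);
```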

Cheers,
Alex

Re: CMS page lookup (301 | 404)???

Post by Eran »

alex.barylski wrote:I hear you...only that table could potentially explode in size...
Actually, I don't think that is a real concern for 99% of the cases. If you think about it, most people/companies would have a few hundred content pages at the very most, and the URL is not edited that often. Let's assume an extreme case in which every page had its URL edited an average of 10 times: that's only a few thousand rows. With an index, this table is insignificant for MySQL. It can grow to several million rows easily and still be very fast.
allspiritseve
DevNet Resident
Posts: 1174
Joined: Thu Mar 06, 2008 8:23 am
Location: Ann Arbor, MI (USA)

Re: CMS page lookup (301 | 404)???

Post by allspiritseve »

PCSpectra wrote:So if a URI was requested that could not be resolved by the page delivery engine, it would be assumed the URI changed and a lookup would happen, which would resolve the old URI to a new URI?
It would be better for SEO if you had a redirect.
PCSpectra wrote:I hear you...only that table could potentially explode in size...every time someone modifies the title or whatever attribute they choose to have as a composite in the URI. So if they change the title or alias or keyword, which they should be able to, the URI table would have one more entry. Each time that table is loaded, memory is consumed that otherwise wouldn't be required if I could programmatically figure it out.
That's why you shouldn't use the title directly in the URL. I have a different field called name for my pages that is automatically generated when the page is created (based on the title, all lowercase, replacing spaces with underscores). It can be edited separately from the title though.
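
Something along these lines would generate that name field. The function name is hypothetical, and stripping punctuation is an extra assumption of this sketch beyond the lowercase-and-underscores rule described above.

```php
<?php
/**
 * Derive a default "name" for a page from its title:
 * lowercase, punctuation dropped, spaces replaced with underscores.
 */
function nameFromTitle(string $title): string
{
    $name = strtolower(trim($title));
    // Assumption: drop anything that is not a letter, digit, space, underscore or hyphen.
    $name = preg_replace('/[^a-z0-9 _-]+/', '', $name);
    // Collapse each run of spaces into a single underscore.
    return preg_replace('/\s+/', '_', $name);
}

$name = nameFromTitle('101 Web Marketing: Best Practices!');
```

Because the result is stored in its own field at creation time, later edits to the title leave the URL alone unless the user changes the name on purpose.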

I hadn't thought about a history table for url changes... that's a good idea, as long as it redirects and doesn't just load the new page.

Re: CMS page lookup (301 | 404)???

Post by alex.barylski »

Eran wrote:With an index, this table is insignificant for MySQL
This is true, but I want the router to be as portable as possible, and introducing a dependency on MySQL makes for a less flexible design. I hate dependencies...so even if I injected an AdoDB or similar abstraction layer, the code would be dependent on SQL and an abstraction class.

Alternatively I could possibly pass in an adapter instead of the SQL classes (which is best anyway); however, the database connection is not established at this point of execution, as connections are lazy-loaded objects in the controllers themselves.

Actually, what I could do is implement a binary format where the URI hashes are stored in the first X number of fixed width records and point to the offset of the new URI. Lookups would be fast as hell, but updates would be slower, which is fine considering updates rarely happen.
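
A rough sketch of one way such a format could look, kept in a string here so it runs without touching the filesystem. The layout is entirely an assumption for illustration: a 4-byte record count, then fixed 24-byte records (16-byte raw MD5 of the old URI, 4-byte offset, 4-byte length) sorted by hash, then the new URIs packed back to back. Sorting the records is what makes a binary search possible.

```php
<?php
const RECORD_SIZE = 24; // 16-byte md5 + 4-byte offset + 4-byte length

/** Build the blob from an [oldUri => newUri] map (slow path, rare). */
function buildRedirectBlob(array $map): string
{
    $records = [];
    foreach ($map as $old => $new) {
        $records[] = [md5((string) $old, true), $new];
    }
    // Sort by raw hash bytes so lookups can binary-search the index.
    usort($records, fn($a, $b) => strcmp($a[0], $b[0]));

    $dataStart = 4 + count($records) * RECORD_SIZE;
    $index = '';
    $data = '';
    foreach ($records as [$hash, $new]) {
        $index .= $hash . pack('NN', $dataStart + strlen($data), strlen($new));
        $data .= $new;
    }
    return pack('N', count($records)) . $index . $data;
}

/** Binary-search the fixed-width index (fast path, every request). */
function lookupRedirect(string $blob, string $oldUri): ?string
{
    $count = unpack('N', $blob)[1];
    $needle = md5($oldUri, true);
    $lo = 0;
    $hi = $count - 1;
    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        $record = substr($blob, 4 + $mid * RECORD_SIZE, RECORD_SIZE);
        $cmp = strcmp($needle, substr($record, 0, 16));
        if ($cmp === 0) {
            $f = unpack('Noffset/Nlength', substr($record, 16));
            return substr($blob, $f['offset'], $f['length']);
        }
        if ($cmp < 0) {
            $hi = $mid - 1;
        } else {
            $lo = $mid + 1;
        }
    }
    return null;
}

$blob = buildRedirectBlob([
    '/old-a.html' => '/new-a.html',
    '/old-b.html' => '/new-b.html',
]);
```

Against a real file, the same lookup would use `fseek()` and `fread()` on each probe instead of `substr()`, so only a handful of records are ever read into memory.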
allspiritseve wrote:That's why you shouldn't use the title directly in the URL
I'm thinking maybe you're right, but then that applies to all other fields, and if that is the case then I should maybe include a "permalink" field which is set at page creation and not mutable. That doesn't make for very flexible URIs -- then again maybe it does. :idea:

Perhaps having content fields being reflected in the URI isn't the best design...hmmm...that's a very interesting point now that you make me think about it. Using a custom permalink, I can let people totally customize the URI structure and be more like WordPress.

Only I need to support static URIs as well so I can do something like:

Code: Select all

this-is-a-permalink.html 
 
blog/rss
blog/list
blog/tag/keyword
Sorry, not really relevant to what is being discussed here but I needed to ramble a bit and hash out some more ideas.

Cheers,
Alex

Re: CMS page lookup (301 | 404)???

Post by Eran »

alex.barylski wrote:This is true, but I want the router to be as portable as possible, and introducing a dependency on MySQL makes for a less flexible design. I hate dependencies...so even if I injected an AdoDB or similar abstraction layer, the code would be dependent on SQL and an abstraction class.
I don't see how this is relevant for anything... I wrote MySQL, but any decent modern database should be able to handle a table of several thousand rows with ease, as long as it's properly indexed.

Re: CMS page lookup (301 | 404)???

Post by alex.barylski »

Eran wrote:I don't see how this is relevant for anything... I wrote MySQL, but any decent modern database should be able to handle a table of several thousand rows with ease, as long as it's properly indexed.
It's the dependency more than anything, MySQL or SQLite or whatever. Ideally I want to keep the router as independent as possible, but injecting an adapter for table lookups is a decent solution and probably best. The only problem is, I don't have any DB connections established at this point, so MySQL, etc. would be out of the question, at least if I wanted to share the same DB with the rest of the application.
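
One possible way around the "no connection yet" problem, sketched under the assumption that the router only needs a tiny lookup interface: inject an adapter that builds its backend on first use. All of the names here are hypothetical, and the array-backed class merely stands in for a database-backed one.

```php
<?php
/** What the router needs, and nothing more (hypothetical interface). */
interface UriLookup
{
    public function pageIdFor(string $uri): ?int;
}

/** Adapter that defers building its backend until the first lookup. */
final class LazyUriLookup implements UriLookup
{
    /** @var callable factory returning a UriLookup */
    private $factory;
    private ?UriLookup $inner = null;

    public function __construct(callable $factory)
    {
        $this->factory = $factory;
    }

    public function pageIdFor(string $uri): ?int
    {
        if ($this->inner === null) {
            $this->inner = ($this->factory)(); // connect only when needed
        }
        return $this->inner->pageIdFor($uri);
    }
}

/** Array-backed lookup, standing in for a database-backed one. */
final class ArrayUriLookup implements UriLookup
{
    private array $map;

    public function __construct(array $map)
    {
        $this->map = $map;
    }

    public function pageIdFor(string $uri): ?int
    {
        return $this->map[$uri] ?? null;
    }
}

$connections = 0;
$lookup = new LazyUriLookup(function () use (&$connections): UriLookup {
    $connections++; // a real factory would open the DB connection here
    return new ArrayUriLookup(['/about.html' => 7]);
});
```

The router depends only on `UriLookup`, so the framework core stays free of SQL, and requests that resolve without the history lookup never trigger the factory.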

I could probably use a custom database like Firebird, or whatever that DB engine is that specializes in fixed-length records.

Again that creates a dependency on a third party library which is probably not available on shared hosts, so a custom solution or a second connection would be required.

Re: CMS page lookup (301 | 404)???

Post by Eran »

Just use a database abstraction library, there are plenty of those around... and I was referring to your worries on performance more than anything else.

Re: CMS page lookup (301 | 404)???

Post by alex.barylski »

Eran wrote:and I was referring to your worries on performance more than anything else.
Haha...I think we're getting confused here.

I was only concerned about performance because I was picturing something like an INI lookup table, which once past 1000 records would significantly impact memory and possibly performance. Usually if it's less than 100 records I stick with native files; yes, a DB would address those issues.

My primary concern with using a database is the dependency introduced. Whether it be the mysql_* API or the AdoDB API, it's still an outside dependency, unless I use some kind of adapter, in which case the dependency on third-party software is removed but there still exists a dependency.

Some dependencies are better than others, I guess is how I look at it...it's just a matter of selecting one and going with it. The lookup table is not a bad idea and its positives (personally) outweigh the negatives of introducing a dependency...

Cheers,
Alex

Re: CMS page lookup (301 | 404)???

Post by Eran »

I'm somewhat confused - you are considering implementing a real CMS that you will be selling to customers without using a database? What did you consider, implementing your own file-based database? Do you really think that developers/customers who need to maintain this kind of CMS would worry more about external database dependencies than about maintaining your own custom file-based substitute for a database? I can tell you right now which I would prefer.

Using the commonly available tools is not considered a dependency in my opinion... what's next, would PHP be considered a dependency?

Re: CMS page lookup (301 | 404)???

Post by alex.barylski »

Of course there would be database support. :P

However, there is no database connection available at the time of routing and seeing as only a CMS would ever really need such a functionality, it doesn't make much sense to have something like that in the core of the framework, at least not according to my vision.
Eran wrote:What did you consider, implementing your own file-based database?
I'm not convinced of anything yet, I still have to think about it, but a server that needs connection data is not as portable as I would ideally like the system to be. Applications would certainly use a database, but they would provide their own connection(s) or at least establish their own connections.
Eran wrote:Do you really think that developers/customers who need to maintain this kind of CMS would worry more about external database dependencies than about maintaining your own custom file-based substitute for a database? I can tell you right now which I would prefer.
Right now, I'm focusing on the framework of the CMS, not so much the user experience, but I agree, the end user wouldn't care, nor would they even know. I can tell you that when I work with a framework and I have to set up dependencies like a database or whatever, I'm usually turned off unless it offers significant advantages.

My focus right now is making sure parts can be added and removed by simply adding/removing files...no tinkering with DBs...not to say I might never go down that path, but for now I'm seeking alternative solutions.
Eran wrote:Using the commonly available tools is not considered a dependency in my opinion... what's next, would PHP be considered a dependency?
Depends on perspective I guess...if Zend required you to store configuration data in a database, would you use it? I certainly wouldn't.

There are times when a database makes sense, from what I understand of my current implementation, I don't think using a DB would make total sense, but the design is still very immature. I might have it so components can override the default routing and therefore move the dependency from the framework into a CMS component, which is really the only system I can think of that might need such mutable permalink functionality. A forum for example, probably wouldn't.

I guess I have a very wide view of what "dependency" means. PHP itself personally, would not be considered a dependency worth worrying about in this case. :P

Cheers,
Alex
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: CMS page lookup (301 | 404)???

Post by josh »

I thought this is why they invented datamappers