File Generating CMS

This forum is not for 'how-to' coding questions but for PHP theory; it is here for those of us who wish to learn about the design aspects of programming with PHP.

Moderator: General Moderators

User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

File Generating CMS

Post by Ambush Commander »

I'm a big fan of static HTML files. They're simple, self-contained, and easily editable; they don't have much baggage associated with them. You can easily version-control them with SVN, you have complete control over their structure, and they're fast.

But they obviously have their limitations: no dynamic content, no templates, no guards to make sure you write standards-compliant code, no syntax highlighting, etc. To combat these problems, most CMSes out there have taken to routing all requests through PHP files, which generate the HTML on the fly and serve it to the browser. If you're lucky, you'll have a single PHP front controller that all the requests get redirected to.

I would like to propose a different approach: use Apache and .htaccess as your front controller, and have it serve static HTML files, calling a PHP file to generate the HTML from a source XHTML file whenever some condition is met.

Example:

You have demo.xhtml, which is your source file. It is a well-formed XHTML document, probably importing a few other namespaces which our post-processor will handle. Let's say it contains some text, uses <q> tags for quotations, contains the website's common header, and has some programming code.

Apache receives a request for http://www.example.com/demo.html, a file which does not currently exist. So Apache forwards the request along to main.php, our 404 handler and also our main application entry point. It takes the requested URI and translates it into a source file, tests whether the source file exists, and then processes the source: it canonicalizes URIs from index.xhtml to index.html, it replaces <q> tags with curly quotes, it substitutes in the header, and it runs Geshi on the programming code (all of this done with the help of DOM and XPath). Then it writes the result to the new HTML file and serves the data itself.
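A rough sketch of what such a 404 handler might look like in PHP (the file names and the transformation step here are illustrative assumptions, not the actual implementation):

```php
<?php
// main.php -- sketch of the 404 handler / entry point.
// Map the requested /demo.html onto a demo.xhtml source file.
$uri    = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$target = $_SERVER['DOCUMENT_ROOT'] . $uri;
$source = preg_replace('/\.html$/', '.xhtml', $target);

if (!preg_match('/\.html$/', $uri) || !file_exists($source)) {
    header('HTTP/1.1 404 Not Found');
    exit('Not found');
}

// Load the source and run the transformation passes with DOM/XPath
// (canonicalize URIs, curly quotes, common header, Geshi, ...).
$dom = new DOMDocument();
$dom->load($source);
// ... transformation passes over $dom go here ...
$html = $dom->saveHTML();

// Write the compiled page so Apache serves it directly next time,
// then serve this first request ourselves.
file_put_contents($target, $html);
header('HTTP/1.1 200 OK');
header('Content-Type: text/html');
echo $html;
```

The key point is the last stanza: the handler both populates the cache and answers the request, so only the very first hit pays the generation cost.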

The next time a client requests the file, Apache will direct them straight to the static HTML: no fussing about necessary. When the source XHTML gets updated, delete the compiled HTML and let the 404 handler do its magic. You could also set up a cron job to compare filemtime() between the source and compiled files, or set up a special GET flag that flushes the cache.

(End example)

This is by no means meant to replace dynamic systems; if you route everything through a PHP file, you end up with a standard front controller with a filesystem cache. But using this method, we harness the power of Apache and mod_rewrite and let them take care of caching for us, gaining the convenience of PHP processing while removing the overhead.

So... thoughts? Comments? Prior art?
User avatar
serfczar_
Forum Commoner
Posts: 34
Joined: Sun Feb 25, 2007 5:27 pm
Location: USA, Texas
Contact:

Re: File Generating CMS

Post by serfczar_ »

Ambush Commander wrote:I'm a big fan of static HTML files. [... full proposal quoted above ...]
Every server you went to would have to have this included in its Apache configuration for ONE CMS. I don't think this is a very good idea, because it would introduce security holes into Apache, compromising servers, and would eliminate the reusability part of PHP.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Deployment would be limited, yes, but there's no need to noodle around in httpd.conf. .htaccess and mod_rewrite will do the trick.
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

I like the idea; I've thought the very same thing in the past: instead of using PHP to check for a cached file, use Apache.

Here is the problem that I see:

1) HTML files littered across your root directory or subdirectories are difficult to maintain once you get anywhere past 100
2) Clients using FTP will, at some point, **** up the resulting HTML file using Dreamweaver

Also, how do you handle search requests, etc.? You could easily cache listings (realty, etc.) so long as you didn't allow dynamic filtering of results, because when you do, the whole caching thing becomes extremely complicated - maybe I'm just not understanding you totally.

I'm a CMS whore/expert...I've played with and developed dozens, so I'm certainly interested in fellow CMS'ers' opinions :)

This would limit your CMS functionality, I think, but it really depends on your target market. Sounds more like a blogging system for developers, where you can guarantee end users don't muck up your markup :P hehe

Geshi is a syntax highlighting engine (if I recall feyd pointing that out a while back?), so I assume this is developer-oriented, in which case your idea is bang on, as it's serving the market nicely. Programmers, like you've made obvious, don't like bloat; we follow KISS and want everything as perfect as possible for our own situation - which is partly why we make such bad businessmen - developing software goes against the golden rule that the customer knows best :P

Anyways, I say groovy dude, definitely pursue it, keep us updated...I'll help you here and there if I can; assisting you will be a learning experience, tinkering with that HTML engine thingy you developed :)

Cheers :)
User avatar
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Post by Christopher »

Actually, I don't think it needs to be limited. It could be done in .htaccess. Maybe I am not fully understanding your scheme, but you could probably do it with simple rewrite rules and a Front Controller that used clean URLs. All it would need to do is check a clean URL to see if the same URL with a .html extension existed. If it did, display the HTML file; if it did not, have the controller run and generate the HTML file. It would even work for paged output and searches (if they used clean URLs). The backend editors would just need to know which files/folders to delete when anything was edited. It is really just a caching system that pushes the caching logic into a simple "does the HTML file exist" rewrite rule.
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

In my experience, not all content can be cached; it depends on the complexity of the CMS itself and the content it is delivering.

I think we all have a similar understanding as to what is desired, using mod_rewrite...

I worked with a proprietary CMS a few months ago which did something like you describe; of course, I cannot for love nor money find it again using Google...I want to say QuantraCMS, but that's not it :|

Again, in my experience, using strictly Apache for delivering content is limiting (obviously, by nature of what you can do with Apache as compared to PHP) but also makes organization difficult. Stuffing content in tables is much easier to maintain. :)

It really depends on your target market I guess.

Many moons ago I hacked phpWCMS to dump its content (stored in tables) to raw HTML files so everything was naturally SEO-friendly. As the content didn't change dynamically based on end-user input, this worked. There was no cache fetching, refreshing, etc...

I simply had a generate-site button which re-created the entire site or a particular page when something changed. That's likely the easiest method, as you get the benefits of easy management with static content and no need for mod_rewrite...it's kinda like having a web-based Dreamweaver :P

Cheers :)
User avatar
Buddha443556
Forum Regular
Posts: 873
Joined: Fri Mar 19, 2004 1:51 pm

Post by Buddha443556 »

Sounds like funky caching [PHP tips and tricks (PDF) slide 23.].
User avatar
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Post by Christopher »

Buddha443556 wrote:Sounds like funky caching [PHP tips and tricks (PDF) slide 23.].
Yes, and I thought about 404 handlers after I posted above. Although his comments were from before clean URLs were the rage. Also, this idea would use only clean URLs, using mod_rewrite to create a tree of HTML files. So it is not so much an error system as a response cache system -- by other means.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Thanks for the replies. This'll be, for now, an in-house tool; I might release it officially if I find the paradigm works nicely.
Sounds like funky caching [PHP tips and tricks (PDF) slide 23.].
Yup, that's precisely it. I also dug out my old "Advanced PHP Programming" book, and it looks like George Schlossnagle came up with the same idea. The only difference is that I'm adopting a "compile the HTML" mentality, where the data doesn't come from a database but from another static document.

Actually, there are two different approaches you can take: you can use mod_rewrite and its !-f condition to test whether files exist, or you can just use Apache's ErrorDocument directive.
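For illustration, the two variants might look something like this in .htaccess (the script name main.php is just the one used in the example above):

```apache
# Variant 1: mod_rewrite -- intercept requests for .html files
# only when no such file exists on disk (!-f).
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule \.html$ /main.php [L]

# Variant 2: no mod_rewrite at all -- let Apache's 404 handling
# route missing files to the generator script.
# ErrorDocument 404 /main.php
```

One practical difference: with ErrorDocument, Apache sends a 404 status unless the script overrides it, so the script has to set the response status itself.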
Yes, and I thought about 404 after I posted above. Although his comments were before clean URLs were the rage. Also this idea would use only clean URLs using mod_rewrite to create a tree of HTML files. So it is not so much an error system as a response cache system -- by other means.
Correct. An added bonus is that you could use mod_rewrite for clean URLs.
1) HTML files littered across your root directory or subdirectories are difficult to maintain once you get anywhere past 100
Yes, that is somewhat troublesome. If you want to mass-delete all the files (i.e. clear the cache), you've got to be really careful.
2) Clients using FTP will, at some point, **** up the resulting HTML file using Dreamweaver
Since I think it'll be just developers using this, I don't think it would be too much of a problem. We could always use a CRC or MD5 to ensure no local modifications had been made to the generated HTML, although I don't know where you would stash the checksum.
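One possible place to stash it is a sidecar manifest mapping each generated file to its checksum. This is purely a sketch; the manifest filename and both function names are invented for illustration:

```php
<?php
// Record an MD5 checksum for each generated file in a manifest file
// (.checksums.php is an invented name), so later runs can detect
// hand-edits to the compiled HTML before overwriting them.
function record_checksum($file, $manifest = '.checksums.php') {
    $sums = file_exists($manifest) ? include $manifest : array();
    $sums[$file] = md5_file($file);
    file_put_contents($manifest,
        '<?php return ' . var_export($sums, true) . ';');
}

// Returns true if the file on disk no longer matches the checksum
// recorded at generation time (i.e. someone edited it by hand).
function locally_modified($file, $manifest = '.checksums.php') {
    $sums = file_exists($manifest) ? include $manifest : array();
    return isset($sums[$file]) && $sums[$file] !== md5_file($file);
}
```

The compiler would call record_checksum() right after writing a page, and refuse (or warn) before regenerating any page for which locally_modified() returns true.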
Also, how do you handle search requests, etc.? You could easily cache listings (realty, etc.) so long as you didn't allow dynamic filtering of results, because when you do, the whole caching thing becomes extremely complicated - maybe I'm just not understanding you totally.
That's tough. I don't think it's going to be an option: the page's too dynamic, and I've never been fond of MySQL's FULLTEXT search.
This would limit your CMS functionality, I think, but it really depends on your target market. Sounds more like a blogging system for developers, where you can guarantee end users don't muck up your markup :-P hehe

Geshi is a syntax highlighting engine (if I recall feyd pointing that out a while back?), so I assume this is developer-oriented, in which case your idea is bang on, as it's serving the market nicely. Programmers, like you've made obvious, don't like bloat; we follow KISS and want everything as perfect as possible for our own situation - which is partly why we make such bad businessmen - developing software goes against the golden rule that the customer knows best :-P
You hit it spot-on. This is not meant to replace things like WordPress, MediaWiki, or the other heavyweight CMSes. This is for web developers who have begun to butt heads with the limitations of pure HTML but don't want to install a full system (and have to keep it updated, too!)
Actually, I don't think it needs to be limited. It could be done in .htaccess. Maybe I am not fully understanding your scheme, but you could probably do it with simple rewrite rules and a Front Controller that used clean URLs. All it would need to do is check a clean URL to see if the same URL with a .html extension existed. If it did, display the HTML file; if it did not, have the controller run and generate the HTML file. It would even work for paged output and searches (if they used clean URLs). The backend editors would just need to know which files/folders to delete when anything was edited. It is really just a caching system that pushes the caching logic into a simple "does the HTML file exist" rewrite rule.
Yep, although searches will be a little clunky due to the speed with which they get invalidated.
In my experience, not all content can be cached; it depends on the complexity of the CMS itself and the content it is delivering.
You're absolutely right. If the output is dynamic enough, you gain no benefit from adding caching to the system, so send it straight to a PHP file.
Again, in my experience, using strictly Apache for delivering content is limiting (obviously, by nature of what you can do with Apache as compared to PHP) but also makes organization difficult. Stuffing content in tables is much easier to maintain.
Yes. The mod_rewrite fu is going to be the toughest part of an endeavor like this. However, as I noted above, mod_rewrite is not strictly necessary.

A little bit on prior art: I think imageboards generate HTML to be served, although there's no dynamic creation of pages.

Well... time to get coding.
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

Get at it then *cracks whip* :P

I love watching CMSes evolve...it's probably my favourite thing to program/develop/test/play/tweak, etc....and to think I get paid for it too :P

Anyways, for sure keep us posted...or me posted anyways... :)
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I've decided to call it XHTML Compiler. I finished the scaffolding today, and it's really big :-( 16.2 KB of code just to implement the caching system. I also had to implement an .htaccess writer, which I didn't want to do, but otherwise it's impossible to keep the mod_rewrite rules in sync. It's all procedural code... :oops: so I am going to have a fun time trying to unit-test it (help me out in that Singleton God Object thread!)

Currently, the implementation is thus:

You register the XHTML Compiler in certain directories where you want it to intercept calls to non-existent HTML files. When someone requests an HTML file that doesn't exist, mod_rewrite will redirect the request to the generation script. It will check whether there is a source XHTML file, and if so, process it and output the result both to the requested HTML file and to the user's browser. On the next call, the user gets served a real HTML file.

Fun stuff: there is an update.php script which lets you do various housekeeping work on all the cached HTML files: you can force them all to update, or you can clear them all out (sort of like a smart rm -r *.html). By default, it only updates pages whose source file has a later modification time than the cache file (i.e. the cache is stale), but from the command line you can trigger the other behaviors.
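The default staleness test described here boils down to a filemtime() comparison. A minimal sketch of that logic (not the actual update.php; the recompile step is left as a hypothetical stub):

```php
<?php
// A cache file is stale when it is missing or older than its source.
function is_stale($source, $cache) {
    return !file_exists($cache)
        || filemtime($source) > filemtime($cache);
}

// Walk the source tree and regenerate only what is out of date.
foreach (glob('*.xhtml') as $source) {
    $cache = preg_replace('/\.xhtml$/', '.html', $source);
    if (is_stale($source, $cache)) {
        // recompile($source, $cache); // hypothetical compile step
    }
}
```

Forcing a full update is then just a matter of skipping the is_stale() check, and clearing the cache is deleting every $cache file the loop finds.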

Now, the fun part: making transformations to the text.
User avatar
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Post by Mordred »

I am far from this field, and I have no practical experience on the matter, so take this with a pinch of salt. (I am still pretty sure I am right, though! ;) )

It sounds to me like you are trying to reinvent caching, which is already well defined in HTTP. Let dynamic pages specify proper caching headers, put a Squid reverse proxy in front of the server, and you're done. I think this is what big portals do anyway.

If you're committed to your idea, you might take a look at CityDesk, desktop software for dynamic-to-static content generation. You write templates and content, and the program generates static HTML and uploads it to your server.
User avatar
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

That's a very good point. Using a reverse proxy can cache dynamic content for you. In fact, we've had times in the past where we've been flaming the proxy/network admin for exactly that happening, when really we just needed to make sure we were expiring content appropriately.
User avatar
Buddha443556
Forum Regular
Posts: 873
Joined: Fri Mar 19, 2004 1:51 pm

Post by Buddha443556 »

Mordred wrote:If you're committed to your idea, you might take a look at CityDesk, desktop software for dynamic-to-static content generation. You write templates and content, and the program generates static HTML and uploads it to your server.
That's what I've been doing with Perl. Though I like the idea of funky caching, I've never wanted to (or had the guts to) try it on a shared server. Generating hundreds of pages dynamically is what I'm trying to avoid with static pages in the first place. I certainly would like to hear some result of using the XHTML Compiler though.

Have you looked into the possible race condition? I vaguely remember reading about that somewhere.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

It sounds to me like you are trying to reinvent caching, which is already well defined in HTTP. Let dynamic pages specify proper caching headers, put a Squid reverse proxy in front of the server, and you're done. I think this is what big portals do anyway.
That's a very good point. Using a reverse proxy can cache dynamic content for you.
Yeah, I think the exact same effect could be achieved with a few smart headers and a bunch of Squid proxies. So... I'll appeal to the fact that not everyone has a dedicated server and a reverse proxy to do something like that. :-P

Getting the cache headers right is actually quite difficult; most open-source web applications get it all wrong. It's quite disappointing, but with Apache all of this gets done for you: Not-Modified responses and ETags. Mmm...
If you're committed to your idea, you might take a look at CityDesk, desktop software for dynamic-to-static content generation. You write templates and content, and the program generates static HTML and uploads it to your server.
Ooh, Fog Creek Software. Bound to be good. I do server-side compilation; CityDesk does client-side.
Have you looked into the possible race condition? I vaguely remember reading about that somewhere.
From the reading I've done, it seems that it's not that big of a problem. On a heavily accessed website, the first time the page has to be generated, all concurrent requests will try to generate the page until one of them finishes and caches it. It might get overwritten several times in the process. But that's only for high-contention websites.
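The usual way to shrink that window is an atomic write: build the page into a temporary file, then rename() it into place, so no request ever reads a half-written cache file. A sketch under the assumption that the temp file and target live on the same filesystem (where rename() is atomic on POSIX systems):

```php
<?php
// Write the compiled HTML atomically. Competing generators may still
// do redundant work, but readers never observe a partial file:
// they see either the old page, nothing (404 -> regenerate), or the
// complete new page.
function cache_atomically($target, $html) {
    $tmp = tempnam(dirname($target), 'xhtmlc');
    file_put_contents($tmp, $html);
    chmod($tmp, 0644);           // tempnam() creates 0600; make it servable
    rename($tmp, $target);       // atomic on the same filesystem
}
```

This doesn't eliminate the duplicate-generation work under contention, but it does remove the correctness hazard of serving a truncated page.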