File Generating CMS
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
I'm a big fan of static HTML files. They're simple, self-contained, easily editable, and don't have much baggage associated with them. You can easily version control them with SVN, you have complete control over their structure, and they're fast.
But they obviously have their limitations: no dynamic content, no templates, no guards to make sure you write standards-compliant code, no syntax highlighting, etc. To combat these problems, most CMSes out there have taken to redirecting all requests through PHP files, which generate the HTML on the fly and serve it to the browser. If you're lucky, you'll have a single PHP front controller where all the requests get redirected.
I would like to propose a different approach: use Apache and .htaccess as your front controller, and have it serve static HTML files, calling a PHP file to generate the HTML from a source XHTML file whenever some condition is met.
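On the Apache side, the whole thing can be a single directive; as a sketch (the file name main.php here follows the example below, but is otherwise a placeholder):

```apache
# .htaccess sketch: if a static .html file exists, Apache serves it normally.
# If it doesn't, the request 404s and falls through to the PHP generator.
ErrorDocument 404 /main.php
```

The generator then has to take care to emit a proper 200 response with the freshly compiled page, rather than letting the 404 status leak through.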
Example:
You have demo.xhtml, which is your source file. It is a well-formed XHTML document, probably importing a few other namespaces which our post-processor will handle. Let's say it contains some text, uses <q> tags for quotations, contains the website's common header, and has some programming code.
Suppose Apache receives a request for http://www.example.com/demo.html, a file which does not currently exist. Apache forwards the request along to main.php, our 404 handler and also our main application entry point. It takes the requested URI and translates it into a source file, tests whether the source file exists, and then takes the source and processes it: it canonicalizes URIs from index.xhtml to index.html, it replaces <q> tags with curly quotes, it substitutes in the header, and it runs Geshi on the programming code (all of this done with the help of DOM and XPath). Then it writes the result out to the new HTML file and serves the data itself.
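The URI-to-source translation step might look something like this (a minimal sketch; the function name and the details are my own invention, not from an actual implementation):

```php
<?php
// Translate a requested URI like "/docs/demo.html" into the relative path
// of the source document it would be compiled from ("docs/demo.xhtml").
// Returns null for requests that cannot correspond to a source file.
function uri_to_source(string $uri): ?string {
    $path = parse_url($uri, PHP_URL_PATH);
    if ($path === null || $path === false) return null;
    // Only .html requests are candidates for compilation.
    if (substr($path, -5) !== '.html') return null;
    // Refuse anything that tries to escape the document root.
    if (strpos($path, '..') !== false) return null;
    // Strip the leading slash and swap the extension.
    return ltrim(substr($path, 0, -5) . '.xhtml', '/');
}
```

main.php would then check the returned path with file_exists(), run the DOM/XPath transformations, and write the result both to the browser and to the cache file.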
The next time a client requests the file, Apache will direct them straight to the static HTML: no fussing about necessary. When the source XHTML gets updated, delete the compiled HTML and let the 404 handler do its magic. You could also set up a cron job that compares filemtime() between the source and compiled files, or set up a special GET flag that flushes the cache.
(End example)
This is by no means meant to replace dynamic systems. If you route everything through a PHP file, you end up with a standard front controller with filesystem caching. But using this method we harness the power of Apache and mod_rewrite and let it take care of caching for us, gaining the convenience of PHP processing while removing the overhead.
So... thoughts? Comments? Prior art?
Re: File Generating CMS
Every server you deployed to would have to have this included in its Apache configuration for ONE CMS. I don't think this is a very good idea: it could introduce security holes into Apache and compromise servers, and it would eliminate the reusability part of PHP.

Ambush Commander wrote: I'm a big fan of static HTML files. They're simple, self-contained, easily editable, and don't have much baggage associated with them. You can easily version control them with SVN, you have complete control over their structure, and they're fast.
alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
I like the idea; I've thought the very same thing in the past: instead of using PHP to check for a cached file, use Apache.
Here are the problems that I see:
1) HTML files littered across your root directory or subdirectories are difficult to maintain once you get anywhere past 100
2) Clients using FTP will, at some point, **** up the resulting HTML file using Dreamweaver
Also, how do you handle search requests, etc.? You could easily cache listings (realty, etc.) so long as you didn't allow dynamic filtering of results, because once you do, the whole caching thing becomes extremely complicated. Maybe I'm just not understanding you totally.
I'm a CMS whore/expert... I've played with and developed dozens, so I'm certainly interested in fellow CMSers' opinions.
This would limit your CMS functionality, I think, but it really depends on your target market. Sounds more like a blogging system for developers, where you can guarantee end users don't muck up your markup.
hehe
Geshi is a syntax highlighting engine (if I recall feyd pointing that out a while back?), so I assume this is developer-oriented, in which case your idea is bang on, as it's serving the market nicely. Programmers, like you've made obvious, don't like bloat; we follow KISS and want everything as perfect as possible for our own situation. That's partly why we make such bad businessmen: developing software our way goes against the golden rule that the customer knows best.
Anyways, I say groovy, dude. Definitely pursue it and keep us updated... I'll help you here and there if I can; assisting you will be a learning experience, and a chance to tinker with that HTML engine thingy you developed.
Cheers
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
Actually, I don't think it needs to be limited. It could be done in .htaccess. Maybe I am not fully understanding your scheme, but you could probably do it with simple rewrite rules and a front controller that used clean URLs. All it would need to do is check a clean URL to see if the same URL with a .html extension existed. If it did, display the HTML file; if it did not, have the controller run and generate the HTML file. It would even work for paged output and searches (if they used clean URLs). The backend editors would just need to know which files/folders to delete when anything was edited. It is really just a caching system that pushes the caching logic into a simple "does the HTML file exist" rewrite rule.
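A sketch of that rule set, assuming the generated files live in a cache/ directory and the front controller is index.php (both names invented here for illustration):

```apache
RewriteEngine On

# If a generated file matching the clean URL already exists, serve it.
RewriteCond %{DOCUMENT_ROOT}/cache/$1.html -f
RewriteRule ^(.+)$ cache/$1.html [L]

# Otherwise, hand the clean URL to the front controller to generate it.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.+)$ index.php?q=$1 [L,QSA]
```

The controller writes its output into cache/ as a side effect, so the second rule only ever fires on a cold cache.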
-
alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
In my experience, not all content can be cached; it depends on the complexity of the CMS itself and the content it is delivering.
I think we all have a similar understanding as to what is desired, using mod_rewrite...
I worked with a proprietary CMS a few months ago which did something like you describe; of course, I cannot for love nor money find it again using Google... I want to say QuantraCMS, but that's not it.
Again, in my experience, using strictly Apache for delivering content is limiting (obviously, by the nature of what you can do with Apache as compared to PHP), but it also makes organization difficult. Content stuffed into database tables is much easier to maintain.
It really depends on your target market I guess.
Many moons ago I hacked phpWCMS to dump its content (stored in tables) to raw HTML files, so everything was naturally SEO-friendly. Because the content didn't change dynamically based on end-user input, this worked. There was no cache fetching, refreshing, etc...
I simply had a "generate site" button which re-created the entire site, or a particular page, when something changed. That's likely the easiest method, as you get the benefits of easy management with static content and no need for mod_rewrite... it's kinda like having a web-based Dreamweaver.
Cheers
- Buddha443556
- Forum Regular
- Posts: 873
- Joined: Fri Mar 19, 2004 1:51 pm
Sounds like funky caching [PHP tips and tricks (PDF), slide 23].
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
Buddha443556 wrote: Sounds like funky caching [PHP tips and tricks (PDF), slide 23].

Yes, and I thought about the 404 trick after I posted above, although those comments were from before clean URLs were the rage. Also, this idea would use only clean URLs, using mod_rewrite to create a tree of HTML files. So it is not so much an error system as a response cache system -- by other means.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Thanks for the replies. This'll be, for now, an in-house tool; I might release it officially if I find the paradigm works nicely.
Actually, there are two different approaches you can take: you can use mod_rewrite and its !-f condition to test if files exist, or you can just use Apache's ErrorDocument directive.
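In .htaccess terms, the two approaches look roughly like this (generate.php is a placeholder name for the generation script):

```apache
# Approach 1: mod_rewrite's !-f test rewrites only when the file is missing.
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.+)\.html$ generate.php?page=$1 [L,QSA]

# Approach 2: let Apache's normal 404 handling invoke the generator instead.
# ErrorDocument 404 /generate.php
```

The ErrorDocument route needs no mod_rewrite at all, at the cost of the generator having to reset the response status itself.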
A little bit on prior art, I think imageboards generate HTML to be served, although there's no dynamic creation of pages.
Well... time to get coding.
Buddha443556 wrote: Sounds like funky caching [PHP tips and tricks (PDF), slide 23].

Yup, that's precisely it. I also dug out my old "Advanced PHP Programming" book, and it looks like George Schlossnagle came up with the same idea. The only difference is that I'm adopting a "compile the HTML" mentality, where the data doesn't come from a database but from another static document.
Christopher wrote: So it is not so much an error system as a response cache system -- by other means.

Correct. An added bonus is that you could use mod_rewrite for clean URLs.
alex.barylski wrote: HTML files littered across your root directory or sub directories is difficult to maintain when you get anywhere past 100.

Yes, that is somewhat troublesome. If you want to mass-delete all the files (i.e. clear the cache), you've gotta be really careful.
alex.barylski wrote: Clients using FTP will at some time, **** up the resulting HTML file using DreamWeaver.

Since I think it'll be just developers using this, I don't think it would be too much of a problem. We could always use a CRC or MD5 checksum to ensure no local modifications had been made to the generated HTML, although I don't know where you would stash the checksum.
alex.barylski wrote: Also, how do you handle search requests, etc.? [...]

That's tough. I don't think it's going to be an option: the page's too dynamic, and I've never been fond of MySQL's FULLTEXT search.
alex.barylski wrote: This would limit your CMS functionality I think, but it really depends on your target market. Sounds more like a blogging system for developers, where you can guarantee end users don't muck up your markup.

You hit it spot-on. This is not meant to replace things like WordPress, MediaWiki, or the other heavyweight CMSes. This is for web developers who have begun to butt heads against the limitations of pure HTML but don't want to install a full system (and have to update it too!)
Christopher wrote: It is really just a caching system that pushes the caching logic into a simple "does the HTML file exist" rewrite rule.

Yep, although searches will be a little clunky due to the speed at which they get invalidated.
alex.barylski wrote: In my experience, not all content can be cached [...]

You're absolutely right. If the output is dynamic enough, you gain no benefit from adding caching to the system, so send it straight to a PHP file.
alex.barylski wrote: Again, in my experience using strictly Apache for delivering content is limiting [...]

Yes. The mod_rewrite fu is going to be the toughest part of an endeavor like this. However, as I noted above, mod_rewrite is not strictly necessary.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
I've decided to call it XHTML Compiler. I finished the scaffolding today, and it's really big: 16.2 KB of code just to implement the caching system. I also had to implement an .htaccess writer, which I didn't want to do, but otherwise it's impossible to keep the mod_rewrite rules in sync. It's all procedural code, so I am going to have a fun time trying to unit test it (help me out in that Singleton God Object thread!)
Currently, the implementation is thus:
You register the XHTML Compiler into certain directories where you want it to intercept calls to non-existent HTML files. When someone requests an HTML file that doesn't exist, mod_rewrite will redirect the request to the generation script. That script checks whether there is a source XHTML file and, if so, processes it, then outputs it both to the requested HTML file and to the user's browser. On the next call, the user gets served a real HTML file.
Fun stuff: there is an update.php script which lets you do various housekeeping work on all the cached HTML files: you can force them all to update, or you can clear them all out (sort of like a smart rm -r *.html). By default, it only updates pages whose source file has a later modification time than the cache file (i.e. the cache is stale), but from the command line you can trigger the other behaviors.
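The default staleness rule reduces to a filemtime() comparison; roughly like this (a sketch of the idea, not the actual update.php):

```php
<?php
// A cached page is stale when its source has been modified more recently
// than the compiled copy. A missing cache file (mtime of false) counts as
// stale too; a missing source means there is nothing to recompile.
function is_stale($source_mtime, $cache_mtime): bool {
    if ($cache_mtime === false) return true;   // never compiled
    if ($source_mtime === false) return false; // source gone; nothing to do
    return $source_mtime > $cache_mtime;
}

// update.php would drive it with something like:
//   if (is_stale(@filemtime($src), @filemtime($cache))) recompile($src);
```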
Now, the fun part: making transformations to the text.
I am far from this field, and I have no practical experience in the matter, so take this with a pinch of salt. (I am still pretty sure I am right, though!)
It sounds to me that you are trying to reinvent caching, which is already well-defined in HTTP. Let dynamic pages specify proper caching headers, put a reverse squid proxy in front of the server and you're done. I think this is what big portals do anyway.
If you're committed to your idea, you may take a look at CityDesk, desktop software for dynamic-static content generation. You write templates and content, and the program generates static HTML and uploads it to your server.
- Chris Corbyn
- Breakbeat Nuttzer
- Posts: 13098
- Joined: Wed Mar 24, 2004 7:57 am
- Location: Melbourne, Australia
- Buddha443556
- Forum Regular
- Posts: 873
- Joined: Fri Mar 19, 2004 1:51 pm
Mordred wrote: If you're committed to your idea, you may take a look at CityDesk [...]

That's what I've been doing with Perl. Though I like the idea of funky caching, I've never wanted to (or had the guts to) try it on a shared server. Generating hundreds of pages dynamically is what I'm trying to avoid with static pages in the first place. I certainly would like to hear some results of using the XHTML Compiler, though.
Have you looked into the possible race condition? I vaguely remember reading about that somewhere.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Mordred wrote: It sounds to me that you are trying to reinvent caching, which is already well-defined in HTTP. Let dynamic pages specify proper caching headers, put a reverse Squid proxy in front of the server and you're done. I think this is what big portals do anyway.

That's a very good point: using a reverse proxy can cache dynamic content for you. I think the exact same effect could be achieved with a few smart headers and a bunch of Squid proxies. So... I'll appeal to the fact that not everyone has a dedicated server and a reverse proxy to do something like that.
Emitting the proper cache headers is actually quite difficult, and most open-source web applications get it all wrong. It's quite disappointing. But with Apache serving static files, all this gets done for you: Not-Modified responses and ETags. Mmm...
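For comparison, here is roughly the bookkeeping an application has to do by hand to answer conditional GETs, which Apache performs automatically for static files (a sketch; the function name and ETag scheme are invented):

```php
<?php
// Answer a conditional GET for $file: emit 304 Not Modified when the
// client's cached copy is still current, otherwise send the file along
// with validators (Last-Modified and ETag) for next time.
function serve_with_validators(string $file): void {
    $mtime = filemtime($file);
    $etag  = '"' . md5($mtime . ':' . $file) . '"';
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
    header('ETag: ' . $etag);

    $inm = $_SERVER['HTTP_IF_NONE_MATCH'] ?? '';
    $ims = $_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? '';
    if ($inm === $etag || ($ims !== '' && strtotime($ims) >= $mtime)) {
        header('HTTP/1.1 304 Not Modified');
        return;
    }
    readfile($file);
}
```

Getting this wrong in subtle ways (weak vs. strong ETags, date parsing, header precedence) is exactly why delegating it to Apache is attractive.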
Mordred wrote: If you're committed to your idea, you may take a look at CityDesk [...]

Ooh, Fog Creek Software. Bound to be good. I do server-side compilation; CityDesk does client-side.
Buddha443556 wrote: Have you looked into the possible race condition? I vaguely remember reading about that somewhere.

From the reading I've done, it seems that it's not that big of a problem. On a heavily accessed website, the first time a page has to be generated, all concurrent requests will try to generate it until one of them finishes and caches it; the file might get overwritten several times in the process. But that only matters for heavy-contention websites.