Parsing Custom Markup with regex
Posted: Mon Jul 05, 2010 5:43 am
I posted last year while trying to figure out how to create auto-updating "glossary" links on my site. Since then I've started using "custom markup" for many things, not just "glossary" words. But I'm not sure I'm doing it the best way. It may be turning into something more like XML, so maybe XML would be a better way to do it?
Anyway the syntax is similar to wikitext, the "tags" are written like this:
[[image src="pic.gif" align="right"|This is the caption]]
Some of them are very simple, with no "attributes", like [[note|Here's the note]] (which might render as a div containing the text after the pipe, styled, with the words "Here's the note" written above it), whereas others are more complicated, like the [[image]] example above, with lots of "attributes".
In all cases, a script parses the content of a page and replaces the [[tags]] with correctly formatted HTML. In the [[image]] example, it returns a div aligned the specified way, with an img tag inside and a caption displayed under it. Later, if I want to change the way the images are displayed, I just change the parsing script and all of the images on every page can be rewritten to the new HTML (for example, if I want to put the caption on top, or make all images centered regardless of the specified "align" attribute).
The script uses preg_match to get an array of [[tags]], inspects them one at a time, creates the HTML version of each, and then replaces the [[tags]] with the HTML versions. Some of the matching can be very involved, with all the attributes and so on.
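For what it's worth, that whole match-inspect-replace loop can often be collapsed into a single preg_replace_callback pass. This is only a rough sketch, not your actual code: the function name, the attribute syntax, and the HTML output are guesses based on the examples above, and it doesn't handle nested [[tags]]:

```php
<?php
// Sketch of the tag pass, assuming syntax like
// [[image src="pic.gif" align="right"|caption]]. All names invented.
function parse_tags(string $content): string
{
    return preg_replace_callback(
        // [[name attr="val" ...|body]] -- the body is optional and non-greedy
        '/\[\[(\w+)([^|\]]*)(?:\|(.*?))?\]\]/s',
        function (array $m): string {
            $name = $m[1];
            $body = isset($m[3]) ? $m[3] : '';

            // Collect attr="value" pairs into an array
            $attrs = [];
            preg_match_all('/(\w+)="([^"]*)"/', $m[2], $pairs, PREG_SET_ORDER);
            foreach ($pairs as $p) {
                $attrs[$p[1]] = $p[2];
            }

            switch ($name) {
                case 'image':
                    $align = htmlspecialchars(isset($attrs['align']) ? $attrs['align'] : 'left');
                    $src   = htmlspecialchars(isset($attrs['src']) ? $attrs['src'] : '');
                    $cap   = htmlspecialchars($body);
                    return "<div class=\"img-$align\"><img src=\"$src\" alt=\"$cap\">"
                         . "<p class=\"caption\">$cap</p></div>";
                case 'note':
                    return '<div class="note">' . htmlspecialchars($body) . '</div>';
                default:
                    return $m[0]; // unknown tag: leave it alone
            }
        },
        $content
    );
}
```

Doing it in one callback pass also sidesteps the "replace each match back into the string" bookkeeping, which is usually where these parsers get slow or buggy.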
Many of the "tags" are things that don't change often (if at all), such as ones that just wrap things in divs. The problem comes with the ones that can be expected to change often, like the "glossary" tags I originally wanted to use. Say there's a page with this content:
[[note|My friend recently bought a [[glossary|car]] and a [[glossary|dog]].]]
And say that currently there is no glossary entry for the word "car". When someone looks at the page, they would see a link on the word "dog" that goes to the glossary entry for that word, but the word "car" would just appear as normal text. Later, when the word "car" gets a glossary entry, the word "car" in the example should automatically become a link to that entry.
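Just to illustrate that conditional-linking behaviour, here's roughly how it could look. Everything in this sketch is invented for the example; $glossary stands in for wherever your real entries live (flat files, a database table, etc.):

```php
<?php
// Sketch only: link [[glossary|word]] when an entry exists, otherwise
// emit the bare word. $glossary is a stand-in for the real entry store.
function link_glossary(string $content, array $glossary): string
{
    return preg_replace_callback(
        '/\[\[glossary\|([^\]]+)\]\]/',
        function (array $m) use ($glossary): string {
            $word = $m[1];
            $key  = strtolower($word);
            if (isset($glossary[$key])) {
                return '<a href="' . htmlspecialchars($glossary[$key]) . '">'
                     . htmlspecialchars($word) . '</a>';
            }
            return htmlspecialchars($word); // no entry yet: plain text
        },
        $content
    );
}
```

With `['dog' => '/glossary/dog']` as the store, "dog" becomes a link and "car" stays plain; as soon as a "car" entry is added, the next parse links it automatically, which is the behaviour described above.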
To handle this, currently I'm "caching" the pages: I run the unparsed page contents through the script and save the output as a "parsed" file. If something changes, I manually go back and re-parse the contents to create an updated "cached" file.
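One way to take the "remembering" out of that, assuming the pages live in flat files: compare timestamps and re-parse only when the source is newer than the cache. The function names and the callable parameter here are made up for the sketch:

```php
<?php
// Sketch: rebuild the cached file only when the raw source is newer.
// $parse is whatever function does the real [[tag]] expansion.
function get_page(string $srcFile, string $cacheFile, callable $parse): string
{
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($srcFile)) {
        return file_get_contents($cacheFile); // cache is still fresh
    }
    $html = $parse(file_get_contents($srcFile)); // re-parse and re-cache
    file_put_contents($cacheFile, $html);
    return $html;
}
```

When the glossary itself changes you'd still need to invalidate (touch the sources, or just delete the cache files so everything regenerates), but nothing depends on remembering a manual step per page.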
The problem is that this is becoming very tedious and isn't very reliable (because I don't always remember to do it). So I was wondering if this is even necessary?
Would it be too taxing on the server to run a moderately complicated preg_match and replace on the contents of every page, on the fly, every time someone looks at it? (My site gets several thousand views a month, and as I said, some of the matching is rather complicated, there can be many, many tags on a given page, and the content files can be 10 KB or more.)
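Rather than guessing, this is easy to measure. A throwaway benchmark along these lines (the regex and page size are simplified stand-ins for the real parser and content) shows what one parse actually costs; at a few thousand views a month, a regex pass over a ~16 KB page is very likely negligible next to the rest of the request:

```php
<?php
// Throwaway benchmark: what does one parse of a ~16 KB page cost?
// The regex here is a simplified stand-in for the real tag parser.
$raw = str_repeat('Some text with a [[note|Hi]] tag in it. ', 400);

$t0 = microtime(true);
for ($i = 0; $i < 100; $i++) {
    $html = preg_replace('/\[\[note\|(.*?)\]\]/s', '<div class="note">$1</div>', $raw);
}
$perParse = (microtime(true) - $t0) * 1000 / 100;
printf("%.3f ms per parse\n", $perParse);
```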
Or is it best to continue "caching" the pages the way I do now?
Or could I split the difference: cache the things that don't change much, leave the glossary tags etc. unparsed, and parse only those on the fly each time the page is viewed?
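That split-the-difference idea seems workable, for what it's worth: at save time, expand every tag except [[glossary|...]] and cache the result, so the per-view work is a single cheap regex pass. A negative lookahead can make the caching pass skip the glossary tags. Sketch only; the <div> output is a stand-in for the real per-tag HTML, and nested tags aren't handled:

```php
<?php
// Sketch of the save-time pass for the hybrid approach: expand every
// tag EXCEPT [[glossary|...]], which survives into the cached file
// and gets resolved at view time.
function parse_stable(string $raw): string
{
    return preg_replace_callback(
        // (?!glossary\|) leaves glossary tags untouched
        '/\[\[(?!glossary\|)(\w+)[^|\]]*(?:\|(.*?))?\]\]/s',
        function (array $m): string {
            $body = isset($m[2]) ? $m[2] : '';
            return '<div class="' . $m[1] . '">' . htmlspecialchars($body) . '</div>';
        },
        $raw
    );
}
```

At view time you'd read the cached file and run just the glossary replacement over it, so the expensive multi-tag matching happens once per edit instead of once per view.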