multi-format text parser
Posted: Sat Apr 05, 2008 3:16 pm
Hi all,
I'm currently in the middle of a refactor on my application - a PHP message centre. It's a kind of distribution hub for messages from a variety of sources and destinations, which might be in any number of formats. For example, one simple use case might be to take a phpBB forum post as an input, and send the message in an HTML email, with all the formatting and attachments of the original.
I have already written a tokenizer/lexer to convert BBCode (used by phpBB) into HTML and/or plain text. However, I'm considering supporting other formats such as Wiki formatting, Markdown and similar. This has got me wondering how feasible it would be to make a single parser which can take any format and output any other, rather than separate parsers such as BBCode->HTML, Markdown->BBCode, etc.
On the face of it, it's a lot of work, but I'm not sure how complex it would really turn out to be. The idea is to tokenize the input using a set of rules for the input format, giving each token a generic rather than a format-specific name, and then to reassemble the tokens using a set of rules for the output format.
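To make the idea concrete, here is a rough sketch of what I have in mind. It is in Python rather than PHP purely to keep it short, and it is not my existing lexer. Every name in it (Token, parse_bbcode, render, the two rule tables) is made up for illustration, and it only understands unnested-or-nested [b] and [i] tags with no attributes, so treat it as a toy under those assumptions.

```python
import re
from dataclasses import dataclass
from html import escape


@dataclass
class Token:
    kind: str   # "text", "open" or "close"
    value: str  # literal text, or a generic name such as "strong"


# Hypothetical input rules: BBCode tag name -> generic token name.
BBCODE_TO_GENERIC = {"b": "strong", "i": "emphasis"}
BBCODE_TAG = re.compile(r"\[(/?)(\w+)\]")

# Hypothetical output rules: generic token name -> (opening, closing) markup.
HTML_RULES = {"strong": ("<strong>", "</strong>"), "emphasis": ("<em>", "</em>")}
MARKDOWN_RULES = {"strong": ("**", "**"), "emphasis": ("*", "*")}


def parse_bbcode(source):
    """Turn BBCode into a flat list of generically named tokens."""
    tokens = []
    pos = 0
    for match in BBCODE_TAG.finditer(source):
        generic = BBCODE_TO_GENERIC.get(match.group(2).lower())
        if generic is None:
            continue  # unknown tag: leave it as part of the surrounding text
        if match.start() > pos:
            tokens.append(Token("text", source[pos:match.start()]))
        tokens.append(Token("close" if match.group(1) else "open", generic))
        pos = match.end()
    if pos < len(source):
        tokens.append(Token("text", source[pos:]))
    return tokens


def render(tokens, rules, escape_text=lambda text: text):
    """Reassemble tokens using the rule table of the output format."""
    parts = []
    for token in tokens:
        if token.kind == "text":
            parts.append(escape_text(token.value))
        else:
            opening, closing = rules[token.value]
            parts.append(opening if token.kind == "open" else closing)
    return "".join(parts)


tokens = parse_bbcode("Hello [b]world[/b] and [i]you[/i]")
html_output = render(tokens, HTML_RULES, escape)
markdown_output = render(tokens, MARKDOWN_RULES)
```

The appeal is that each new format would need only one parser and one rule table, instead of a converter for every pair of formats. What the sketch glosses over is probably where the real pitfalls are: attributes such as URLs, badly nested tags, block-level structures like lists and quotes, and features that exist in one format but not in another.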
I'd like some advice and observations about how I could do this, what the pitfalls might be, and any existing solutions which might help. Since HTML Purifier seems so stable and successful, I have been wondering whether I might tap into its code somehow and build upon it as a framework. I haven't looked closely enough at the code yet to determine whether this is possible.
Many thanks for any comments!