
multi-format text parser

Posted: Sat Apr 05, 2008 3:16 pm
by georgeoc
Hi all,

I'm currently in the middle of a refactor of my application - a PHP message centre. It's a kind of distribution hub for messages between a variety of sources and destinations, which might be in any number of formats. For example, one simple use case might be to take a phpBB forum post as input and send the message on as an HTML email, with all the formatting and attachments of the original.

I have already written a tokenizer/lexer to convert BBCode (used by phpBB) into HTML and/or plain text. However, I'm considering supporting other formats such as Wiki formatting, Markdown and similar. This has got me wondering how feasible it would be to make a single parser which can take any format and output any other, rather than separate parsers such as BBCode->HTML, Markdown->BBCode, etc.

On the face of it, it's a lot of work, but I'm not sure quite how complex it might turn out to be. It would mean tokenizing the input based on a set of rules for the input format, with each token given a generic rather than a format-specific name, and then reassembling the tokens based on rules for the output format.
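To make the idea concrete, here's a minimal sketch of what I have in mind - all the function and token names are just illustrative, handling only bold text, and not from any existing library:

```php
<?php
// Hypothetical sketch: each input format gets a lexer that emits
// format-neutral tokens, and each output format gets a renderer
// that consumes them. Only [b]...[/b] is handled, for illustration.
function bbcodeToTokens(string $input): array {
    $parts = preg_split('/(\[\/?b\])/', $input, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $tokens = [];
    foreach ($parts as $part) {
        if ($part === '[b]') {
            $tokens[] = ['type' => 'strong_open'];
        } elseif ($part === '[/b]') {
            $tokens[] = ['type' => 'strong_close'];
        } else {
            $tokens[] = ['type' => 'text', 'value' => $part];
        }
    }
    return $tokens;
}

// One renderer per target format; an HTML renderer would map the same
// generic token names to <strong>...</strong> instead.
function tokensToMarkdown(array $tokens): string {
    $out = '';
    foreach ($tokens as $token) {
        switch ($token['type']) {
            case 'strong_open':
            case 'strong_close':
                $out .= '**';
                break;
            case 'text':
                $out .= $token['value'];
                break;
        }
    }
    return $out;
}

echo tokensToMarkdown(bbcodeToTokens('[b]bold text[/b] plain'));
// prints **bold text** plain
```

The real thing would obviously need nesting, attributes and error recovery, but this is the shape of it.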

I'd like some advice and observations about how I could do this, what the pitfalls might be, and any existing solutions which might help. Since HTMLPurifier seems so stable and successful, I have been wondering in the back of my mind if I might tap into its code somehow and build upon it as a framework. I haven't looked closely enough at the code yet to determine if this might be possible.

Many thanks for any comments!

Re: multi-format text parser

Posted: Sun Apr 06, 2008 9:47 pm
by Ambush Commander
The first step in doing something like this is determining what the "interchange" format will be, so that you only need 2N converters: N converting the interchange format into each target format, and N converting each source format into the interchange format. Otherwise you need N^2 converters, which, while more efficient at runtime (no intermediate step), will not be pleasant to code. You mention defining your own format - while this makes me leery, it's highly unlikely that you'll be able to find an existing format that is a strict superset of all the languages you want to support (wikitext, BBCode, Markdown, HTML). Heck, if you want to implement Wikipedia's wikitext you're SOL.
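To make the 2N layout concrete, here's a rough sketch (all names invented, and the "converters" are toy string replacements, not real parsers):

```php
<?php
// Every format implements two conversions against a single interchange
// format; any-to-any conversion then composes them. With N formats you
// write 2N implementations but cover all N*N source/target pairs.
interface FormatConverter {
    public function toInterchange(string $input): string;   // format -> hub
    public function fromInterchange(string $hub): string;   // hub -> format
}

class BBCodeConverter implements FormatConverter {
    public function toInterchange(string $input): string {
        return str_replace(['[b]', '[/b]'], ['<strong>', '</strong>'], $input);
    }
    public function fromInterchange(string $hub): string {
        return str_replace(['<strong>', '</strong>'], ['[b]', '[/b]'], $hub);
    }
}

class MarkdownConverter implements FormatConverter {
    public function toInterchange(string $input): string {
        return preg_replace('/\*\*(.+?)\*\*/', '<strong>$1</strong>', $input);
    }
    public function fromInterchange(string $hub): string {
        return str_replace(['<strong>', '</strong>'], ['**', '**'], $hub);
    }
}

// Any source to any target, always via the hub format.
function convert(FormatConverter $from, FormatConverter $to, string $input): string {
    return $to->fromInterchange($from->toInterchange($input));
}

echo convert(new BBCodeConverter(), new MarkdownConverter(), '[b]hi[/b]');
// prints **hi**
```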

That being said, I think HTML, or at least some XML-based language, is your best bet, since it can be DOM-ized.

As for HTML Purifier, its codebase is built on the assumption of filtering data. Your cross-format hub will still need security filtering, so it can be useful there, but beyond the general-purpose UTF-8 handling and other utility functions, and maybe our tokenizers for HTML, I don't know what else you'd be able to build off of.

Expect to write and rewrite code. Doing a one-way translation is hard enough. :-)

Re: multi-format text parser

Posted: Fri Apr 25, 2008 8:49 pm
by georgeoc
OK, so I've been doing some research on this tonight. Here's where I've got to so far:

- It seems like a flavour of XML is the obvious interchange format. In the early stages, I must support BBCode, HTML, Markdown and plain text, and XHTML seems to be the natural superset.

- I need to convert incoming messages into XML. For both BBCode and Markdown I would need to write or adapt a parser; for HTML, I imagine I could use HTMLPurifier both to ensure the input is valid XML and to remove malicious code.

- The XML interchange format would effectively be XHTML with a limited whitelist of tags - similar to the standard phpBB set of BBCode tags. My users won't need more complex tags, so I will set HTMLPurifier to strip these from incoming HTML messages.

- I have been looking into XSL for the first time, and it seems that if I define a standardised XML format as above, I can use a separate XSL template for each of the output formats I need. Since the interchange XML is already valid XHTML, XHTML output needs no further processing.

- The question now for me, before I get cracking, is about the parsers for BBCode -> XML and Markdown -> XML. I've done a bit of research and found a range of information, including an old article by Harry Fuecks which recommends the PEAR Text_Wiki library. I'm keen to find a future-proof solution, so I want to use a Parser/Lexer which can be extended for each input format I want to support. I'd like to avoid using a separate solution for each format!

- Is it crazy to be using a Lexer to split a BBCode string into tokens, parse them and reassemble them into the XML interchange format, only to pass the XML to an XSL template to be transformed into yet another format? That seems like two costly operations where there should only be one. Should I instead use a Lexer to get an array of tokens and use that as the interchange format? Then I lose the power of XSL, which seems a shame.
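For what it's worth, the XSL half of the pipeline looks simple enough. Here's a minimal sketch of interchange-XHTML -> BBCode using PHP's XSLTProcessor (requires the php-xsl extension); the stylesheet only handles strong and em, and the sample input is made up:

```php
<?php
// Transform a whitelist-XHTML interchange fragment into BBCode.
// Built-in XSLT templates copy text nodes through, so only the
// elements we care about need explicit templates.
$xsl = <<<'XSL'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="strong">[b]<xsl:apply-templates/>[/b]</xsl:template>
  <xsl:template match="em">[i]<xsl:apply-templates/>[/i]</xsl:template>
</xsl:stylesheet>
XSL;

$xml = '<p>Some <strong>bold</strong> and <em>italic</em> text</p>';

$xslDoc = new DOMDocument();
$xslDoc->loadXML($xsl);
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xml);

$proc = new XSLTProcessor();
$proc->importStylesheet($xslDoc);
echo $proc->transformToXML($xmlDoc);
// prints: Some [b]bold[/b] and [i]italic[/i] text
```

One stylesheet like this per output format would cover the interchange -> output direction.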

Re: multi-format text parser

Posted: Sat Apr 26, 2008 12:51 am
by matthijs
Hi georgeoc, have you seen the TDD thread here?
"As an educational experience, we're going to run an (almost) live TDD session here on the forum. The end product we're looking to produce is an abstraction library for reading and writing data formats. We'll set out to support at least three formats initially and then see how many we can support beyond that.

Requirements

Create a library for reading and writing tree-like data structures in the simplest way possible. The library should be able to read data in one format and write it back out in a different format."
It is mostly an exercise for a few of us to learn TDD, so we are not that far into any code yet, but it might be interesting to follow anyway?

Re: multi-format text parser

Posted: Sat Apr 26, 2008 5:37 am
by georgeoc
matthijs wrote: Hi georgeoc, have you seen the TDD thread here?
...
It is mostly an exercise for a few of us to learn TDD, so we are not that far into any code yet, but might be interesting to follow anyway?
Hi. I've been following that thread, but what I need is quite an advanced Lexer/Parser for a number of different formats - I can't imagine you'll be getting to that stage for a while!

Thanks anyway.

Re: multi-format text parser

Posted: Sat Apr 26, 2008 3:35 pm
by georgeoc
I've been looking further into the PEAR Text_Wiki class, and I'm now leaning towards rewriting it along similar lines, borrowing the design and simplifying the code to fit my own requirements.

I wonder if you can help me understand one oddity in the code. The regex parser replaces opening and closing tags with tokens, so this:

Code:

[b]bold text[/b]
would be transformed to this:

Code:

_DELIM_1_DELIM_bold text_DELIM_2_DELIM_
where 1 and 2 are the ids of the tokens in the master array, and _DELIM_ is a character string specified in the class description. My question is: why has the author chosen this particular character for the delimiter?
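As I understand the mechanism (this is my own rough reconstruction, not Text_Wiki's actual code), each recognised tag is swapped for a delimited id and stored in a master array, so later passes work on plain text without re-matching the tags:

```php
<?php
// Tokenize pass: replace each tag with _DELIM_<id>_DELIM_ and remember
// the original tag in a master array, keyed by id.
$delim = "\xFF";
$tokens = [];

$tokenize = function (array $match) use (&$tokens, $delim) {
    $id = count($tokens);
    $tokens[$id] = $match[0];        // remember the original tag
    return $delim . $id . $delim;    // e.g. "\xFF0\xFF"
};

$source = '[b]bold text[/b]';
$tokenized = preg_replace_callback('/\[\/?b\]/', $tokenize, $source);
// $tokenized is now "\xFF0\xFFbold text\xFF1\xFF"

// Render pass: replace each delimited id with the target-format markup.
$rendered = preg_replace_callback(
    "/$delim(\d+)$delim/",
    function (array $m) use ($tokens) {
        return $tokens[$m[1]] === '[b]' ? '<strong>' : '</strong>';
    },
    $tokenized
);
echo $rendered; // prints <strong>bold text</strong>
```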

Code:

var $delim = "\xFF";
Is there something special about that character? From an extended ASCII table, it seems to be ÿ. But what's the reason for choosing that particular character?
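One possibility I can think of, which I'd want to verify: the byte 0xFF never appears anywhere in a well-formed UTF-8 byte sequence, and it's rare in user text generally, so it's a cheap delimiter that's very unlikely to collide with message content. A quick check (mb_check_encoding needs the mbstring extension):

```php
<?php
// 0xFF does not occur in any valid UTF-8 encoding, even for characters
// like ÿ, whose UTF-8 bytes are 0xC3 0xBF.
$utf8Sample = "bold ÿ text with ünïcödé";
var_dump(strpos($utf8Sample, "\xFF"));          // bool(false): no 0xFF byte
var_dump(mb_check_encoding("\xFF", 'UTF-8'));   // bool(false): invalid UTF-8
```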