Saving user-submitted input

Posted: Tue May 01, 2007 5:37 pm
by Ambush Commander
For the demo page on HTML Purifier's website, I would like to implement some logging for sample input users submit to get more information on what people are testing, as well as allow them to refer back to previous tests in order to validate large amounts of input with W3C's validation service. I imagine it implemented like this:

1. User submits sample HTML to demo.php via POST
2. demo.php takes an md5, enters the data into the database, and then sends back a redirect to demo.php?h=md5-hash-of-html
3. User's browser redirects to that page
4. demo.php uses md5 hash to retrieve original text from database, purifies it, and then outputs it to user

This means that any HTML submitted gets a permalink keyed by the md5 hash of the data. There will be a 50 KB cap on submitted text. The saved data would be accessible via a password-protected admin interface; otherwise, a user would have to know the md5 of the text they wish to access.
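A minimal sketch of the four steps above, not the actual demo.php. It assumes an in-memory array in place of the database table, and the names `save_sample`, `permalink`, and `load_sample` are hypothetical. The purification step is left out.

```php
<?php
// Sketch only: an array stands in for the database table, and the function
// names are hypothetical, not part of HTML Purifier or the real demo.php.

const MAX_SAMPLE_BYTES = 50 * 1024; // the 50kb cap on submitted text

// Step 2: hash the input, store it, and return the hash for the redirect.
function save_sample(array &$store, string $html): string
{
    if (strlen($html) > MAX_SAMPLE_BYTES) {
        throw new LengthException('Sample exceeds the 50kb cap');
    }
    $hash = md5($html);
    $store[$hash] = $html;
    return $hash;
}

// Step 2, continued: the URL the browser is redirected to.
function permalink(string $hash): string
{
    return 'demo.php?h=' . $hash;
}

// Step 4: look the original text up again; null if unknown or malformed.
function load_sample(array $store, string $hash): ?string
{
    if (preg_match('/^[0-9a-f]{32}$/', $hash) !== 1) {
        return null;
    }
    return $store[$hash] ?? null;
}

$store = [];
$hash = save_sample($store, '<b>Bold');
echo permalink($hash), "\n";
var_dump(load_sample($store, $hash) === '<b>Bold'); // bool(true)
```

Rejecting anything that is not a 32-character hex string before the lookup keeps malformed `h` values away from the storage layer.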

Are there any security/privacy implications to doing this?

Posted: Tue May 01, 2007 5:46 pm
by RobertGonzalez
Aside from users being silly enough to sample sensitive code AND an md5 hash collision on it, it wouldn't appear that there would be any serious implications.

Posted: Tue May 01, 2007 5:51 pm
by Ambush Commander
You'd be surprised. I would put a scary notice saying that all requests would be logged, but I'm afraid that would deter users. Hmm... how to word...

I'm not really worried about collisions, since the user input has to be obviously doctored to trigger that, and in that case I find no reason to accommodate the guilty party.

Posted: Tue May 01, 2007 5:53 pm
by superdezign
Well then, I'm curious. If you end up inputting long values into md5, won't there eventually be a duplicate hash? md5 doesn't have an infinite number of combinations, and I could see input that's longer than the number of characters in the md5 increasing the chances of a repeated hash.

Am I wrong?

Posted: Tue May 01, 2007 5:59 pm
by RobertGonzalez
I think collisions will be rare. But it is the only thing that I can think of that would compromise privacy/security (other than an outright crack of the app) since that will be the identifier for the history of sampling.

Posted: Tue May 01, 2007 6:01 pm
by Ambush Commander
Yes. There are only about 3.40e+38 combinations (2^128). We'd need on the order of 2^64 = 18,446,744,073,709,551,616 entries before the probability of a random collision approaches 50%. Of course, long before then, my hosting plan would have run out of disk space. :P
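For anyone who wants to check the figures, here is a rough sketch using the standard birthday approximation, p ≈ 1 - exp(-n² / 2N) with N = 2^bits. The helper name `collision_probability` is made up for illustration. At n = 2^64 entries the approximation gives roughly 39% for a 128-bit hash, so the 50% mark sits slightly higher than 2^64.

```php
<?php
// Birthday approximation: p ~ 1 - exp(-n^2 / (2N)), with N = 2^bits.
// collision_probability() is a hypothetical helper name.
function collision_probability(float $entries, int $bits): float
{
    $space = pow(2.0, $bits);
    $x = ($entries * $entries) / (2.0 * $space);
    return -expm1(-$x); // 1 - e^(-x), numerically stable for tiny x
}

// 2^64 entries hashed with a 128-bit function such as md5
printf("%.3f\n", collision_probability(pow(2.0, 64), 128)); // 0.393

// A billion entries: the probability is vanishingly small
printf("%.3e\n", collision_probability(1e9, 128));
```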

md5 collisions are scary, but for our purposes I think the readability benefits of a shorter hash outweigh the problems of colliding hashes. (mm... maybe I could use crc!)

Posted: Tue May 01, 2007 8:13 pm
by feyd
CRC, or a more compressed form, would suffice, I think. Not to mention it being more email friendly.

One thing I would suggest is a lifetime associated with the data. Maybe even the ability for the user to choose not to have it stored (or period of time to store it.)

Posted: Tue May 01, 2007 8:37 pm
by Ambush Commander
While I like the idea of expiring the data, I myself would also like to take a look at what people are pumping through my form, in case I notice trends/common things people check, which seems to preclude expiration of the data.

Also, a 32-bit CRC reaches a roughly 50% chance of collision at only about 77,000 existing entries (and is already near 40% at 65,536), which is too small for comfort.
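To put a number on that threshold, here is a small sketch that inverts the birthday approximation: n ≈ sqrt(2N · ln(1 / (1 - p))) with N = 2^bits. It assumes the CRC in question is the 32-bit one, and `entries_for_collision` is a hypothetical helper name. For 32 bits it lands at about 77,000 entries, a little above 2^16 = 65,536.

```php
<?php
// Entries needed to reach collision probability $p under the birthday
// approximation: n ~ sqrt(2 * N * ln(1 / (1 - p))), with N = 2^bits.
// entries_for_collision() is a hypothetical helper name.
function entries_for_collision(int $bits, float $p = 0.5): float
{
    $space = pow(2.0, $bits);
    return sqrt(2.0 * $space * log(1.0 / (1.0 - $p)));
}

echo round(entries_for_collision(32)), "\n"; // roughly 77,000

// PHP's built-in CRC32 as an 8-character hex string
echo hash('crc32b', 'sample input'), "\n";
```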

Posted: Tue May 01, 2007 9:23 pm
by Z3RO21
What about a different hash algorithm? I contemplated the use of dual keys, but I do not know if this will help your situation.

Posted: Tue May 01, 2007 10:51 pm
by shiflett
The privacy issues seem like they would be very similar to those of pastebin.com. A quick glance there leads me to believe there is no stated privacy policy, and I think they could do a better job of making it clear that anything entered is public. I'd expect better from you. :-)

They do provide three options regarding how long the entry should persist, which might be an idea to consider.

Posted: Tue May 01, 2007 11:06 pm
by Ambush Commander
Sounds like an idea. Selectable options on how long the sample lasts and a link to a privacy page.
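A sketch of how selectable lifetimes might be modelled. The option names and durations below are assumptions chosen for illustration (they are not taken from pastebin.com), and `expires_at` and `is_expired` are hypothetical helpers.

```php
<?php
// Hypothetical lifetime options in seconds; null means "keep indefinitely".
const LIFETIMES = [
    'day'     => 86400,
    'month'   => 30 * 86400,
    'forever' => null,
];

// Expiry timestamp for a sample saved at $now, or null if it never expires.
function expires_at(string $option, int $now): ?int
{
    if (!array_key_exists($option, LIFETIMES)) {
        throw new InvalidArgumentException("Unknown lifetime: $option");
    }
    $seconds = LIFETIMES[$option];
    return $seconds === null ? null : $now + $seconds;
}

// True once the stored expiry time has been reached.
function is_expired(?int $expiresAt, int $now): bool
{
    return $expiresAt !== null && $now >= $expiresAt;
}

$savedAt = time();
$expiry = expires_at('day', $savedAt);
var_dump(is_expired($expiry, $savedAt)); // bool(false)
```

Storing the expiry timestamp alongside each sample means the permalink lookup can refuse expired rows without needing a cleanup job to have run first.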

Looking over the md5 comments, this might be a more compact way of using MD5:

rtrim(base64_encode(pack("H*",md5($data))),"=");
...which is 22 characters versus 32. demo.php?n=Ux/1AxD2+0XN8Ivx8fjs5A is quite palatable, although I don't think that URLs will be very appreciative of slashes and plus signs (there does appear to be a variant of base64 that uses filename-safe characters, namely '-_' in place of '+/').
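A sketch of the filename-safe variant mentioned above, swapping '+/' for '-_' with `strtr`. The helper names `url_safe_hash` and `url_safe_hash_to_hex` are hypothetical; `md5($data, true)` returns the raw 16 bytes, which is equivalent to the `pack("H*", md5($data))` form.

```php
<?php
// url_safe_hash() and url_safe_hash_to_hex() are hypothetical helper names.

// 22-character md5 using the '-_' base64 alphabet, with padding stripped.
function url_safe_hash(string $data): string
{
    $b64 = base64_encode(md5($data, true)); // 16 raw bytes -> 24 chars
    return rtrim(strtr($b64, '+/', '-_'), '=');
}

// Reverse the encoding to recover the usual 32-character hex md5.
function url_safe_hash_to_hex(string $short): string
{
    $b64 = strtr($short, '-_', '+/');
    // Restore the '=' padding that was stripped for the URL
    $b64 = str_pad($b64, (int) (ceil(strlen($b64) / 4) * 4), '=');
    return bin2hex(base64_decode($b64));
}

$short = url_safe_hash('<b>Bold');
echo 'demo.php?n=', $short, "\n";
var_dump(url_safe_hash_to_hex($short) === md5('<b>Bold')); // bool(true)
```

Since the result contains only letters, digits, '-' and '_', it should not need urlencoding when used as a query value.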

Posted: Wed May 02, 2007 2:05 am
by Ollie Saunders
Well then you'll have to urlencode it and it's going to get longer again. Why don't you just use a unique id? You're using a database for this so the ability to generate one is already provided.

Posted: Wed May 02, 2007 1:45 pm
by Ambush Commander
I considered using AUTO_INCREMENT, but the primary problem I see is that it makes it really easy to trawl for previously submitted inputs simply by tweaking the ID numbers. Of course, that may not necessarily be a bad thing, but it's another privacy issue.

Posted: Wed May 02, 2007 1:55 pm
by RobertGonzalez
Why not offer the option of saving samples? If a user opts for this (a nice little opt-in type exoneration), then you can assign them a generic user id of some sort (like an 8 char random string) which can be used to validate the entry id they select, so no one can trawl the DB for other people's samples.
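One way the random-string idea could look, as a sketch only. It assumes PHP 7 or later for `random_bytes`, a record stored as an array with a `token` field, and hypothetical helper names `new_owner_token` and `can_view`.

```php
<?php
// new_owner_token() and can_view() are hypothetical helper names.

// 8-character random string (4 random bytes rendered as hex).
function new_owner_token(): string
{
    return bin2hex(random_bytes(4));
}

// A sample is shown only when the supplied token matches the stored one.
function can_view(array $record, string $token): bool
{
    return hash_equals($record['token'], $token);
}

// Hypothetical record shape: sequential id plus the owner's token
$record = ['id' => 42, 'token' => new_owner_token(), 'html' => '<b>Bold'];

var_dump(can_view($record, $record['token'])); // bool(true)
```

With this shape, stepping through sequential ids is not enough to read someone else's sample, because each id also needs its matching token.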

PS | This is not a fully thought through idea, but it might lead to something useful for you.

Posted: Wed May 02, 2007 3:58 pm
by Ambush Commander
Mmm... I don't want to require people to type in more opt-in stuff. Although that suggests another idea: crc (a substitute for the eight char random string) + unique id (maintains uniqueness).
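A sketch of the crc + unique id combination, assuming a key format of `id-crc` (the format, the `$rows` array standing in for the table, and the names `sample_key` and `resolve_key` are all hypothetical). The id guarantees uniqueness, and the CRC of the stored text has to match before anything is returned. A CRC is not a cryptographic check, so this only raises the cost of trawling by id.

```php
<?php
// sample_key() and resolve_key() are hypothetical helper names.

// Build a key such as "42-1a2b3c4d" from the row id and the text's CRC32.
function sample_key(int $id, string $html): string
{
    return $id . '-' . hash('crc32b', $html);
}

// Return the stored text only if the id exists and the CRC part matches.
function resolve_key(array $rows, string $key): ?string
{
    if (preg_match('/^(\d+)-([0-9a-f]{8})$/', $key, $m) !== 1) {
        return null;
    }
    $id = (int) $m[1];
    if (!isset($rows[$id])) {
        return null;
    }
    $expected = hash('crc32b', $rows[$id]);
    return hash_equals($expected, $m[2]) ? $rows[$id] : null;
}

$rows = [42 => '<b>Bold'];               // id => submitted HTML
$key = sample_key(42, $rows[42]);
var_dump(resolve_key($rows, $key) === '<b>Bold'); // bool(true)
var_dump(resolve_key($rows, '43-00000000'));      // NULL
```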