Saving user-submitted input
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
For the demo page on HTML Purifier's website, I would like to log the sample input users submit, both to learn more about what people are testing and to let them refer back to previous tests so they can validate large amounts of input with W3C's validation service. I imagine it implemented like this:
1. User submits sample HTML to demo.php via POST
2. demo.php takes an md5, enters the data into the database, and then sends back a redirect to demo.php?h=md5-hash-of-html
3. User's browser redirects to that page
4. demo.php uses md5 hash to retrieve original text from database, purifies it, and then outputs it to user
This means that there is a permalink associated with any HTML submitted: the MD5 hash of the data. There will be a 50 KB cap on submitted text. The saved data would be accessible via a password-protected admin interface; otherwise, the user would have to know the MD5 of the text they wish to access.
Are there any security/privacy implications to doing this?
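For reference, the four steps above could be sketched roughly as follows. This is only an illustration under assumptions: a plain PHP array stands in for the database, and the helper names (store_sample, permalink, fetch_sample) are hypothetical, not part of demo.php. The purification and redirect steps are left as comments.

```php
<?php
// Hypothetical sketch of the submit / redirect / retrieve flow.
// An array stands in for the real database table.

const MAX_BYTES = 51200; // the 50 KB cap on submitted text

// Step 2: hash the submission and store it under that hash.
// Returns the hash, or null if the input exceeds the cap.
function store_sample(array &$db, string $html): ?string
{
    if (strlen($html) > MAX_BYTES) {
        return null;
    }
    $hash = md5($html);
    if (!isset($db[$hash])) {
        $db[$hash] = $html;
    }
    return $hash;
}

// Step 2 (continued): the URL that demo.php would redirect to,
// e.g. with header('Location: ' . permalink($hash)).
function permalink(string $hash): string
{
    return 'demo.php?h=' . $hash;
}

// Step 4: look the original text up again by its hash.
// Anything that is not 32 lowercase hex characters is rejected outright.
function fetch_sample(array $db, string $hash): ?string
{
    if (!preg_match('/^[0-9a-f]{32}$/', $hash)) {
        return null;
    }
    // The caller would purify the returned text before output.
    return $db[$hash] ?? null;
}
```

Validating the h parameter before using it as a lookup key keeps arbitrary strings out of the query, whatever storage ends up behind it.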
- RobertGonzalez
- Site Administrator
- Posts: 14293
- Joined: Tue Sep 09, 2003 6:04 pm
- Location: Fremont, CA, USA
- Ambush Commander
You'd be surprised. I would put a scary notice saying that all requests would be logged, but I'm afraid that would deter users. Hmm... how to word...
I'm not really worried about collisions, since the user input has to be obviously doctored to trigger that, and in that case I find no reason to accommodate the guilty party.
- superdezign
- DevNet Master
- Posts: 4135
- Joined: Sat Jan 20, 2007 11:06 pm
- RobertGonzalez
- Ambush Commander
Yes. There are only about 3.40e+38 (2^128) combinations. We'd need on the order of 18,446,744,073,709,551,616 (2^64) entries before the probability of a random collision becomes more than 50%. Of course, long before then, my hosting plan would have run out of disk space.
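The figures above follow from the birthday bound, which can be checked with a short sketch. The function names here are illustrative only, and the formula is the usual approximation p ≈ 1 - exp(-n^2 / (2 * 2^bits)), which assumes hash values are uniformly distributed.

```php
<?php
// Birthday-bound approximation for a hash with $bits bits of output.
// Assumes uniformly distributed hash values; names are hypothetical.

// Approximate probability of at least one collision among $entries items.
function collision_probability(float $entries, int $bits): float
{
    $space = pow(2.0, $bits);
    // 1 - exp(-x), written with expm1 to stay accurate for tiny x
    return -expm1(-($entries * $entries) / (2.0 * $space));
}

// Approximate number of entries at which a collision becomes even odds.
function entries_for_even_odds(int $bits): float
{
    return sqrt(2.0 * log(2.0) * pow(2.0, $bits));
}
```

Under this approximation, exactly 2^64 entries gives about a 39% chance for a 128-bit hash, and even odds arrive at roughly 1.18 times that, so the 2^64 figure is the right order of magnitude.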
md5 collisions are scary, but for our purposes I think the readability benefits of a shorter hash outweigh the problems of colliding hashes. (mm... maybe I could use crc!)
- Ambush Commander
While I like the idea of expiring the data, I myself would also like to take a look at what people are pumping through my form, in case I notice trends/common things people check, which seems to preclude expiration of the data.
Also, a 32-bit CRC has roughly a 50% chance of collision once there are around 65,536 (2^16) existing entries, which is too small for comfort.
The privacy issues seem like they would be very similar to those of pastebin.com. A quick glance there leads me to believe there is no stated privacy policy, and I think they could do a better job of making it clear that anything entered is public. I'd expect better from you. :-)
They do provide three options regarding how long the entry should persist, which might be an idea to consider.
- Ambush Commander
Sounds like an idea. Selectable options on how long the sample lasts and a link to a privacy page.
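A selectable lifetime could be handled along these lines. This is a sketch under assumptions: the three choices ('day', 'month', 'forever') and the helper names are made up for illustration, and nothing here reflects how pastebin.com actually implements it.

```php
<?php
// Hypothetical lifetime options, in seconds; null means "never expires".
const LIFETIMES = [
    'day'     => 86400,
    'month'   => 2592000, // 30 days
    'forever' => null,
];

// Turn the user's choice into an expiry timestamp (or null for no expiry).
function expires_at(string $choice, int $now): ?int
{
    if (!array_key_exists($choice, LIFETIMES)) {
        throw new InvalidArgumentException("Unknown lifetime: $choice");
    }
    $ttl = LIFETIMES[$choice];
    return $ttl === null ? null : $now + $ttl;
}

// True once a stored sample has passed its expiry timestamp.
function is_expired(?int $expiresAt, int $now): bool
{
    return $expiresAt !== null && $now >= $expiresAt;
}
```

Storing the computed timestamp alongside each sample would let a periodic cleanup job, or the lookup itself, skip anything that has expired.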
Looking over the md5 comments, this might be a more compact way of using MD5:
Code:
rtrim(base64_encode(pack("H*",md5($data))),"=");
...which is 22 characters versus 32. demo.php?n=Ux/1AxD2+0XN8Ivx8fjs5A is quite palatable, although I don't think that URLs will be very appreciative of slashes and plus signs (although there appears to be a variant of base64 that uses filename safe characters, namely '-_' for '+/').
- Ollie Saunders
- DevNet Master
- Posts: 3179
- Joined: Tue May 24, 2005 6:01 pm
- Location: UK
- Ambush Commander
- RobertGonzalez
Why not offer the option of saving samples? If a user opts for this (a nice little opt-in type of exoneration), then you can assign them a generic user ID of some sort (like an 8-character random string), which can be used to validate the entry ID they select so no one can trawl the DB for other people's samples.
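Roughly, that could look like the sketch below. It is only an illustration: the function names and the array standing in for the database are assumptions, and the 8-character token is generated as 4 random bytes in hex, which is just one way to get such a string. A longer token would be harder to guess.

```php
<?php
// Hypothetical opt-in storage: a sample is only readable by someone who
// presents both its entry ID and a token issued when it was saved.

function new_owner_token(): string
{
    return bin2hex(random_bytes(4)); // 8 hex characters
}

// Saves the sample only if the user opted in; returns the ID and token.
function save_sample(array &$db, string $html, bool $optIn): ?array
{
    if (!$optIn) {
        return null;
    }
    $id = md5($html);
    $token = new_owner_token();
    $db[$id]['html'] = $html;
    $db[$id]['tokens'][] = $token; // identical HTML may have several owners
    return ['id' => $id, 'token' => $token];
}

// Returns the sample only when the token matches one issued for that ID.
function load_sample(array $db, string $id, string $token): ?string
{
    foreach ($db[$id]['tokens'] ?? [] as $known) {
        if (hash_equals($known, $token)) {
            return $db[$id]['html'];
        }
    }
    return null;
}
```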
PS | This is not a fully thought-through idea, but it might lead to something useful for you.
- Ambush Commander