Consistent DB references to uploaded media files in filesys
Moderator: General Moderators
Consistent DB references to uploaded media files in filesys
Hi;
LAMP infrastructure. We have a website that allows users to upload media & image files. We keep the files stored as files in the file system, and keep references to them in the MySQL db.
Inconsistencies happen, however. Maybe a code "upgrade" that doesn't work as it should, or a system crash between the file upload or delete & the db registration... point is, sometimes there are files in the upload directory that are not represented in the DB, and v/v; files referred to in the DB that don't exist in the filesystem.
My question is, are there any good practices, standard approaches to ensuring consistency between the file system & DB? Obviously, we can run a periodic cron job that deletes files in the upload directory that don't have references in the DB, and delete records in the DB that don't have a corresponding existing file in the filesystem. And maybe that's the way to go, but I wondered if there is another more clever approach that hasn't occurred to me.
Any suggestions?
Thanks,
Paul
LAMP infrastructure. We have a website that allows users to upload media & image files. We keep the files stored as files in the file system, and keep references to them in the MySQL db.
Inconsistencies happen, however. Maybe a code "upgrade" that doesn't work as it should, or a system crash between the file upload or delete & the db registration... point is, sometimes there are files in the upload directory that are not represented in the DB, and v/v; files referred to in the DB that don't exist in the filesystem.
My question is, are there any good practices, standard approaches to ensuring consistency between the file system & DB? Obviously, we can run a periodic cron job that deletes files in the upload directory that don't have references in the DB, and delete records in the DB that don't have a corresponding existing file in the filesystem. And maybe that's the way to go, but I wondered if there is another more clever approach that hasn't occurred to me.
Any suggestions?
Thanks,
Paul
Re: Consistent DB references to uploaded media files in filesys
I can think of a couple things:
1. Store uploaded files in a temporary directory until the interaction is complete and the db record is created. Only then move the files to public directory you are serving them from ('/upload' or whatever). Clean the temporary directory periodically with a cron job.
2. When you fetch a db record containing a file name, check if that file physically exists on the server using is_readable() or is_file(). If that file does not exist, delete the db record and show a place holder page instead ("We're sorry, the file you requested does not exist" or something to that effect).
1. Store uploaded files in a temporary directory until the interaction is complete and the db record is created. Only then move the files to public directory you are serving them from ('/upload' or whatever). Clean the temporary directory periodically with a cron job.
2. When you fetch a db record containing a file name, check if that file physically exists on the server using is_readable() or is_file(). If that file does not exist, delete the db record and show a place holder page instead ("We're sorry, the file you requested does not exist" or something to that effect).
-
alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
Re: Consistent DB references to uploaded media files in filesys
For starters I would try and determine why these inconsistencies were happening. Are the files not being written/moved to the "real" directory after upload because of file permissions?
Are the files failing to upload because of system constraints (PHP script execution timeout, filesize > than allowed, etc).
Secondly, you could store the files in the database itself and use PDO and it's transaction support to ensure nothing is permenant unless it's tripping out on ACID.
Running a cron job sounds like a hack you apply after the system has been designed and you need a quick fix.
Are the files failing to upload because of system constraints (PHP script execution timeout, filesize > than allowed, etc).
Secondly, you could store the files in the database itself and use PDO and it's transaction support to ensure nothing is permenant unless it's tripping out on ACID.
Running a cron job sounds like a hack you apply after the system has been designed and you need a quick fix.
Re: Consistent DB references to uploaded media files in filesys
Have an option in the administration panel, or whatever the site has, that does the following things:
1] For each file in the upload directory that doesn't have a reference in the database, creates a reference with "default" attributes.
2] For each record that doesn't have a corresponding file, deletes it.
This is probably the safest thing to do, since it won't destroy all the files if something goes wrong with your DB.
1] For each file in the upload directory that doesn't have a reference in the database, creates a reference with "default" attributes.
2] For each record that doesn't have a corresponding file, deletes it.
This is probably the safest thing to do, since it won't destroy all the files if something goes wrong with your DB.
Re: Consistent DB references to uploaded media files in filesys
If you need concurrency control between the binary media files and their table based meta-data, then the binary data needs to go into the RDBMS. That is what transaction control is. A few might gawk at the fact I recommended putting files in the database, but the reality is it is 2008, and a database field can handle a multi-GB file no problemo. I'd always put data on the filesystem if I had a choice though.
-
alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
Re: Consistent DB references to uploaded media files in filesys
WHy does it "need" to go in the database? It makes things a little easier but transactional models (I think) would still help even if you don't store the raw file data in the DB.If you need concurrency control between the binary media files and their table based meta-data, then the binary data needs to go into the RDBMS. That is what transaction control is.
Check if the file uploads properly and if it doesn't, rollback any changes you have made hiterto and report an error.
It's more secure and (I believe) actually faster in terms of retreival, so long as you use indexes. From what I understand of the low level mechnics of RDBMS...records are rarely ever "removed" in real time...in stead most RDBMS mark that record as "deleted" but keep it in place. Basically your data might become fragmented and with BLOB's that can be pressing issue as you might have gapping 2GB hole or more in a single table.few might gawk at the fact I recommended putting files in the database, but the reality is it is 2008, and a database field can handle a multi-GB file no problemo. I'd always put data on the filesystem if I had a choice though.
The other challenge (despite being extremely simple) is the additional proxy now needed to access the file data indepently...unless you set the headers correctly some browsers will not deal with a proxied data file properly.
I'm on the fence...I have a fairly well tested proxy script with a lot of hacks, checks, etc to ensure files are delivered successfully under most browsers...with the use of .htaccess a end user would never even know I was using a proxy...but I still like to keep files on the file system which offers a single huge benefit.
The are directly updatable/modified using a plethora of existing tools...
Once data is stored in the database...you need to use a database specific client to get at that data...and say good bye to version control, unless you implement something webdav_svn or something I suppose.
One day I will move all data over to the database though...that is a given.
Cheers,
Alex
Re: Consistent DB references to uploaded media files in filesys
Having the exact filename and path is not much different than having an index in the database, so I don't believe retrieval speed would (significantly) different. What could be different, depending on the storage engine, is writing - if you use an engine that uses table-locks for writing (MyISAM), storing a 2gb file would basically cripple the database.It's more secure and (I believe) actually faster in terms of retreival, so long as you use indexes.
You got that a little mixed up. Some storage engines don't release the space used by deleted rows, instead they reallocate it to new rows as they are inserted. You can run OPTIMIZE TABLE on a table to reclaim this unused space - however, the records are indeed deleted.From what I understand of the low level mechnics of RDBMS...records are rarely ever "removed" in real time...in stead most RDBMS mark that record as "deleted" but keep it in place. Basically your data might become fragmented and with BLOB's that can be pressing issue as you might have gapping 2GB hole or more in a single table.
Re: Consistent DB references to uploaded media files in filesys
They need to go in the database, because the database implements the 'A' in ACID. http://www.opensourcearmenia.com/educat ... rdbms.html
Why roll your own transactional system when the RDBMS provides one?
Why roll your own transactional system when the RDBMS provides one?
-
alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
Re: Consistent DB references to uploaded media files in filesys
With a database there is no additional overhead in the file system being included into the equation. Granted most file systems retreive files fast...I do believe that a DB does it faster...I seen a study by IBM a few years back which backed up that claim with various tests under several of Linux VFS (Ext, etc) so unless your system uses a VFS which has faster retreival than a DB (which most servers would not bother doing unless your google) a DB does retreive the data faster, at least from the perspective of the caller.Having the exact filename and path is not much different than having an index in the database, so I don't believe retrieval speed would (significantly) different.
Not sure I follow...can you explain? Usually BLOB data is stored in it's own table (I typically use MyISAM -- just by default) and I have never had a problem when I did use DB.What could be different, depending on the storage engine, is writing - if you use an engine that uses table-locks for writing (MyISAM), storing a 2gb file would basically cripple the database.
I don't think I mixed it up...I just excluded the part where storage engines will reuse deleted record slots when creating a new record.You got that a little mixed up. Some storage engines don't release the space used by deleted rows, instead they reallocate it to new rows as they are inserted. You can run OPTIMIZE TABLE on a table to reclaim this unused space - however, the records are indeed deleted.
http://dev.mysql.com/doc/refman/5.0/en/ ... table.html
In this case when variable length records are used they cannot be as easily reused for the next insert, and seeing as BLOB is usually variable length, I didn't feel it appropriate to mention, but I guess I was wrong.
Not sure who you are refering to, but I assume me, in which case...who said anything about rolling your own transactional model? I implied using transactions in conjunction with your PHP code. Basically why not upload the file, check if it fails to upload (permissions, size, etc) and if it does ROLLBACK any CRUD operations made up uto then?Why roll your own transactional system when the RDBMS provides one?
Cheers,
Alex
Re: Consistent DB references to uploaded media files in filesys
Unless the table is loaded in memory (which is doubtful if it stores 2gb files per row), a database works off the file system and has the additional overhead of parsing / finding the required rows. So I don't see how it can be faster than using the actual file-system.With a database there is no additional overhead in the file system being included into the equation.
ISAM engines employ table-locks for writing, meaning the table can not be read or modified while a write operation is in progress (INSERT / UPDATE / DELETE). Loading large files into a database takes time, the same as any file-system operation dealing with large files - during that time, any queries attempting to access that table would be placed in a queue until the query finishes. So while you are inserting a new file to your database, other files can not be added or read from it.Not sure I follow...can you explain? Usually BLOB data is stored in it's own table (I typically use MyISAM -- just by default) and I have never had a problem when I did use DB.
I read the same link before posting. If you are refering to this sentence:I don't think I mixed it up...I just excluded the part where storage engines will reuse deleted record slots when creating a new record.
http://dev.mysql.com/doc/refman/5.0/en/ ... table.html
It is somewhat ambiguous. I don't think the actual data is maintained, but only the space allocation.Deleted rows are maintained in a linked list
Re: Consistent DB references to uploaded media files in filesys
It depends on your access patterns. I had tons of files where I had to open the file, fseek to a header, read a fixed length metadata container, parse that metadata, then fseek to points specified in the meta data to read back polygons embedded in large binary files. I wrote a script to go through and store each polygon as binary data, 1 row = 1 polygon, 2 tables, a metadata table and a BLOB lookup table, performance improved 100x +pytrin wrote:a database works off the file system and has the additional overhead of parsing / finding the required rows. So I don't see how it can be faster than using the actual file-system.
Re: Consistent DB references to uploaded media files in filesys
But if you were to store the same files directly into the database and retrieve them to perform the manipulations you described, you would have had no speed benefits. That's what I was referring to.
Re: Consistent DB references to uploaded media files in filesys
Of course, I'm just saying if you're doing more than storage retrieval, for instance if you're doing transactional concurrency control of the data, or if you need to explicitly chunk your data, or any other # of reasons, the database is more applicable, (you should benchmark it both ways if in doubt )
-
alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
Re: Consistent DB references to uploaded media files in filesys
I've searched like hell and high water for that article and found nothing...I'm sure it was an IBM or Oracle source.database works off the file system and has the additional overhead of parsing / finding the required rows. So I don't see how it can be faster than using the actual file-system.
The argument has long been true, that the overhead in using a RDBMS for retreiving files was true but somehow I don't think that is true anymore.
While Googling now all I found was arguments (and apparent truth's) on how a file IS faster...but none of these articles are on IBM or similar...and most importantly none offered extensive benchmarks using various VFS and table types, etc...
The article I am refering to was quite thourough and certainly convinced me the author has really done his homework and determined the database retreival was usually faster, whereas storing files had the benefit of faster deleting and updating.
There are more benefits in storing the file in the DB though...and any performance issues could be solved by using a trivial cache mechanism of storing frequently access files on the disk, but I'm not sure that is really nessecary. I'd have to run becnhmarks myself now to be sure, as there are many who aruge otherwise.
What is an argument though, without reproducable benchmarks?
I see, I figured something that. My counter argument again...is that I use separate tables to store BLOB data and in most applications file uploading is not the primary purpose (outside of a file manager I suppose). In those cases one could use a different (row-locking) table or store files on disk and bulk import via cron or similar, no?ISAM engines employ table-locks for writing, meaning the table can not be read or modified while a write operation is in progress (INSERT / UPDATE / DELETE). Loading large files into a database takes time, the same as any file-system operation dealing with large files - during that time, any queries attempting to access that table would be placed in a queue until the query finishes. So while you are inserting a new file to your database, other files can not be added or read from it.
Maintained? As I understand it the record is flagged/marked as a deleted if the rows are fixed sized. The record will not show up in a SELECT but will be over written upon the next INSERT. On variable length records the record space remains marked until you run an OPTIMIZE and defragment your table. At least that is what would make sense to me at first thought. Of course there may be more advanced algorithms now which make better use of fragmented space, but I think OPTIMIZE is probably the most efficient.It is somewhat ambiguous. I don't think the actual data is maintained, but only the space allocation.