
Best Way to Compress Hundreds of GB

Posted: Mon Aug 24, 2015 3:04 pm
by Skara
I've brought up more specific questions related to this problem previously, but now I'm running into a new wall with larger file sizes.

Problem: Backing up hundreds of gigabytes (per day) to AWS Glacier.

I work at a film production company. The newest camera we're investing in shoots somewhere between 19 and 180 MB per second, depending on camera settings.
** FYI, that's already about as compressed as it's going to get. Uncompressed camera footage (assuming you had the capability to record it) would be about 100 times that large.

I was pretty close to a solution with our previous gear, but the increased file size of the newer camera is a challenge. I received a single file today that was 180GB, and that wasn't even recorded at the highest quality.

SO: I'm using the AWS Glacier PHP SDK. Uploading...shouldn't be a problem due to the way the SDK chunks the files. And (I think) Glacier's max filesize is 4TB. Processing before upload is the problem. I'm running out of memory before I should. I'm trying to optimize what I have, but here's my question...

What's the best way to condense files into a single file (e.g. .tar) to upload? File size varies significantly. Some archives have a few dozen XML files, totalling around 1MB, while others have a few dozen video files, averaging 2-20GB each, with the total average somewhere between 4GB and 400GB. Is there an existing (relatively straightforward) solution?

Re: Best Way to Compress Hundreds of GB

Posted: Mon Aug 24, 2015 3:50 pm
by requinix
I think I would call `tar` from the command-line, it being the most "obvious" solution and likely one of the simplest. You may want to investigate moving work into shell scripts (and cron jobs?) and out of PHP code, I don't know.

Looking at the PHP source code, I think PharData will also work for a more PHP-oriented approach: it seems to operate on streams so adding a file should not try to load it into memory.
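A minimal sketch of the command-line route (the temp directory and file names here are placeholders, not the real footage layout):

```shell
# Build one uncompressed tar per day. tar streams each file through a small
# fixed buffer, so memory stays flat no matter how big the archive gets.
# The temp directory stands in for the real footage directory.
SRC_DIR=$(mktemp -d)
echo "fake footage" > "$SRC_DIR/clip.mov"
ARCHIVE="$SRC_DIR/day.tar"

# -c create, -f output file, -C change into the source dir first;
# no -z because the footage is already about as compressed as it gets.
tar -cf "$ARCHIVE" -C "$SRC_DIR" clip.mov

tar -tf "$ARCHIVE"    # list the contents to verify
```

From PHP the same thing could be a single call, e.g. `exec('tar -cf ' . escapeshellarg($archive) . ' -C ' . escapeshellarg($dir) . ' .');`.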

Re: Best Way to Compress Hundreds of GB

Posted: Mon Aug 24, 2015 5:37 pm
by Christopher
I would not waste time putting compressed files into an archive to then copy. In this case, I think I would have the backup script create a directory with the date as the name in a backups directory, then move the files in to that created directory, then rsync the whole backups directory to AWS Glacier. You could then remove the directory and its contents. rsync will only copy files not on AWS Glacier so you can keep as many or few files local as you need.

Re: Best Way to Compress Hundreds of GB

Posted: Tue Aug 25, 2015 9:13 am
by Skara
Well, the PHP script runs as a cron job at 7pm every day. The SDKs available are PHP, Ruby, and Java. I loathe Java :) and I'm not really familiar enough with Ruby. I could separate out the archiving into another language and process, but it would just add a layer of complexity that I'd rather avoid if possible.

Does calling `tar` from php not have memory issues? I think PharData does a better job just from it being stream-based, but maybe I'm wrong.

As far as not archiving first, AWS pricing is complex, but to simplify it: the more files, the more requests, and the more cost. By tar'ing the files on my side before upload, I can keep the costs pretty close to the advertised $0.01/GB.

FYI, PharData seems to be great with memory... until you use PharData::compress(). For a 1.5GB file, using PharData::compress(Phar::GZ) ramped up my memory to > 2GB. Not super-useful in my case anyway since I'm working with video files, but an issue if compression is important.

Re: Best Way to Compress Hundreds of GB

Posted: Tue Aug 25, 2015 1:02 pm
by Christopher
Skara wrote:As far as not archiving first, AWS pricing is complex, but to simplify it: the more files, the more requests, and the more cost. By tar'ing the files on my side before upload, I can keep the costs pretty close to the advertised $0.01/GB.
Archiving compressed files may increase the total size due to the added tar wrapper data. rsync will compress non-compressed files before transferring. And obviously only transferring files that have changed minimizes transfers.

Re: Best Way to Compress Hundreds of GB

Posted: Tue Aug 25, 2015 1:22 pm
by requinix
Skara wrote:Does calling `tar` from php not have memory issues? I think PharData does a better job just from it being stream-based, but maybe I'm wrong.
Both tar and gzip create output in sequence, meaning that they do not need to seek through files. So they shouldn't have memory issues.
Skara wrote:As far as not archiving first, AWS pricing is complex, but to simplify it: the more files, the more requests, and the more cost. By tar'ing the files on my side before upload, I can keep the costs pretty close to the advertised $0.01/GB.
tar will add some (not much) overhead per file, like Christopher said, and if the files are already quite compressed then gzip might not help much. The easiest way to check is to create a tar+gzip and see how the filesize compares to the sum of the source file sizes.
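That check is easy to script; a sketch with random bytes standing in for already-compressed footage (text would compress well, video essentially won't):

```shell
# Build the same archive with and without -z and compare the sizes.
SRC=$(mktemp -d)
head -c 1000000 /dev/urandom > "$SRC/clip.mov"

tar -cf  "$SRC/check.tar"    -C "$SRC" clip.mov
tar -czf "$SRC/check.tar.gz" -C "$SRC" clip.mov

# If the .tar.gz lands within a percent or two of the .tar, gzip is just
# burning CPU for nothing on this data.
ls -l "$SRC/check.tar" "$SRC/check.tar.gz"
```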

Re: Best Way to Compress Hundreds of GB

Posted: Tue Aug 25, 2015 4:53 pm
by Christopher
I am wondering why you'd want to tar the files together anyway? Maybe it makes sense for the small files if they are related (plus they are compressible). But I'd want the video files uploaded individually so I could pull off just the ones I needed later -- instead of having to waste time downloading a bunch of files to get one or a few.

Re: Best Way to Compress Hundreds of GB

Posted: Tue Aug 25, 2015 6:50 pm
by requinix
Christopher wrote:I am wondering why you'd want to tar the files together anyway? Maybe it makes sense for the small files if they are related (plus they are compressible). But I'd want the video files uploaded individually so I could pull off just the ones I needed later -- instead of having to waste time downloading a bunch of files to get one or a few.
It makes sense if you consider how Glacier works: you get lots of storage to upload whatever you want but retrieving files takes time. Like, hours. No really. Glacially slow, as it were.
If you think that you'd want access to multiple files at once then an archive is the way to go: separate files means separate retrieval requests (and assorted charges) while one single archive for everything only means one request.

Re: Best Way to Compress Hundreds of GB

Posted: Tue Aug 25, 2015 11:21 pm
by Christopher
requinix wrote:It makes sense if you consider how Glacier works: you get lots of storage to upload whatever you want but retrieving files takes time. Like, hours. No really. Glacially slow, as it were.
If you think that you'd want access to multiple files at once then an archive is the way to go: separate files means separate retrieval requests (and assorted charges) while one single archive for everything only means one request.
I guess the actual data and usage would determine the trade-offs. I think my concern is that if you only want a few huge files in an archive with 10-20 huge files, then you will take glacially more time to download the whole archive and you will have to retrieve much more data (and pay more) than necessary. If you always want the whole set of files, then tar'ing them makes sense.

Re: Best Way to Compress Hundreds of GB

Posted: Wed Aug 26, 2015 9:56 am
by Skara
1) If we ever needed to download one file, it would be because we lost the entire shoot, so we'd want to download all of it anyway.
2) The tiny bit of overhead the tar wrapper has doesn't bother me. The cost of uploading hundreds of files versus dozens makes a much bigger difference.
requinix wrote:Both tar and gzip create output in sequence, meaning that they do not need to seek through files. So they shouldn't have memory issues.
Interesting. Because using gzip through PharData kills my memory even on (relatively) small files. 1.5GB or more eats up over 2GB of memory and kills the process. So far I'm successfully tar'ing files up to 10GB without using gzip.

Re: Best Way to Compress Hundreds of GB

Posted: Wed Aug 26, 2015 12:12 pm
by Christopher
Skara wrote:Interesting. Because using gzip through PharData kills my memory even on (relatively) small files. 1.5GB or more eats up over 2GB of memory and kills the process. So far I'm successfully tar'ing files up to 10GB without using gzip.
If the data files are compressed, do you need to gzip? What is the difference in size between just a .tar and a .tgz of the same data files?

Re: Best Way to Compress Hundreds of GB

Posted: Wed Aug 26, 2015 3:39 pm
by requinix
Skara wrote:Interesting. Because using gzip through PharData kills my memory even on (relatively) small files. 1.5GB or more eats up over 2GB of memory and kills the process. So far I'm successfully tar'ing files up to 10GB without using gzip.
I was thinking of the command-line programs, not PHP. It may be that PharData tries to read everything through memory while compressing.

You could use PharData to create tars then run them through gzip on the command-line after they're finalized. But you really should check to see if gzipping actually gets you any measurable gains.
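A sketch of that two-step approach, with a random file standing in for the tar that PharData finalized:

```shell
# Compress the finished archive out-of-process. gzip streams the file,
# so memory use stays constant even for a multi-GB archive.
d=$(mktemp -d)
head -c 1000000 /dev/urandom > "$d/day.tar"

gzip -9 "$d/day.tar"      # writes day.tar.gz and removes the original

# gzip -l reports the compression ratio; a ratio near 0% means the gzip
# step can be dropped entirely for this kind of data.
gzip -l "$d/day.tar.gz"
```

From PHP that last step could be e.g. `exec('gzip -9 ' . escapeshellarg($tarPath));`.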

Re: Best Way to Compress Hundreds of GB

Posted: Thu Aug 27, 2015 8:26 pm
by Vegan
7-Zip can carve up a stream into chunks of any size you want, making it a handy tool for HTTP, which tends to have problems with files above 500MB.

7-Zip also has a store option with no compression, which means the input and output streams are passed straight through with no intervening compression step.
It's advisable to get MD5summer as well to make sure the parts are uploaded intact.
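A sketch of the chunk-and-checksum idea, using coreutils split(1) as a portable stand-in for 7-Zip's volume option (roughly `7z a -v500m -mx0 parts.7z dir/` for store-only volumes); the file names are placeholders:

```shell
# Split an archive into fixed-size parts and checksum each one so every
# part can be verified after upload.
d=$(mktemp -d)
head -c 3000000 /dev/urandom > "$d/day.tar"

split -b 1M "$d/day.tar" "$d/day.tar.part_"   # -> part_aa, part_ab, part_ac
md5sum "$d"/day.tar.part_* > "$d/parts.md5"   # verify later: md5sum -c parts.md5

# The original stream is just the parts concatenated back in order:
cat "$d"/day.tar.part_* > "$d/rejoined.tar"
cmp "$d/day.tar" "$d/rejoined.tar" && echo "parts intact"
```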

Re: Best Way to Compress Hundreds of GB

Posted: Mon Aug 31, 2015 1:41 pm
by Skara
Hm. Well, no, I don't actually need to compress most of the files. I was running it because some of the data is textual (XML, usually), but I've just given up on that.

To upload to AWS, the entire payload, as well as each uploaded part, has to be tree-hashed. Way more intensive than a simple MD5.
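For anyone curious, Glacier's tree hash (SHA-256 over 1 MiB chunks, then pairwise over the binary digests up to a single root) can be sketched in shell; the real script would do this in PHP with hash_init()/hash_update(), and the function name here is just for illustration:

```shell
# Glacier-style SHA-256 tree hash: hash the payload in 1 MiB chunks, then
# repeatedly hash concatenated pairs of *binary* digests until one remains.
# Needs sha256sum and xxd.
treehash() {
    payload="$1"
    workdir=$(mktemp -d)
    # -a 5 allows enough suffixes for multi-TB payloads at 1 MiB per chunk
    split -a 5 -b 1048576 "$payload" "$workdir/chunk_"
    for f in "$workdir"/chunk_*; do
        sha256sum "$f" | cut -d' ' -f1
    done > "$workdir/level"
    # Collapse the digest list pairwise until only the root is left.
    while [ "$(wc -l < "$workdir/level")" -gt 1 ]; do
        : > "$workdir/next"
        while read -r a && { read -r b || b=""; }; do
            if [ -n "$b" ]; then
                # xxd -r -p turns the two hex digests back into 64 raw bytes
                printf '%s%s' "$a" "$b" | xxd -r -p \
                    | sha256sum | cut -d' ' -f1 >> "$workdir/next"
            else
                echo "$a" >> "$workdir/next"   # odd digest carried up as-is
            fi
        done < "$workdir/level"
        mv "$workdir/next" "$workdir/level"
    done
    cat "$workdir/level"
    rm -rf "$workdir"
}

# Demo on a 3 MiB random payload (three chunks, two hashing levels):
demo=$(mktemp -d)
head -c 3145728 /dev/urandom > "$demo/payload"
treehash "$demo/payload"
```

For a payload under 1 MiB the tree hash is just the plain SHA-256 of the file, which makes a handy sanity check.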

Anyway, thanks all. I haven't gotten to the 100GB+ uploads, only getting 5-10GB archives uploaded right now, but if I run into issues on larger files I'll come back for help.

Re: Best Way to Compress Hundreds of GB

Posted: Mon Aug 31, 2015 8:07 pm
by Vegan
Man, you have a demanding workload.

Mind you, computer chess is so extreme that it's an ideal platform for research.

If MD5 is not enough, PHP has sha1() available to make a hash of whatever you need.