
stream zip file in chunks... Is it possible? *NOW WITH CODE*

Posted: Wed Mar 22, 2006 12:38 pm
by benkor
Hello everyone,

I'm new here and I need some help... :D
I'm writing a script to compress (zip) and send multiple files with PHP. I can have a lot of files at once, including some really huge ones (several hundred MB). So I figured I would do the processing in chunks: compress the files piece by piece to avoid memory problems and send them out one after another.

But I've run into a big problem: I can't see how to compress (with either gzcompress or deflate) chunks of a file while keeping its integrity. My tests show that the compressed file is different (and corrupted) when I compress it in chunks. I know it works by using gzwrite in a loop on a temp file, but I want to stay in memory to avoid multiple file accesses, and to be able to send data out before the file is completely zipped...

So my question is: can I compress multiple files on-the-fly, chunk by chunk, and send them out as a single streamed zip file to the client?

UPDATE: I added my code for you to check out.
Basically, what I do now is:
get a file from the ftp server > read it in chunks > gzwrite the chunks to a tmp file > read the entire tmp file into memory > add it to the archive > close the archive > send the archive URL for download.
And what I want to do is:
get a file from the ftp server > send the file headers to the client > read it in chunks > compress (or deflate) the chunks in memory > send the chunks > process the next files > send the global archive descriptor > voilà, the whole archive has been sent to the client without ever having a whole file in memory or waiting for a file to be completely compressed before sending it.

So what do you think?
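As an editorial aside: the streamed pipeline described above can be sketched with PHP's incremental zlib API (`deflate_init` / `deflate_add`, available since PHP 7.0, so much newer than the PHP of this thread). One deflate context per file means every chunk can be echoed to the client as soon as it is produced, while the whole thing remains a single valid raw-deflate stream. Function and variable names here are illustrative:

```php
<?php
// Sketch of the streamed pipeline: one deflate context per file, fed chunk by
// chunk; the output stays a single raw-deflate stream (as stored in a zip entry).
function deflateChunks(array $chunks)
{
    $ctx = deflate_init(ZLIB_ENCODING_RAW);   // raw deflate, no gzip/zlib wrapper
    $out = "";
    foreach ($chunks as $chunk) {
        // in a real script, each piece could be echoed immediately instead of buffered
        $out .= deflate_add($ctx, $chunk, ZLIB_NO_FLUSH);
    }
    $out .= deflate_add($ctx, "", ZLIB_FINISH);  // emit the final deflate block
    return $out;
}

$data       = str_repeat("stream me in chunks ", 5000);
$compressed = deflateChunks(str_split($data, 32000));
assert(gzinflate($compressed) === $data);     // one stream, re-inflatable as a whole
?>
```

Note the contrast with calling gzdeflate() once per chunk: here the context is kept alive across chunks, so only the very last block carries the final-block bit.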

Code:

<?php
	// include zip lib
	require_once('zip.lib.php');
	
	// avoid PHP timeouts
	set_time_limit(0);
	ini_set("max_execution_time", 0);
	
	// unique id for this particular archive process
	$id = md5(rand(0, 1000000));
	
	// change extension for Mac users
	if (strpos($_SERVER["HTTP_USER_AGENT"], "Macintosh") !== false || strpos($_SERVER["HTTP_USER_AGENT"], "Mac_PowerPC") !== false) {
		$archiveFileName = "archive" . $id . ".sit";
	} else {
		$archiveFileName = "archive" . $id . ".zip";
	}

	// open ftp connection to retrieve files
	$connection = ftp_connect($_POST["server"]) or die("Connexion échouée.");
	ftp_login($connection, $_POST["login"], $_POST["password"]) or die("Login échoué.");
	ftp_pasv($connection, false);

	ftp_chdir($connection, $_POST["root"]) or die("Dossier non ouvert.");

	// get total size of files to process
	$tmpTotalSize=0;
	
	$i=0;
	while (isset($_POST["folder_".$i]))	{
		// addFolderSize() (not shown here) adds the total size of a folder's files to $tmpTotalSize
		addFolderSize($connection, $_POST["root"], ".", $_POST["folder_".$i]);
		$i++;
	}
	
	$i=0;
	while (isset($_POST["file_".$i])) {
		$tmpTotalSize += ftp_size($connection, $_POST["file_".$i]);
		$i++;
	}

	// session vars for status check with another script
	session_start();
	$_SESSION["processedSize"] = 0;
	$_SESSION["totalSize"] = $tmpTotalSize;
	$_SESSION["archiveFileName"] = $archiveFileName;
	session_write_close();
	
	// define size of data chunks
	$chunkSize = 32000;

	// local temp file : each file is transferred from ftp to the web server before being compressed
	$tmp_file_name = "download" . $id . ".tmp";

	// create archive
	$zip = new zipfile();
	
	// recursive function to process directories and subdirectories
	function addFolder($connection, $root, $path, $folder, $zip, $archiveFileName, $chunkSize)
	{
		ftp_chdir($connection, $folder);
		$tab = @ftp_rawlist($connection, ".");

		for($i=1; $i<count($tab); $i++)	{
			if ($tab[$i][0] == "d")	{
				$folder_ = substr(strrchr($tab[$i], " "), 1);

				if ($folder_ != "." && $folder_ != "..") {
					addFolder($connection, $root."/".$folder, $path."/".$folder, $folder_, $zip, $archiveFileName, $chunkSize);
				}
			} else {
				$file_name = substr(strrchr($tab[$i], " "), 1);
				if (ftp_get($connection, $GLOBALS["tmp_file_name"], $file_name, FTP_BINARY)) {
					$zip->addFile($archiveFileName, $GLOBALS["tmp_file_name"], $path."/".$folder."/".$file_name, 0, $chunkSize);

					unlink($GLOBALS["tmp_file_name"]);
				}
			}
		}
		ftp_cdup($connection);
	}
	
	// process directories
	$i=0;
	while (isset($_POST["folder_".$i]))	{
		addFolder($connection, $_POST["root"], ".", $_POST["folder_".$i], $zip, "tmp/" . $archiveFileName, $chunkSize);
		$i++;
	}

	// process files
	$i=0;
	while (isset($_POST["file_".$i])) {
		if (ftp_get($connection, $tmp_file_name, $_POST["file_".$i], FTP_BINARY)) {
			$zip->addFile("tmp/" . $archiveFileName, $tmp_file_name, $_POST["file_".$i], 0, $chunkSize);
			
			unlink($tmp_file_name);
		}
		$i++;
	}
	
	// close ftp connection
	ftp_close($connection);
	
	// finalize archive
	$zip->closeArchive("tmp/" . $archiveFileName);
	
	// save archive filename before unset
	$output = $archiveFileName;
	
	// unset session vars
	session_start();
	unset($_SESSION["processedSize"]);
	$_SESSION["totalSize"] = 0;
	$_SESSION["archiveFileName"] = "";
	session_write_close();
	
	// set back PHP timeout values
	set_time_limit(30);
	ini_set("max_execution_time", 300);
	
	// echo archive filename
	echo ($output);	
?>
and my modified zip class lib:

Code:

<?php
/* $Id: zip.lib.php,v 1.6 2002/03/30 08:24:04 loic1 Exp $ */


/**
 * Zip file creation class.
 * Makes zip files.
 *
 * Based on :
 *
 *	http://www.zend.com/codex.php?id=535&single=1
 *	By Eric Mueller <eric@themepark.com>
 *
 *	http://www.zend.com/codex.php?id=470&single=1
 *	by Denis125 <webmaster@atlant.ru>
 *
 *	a patch from Peter Listiak <mlady@users.sourceforge.net> for last modified
 *	date and time of the compressed file
 *
 * Official ZIP file format: http://www.pkware.com/appnote.txt
 *
 * @access	public
 */
class zipfile
{
	/**
	 * Array to store compressed data
	 *
	 * @var  array	  $datasec
	 */
	//var $datasec		= array();

	/**
	 * Central directory
	 *
	 * @var  array	  $ctrl_dir
	 */
	var $ctrl_dir	  = array();

	/**
	 * End of central directory record
	 *
	 * @var  string   $eof_ctrl_dir
	 */
	var $eof_ctrl_dir = "\x50\x4b\x05\x06\x00\x00\x00\x00";

	/**
	 * Last offset position
	 *
	 * @var  integer  $old_offset
	 */
	var $old_offset   = 0;
	
	// header offset, to be adjusted after the compressed data is written
	//var $fileHeaderOffset = 1000000;
	
	// global offset at which the next file is appended to the archive (beware of the global header written at the end of the file)
	var $currentArchiveOffset = 0;
	
	// string length of the complete archive, global header excluded
	var $archiveStrlen = 0;
	
	/**
	 * Converts an Unix timestamp to a four byte DOS date and time format (date
	 * in high two bytes, time in low two bytes allowing magnitude comparison).
	 *
	 * @param  integer	the current Unix timestamp
	 *
	 * @return integer	the current date in a four byte DOS format
	 *
	 * @access private
	 */
	function unix2DosTime($unixtime = 0)
	{
		$timearray = ($unixtime == 0) ? getdate() : getdate($unixtime);

		if ($timearray['year'] < 1980) {
			$timearray['year']	  = 1980;
			$timearray['mon']	  = 1;
			$timearray['mday']	  = 1;
			$timearray['hours']   = 0;
			$timearray['minutes'] = 0;
			$timearray['seconds'] = 0;
		}

		return (($timearray['year'] - 1980) << 25) | ($timearray['mon'] << 21) | ($timearray['mday'] << 16) |
				($timearray['hours'] << 11) | ($timearray['minutes'] << 5) | ($timearray['seconds'] >> 1);
	} // end of the 'unix2DosTime()' method
	
	
	// bit by bit CRC32  from php.net documentation notes //
	function bitbybit_crc32($str,$first_call){
	
	   //reflection in 32 bits of crc32 polynomial 0x04C11DB7
	   $poly_reflected=0xEDB88320;
	
	   //=0xFFFFFFFF; //keep track of register value after each call
	   static $reg=0xFFFFFFFF;
	
	   //initialize register on first call
	   if($first_call) $reg=0xFFFFFFFF;
	  
	   $n=strlen($str);
	   $zeros=$n<4 ? $n : 4;
	
	   //xor first $zeros=min(4,strlen($str)) bytes into the register
	   for($i=0;$i<$zeros;$i++)
		   $reg ^= ord($str[$i]) << $i*8;
	
	   //now for the rest of the string
	   for($i=4;$i<$n;$i++){
		   $next_char = ord($str[$i]);
		   for($j=0;$j<8;$j++)
			   $reg=(($reg>>1&0x7FFFFFFF)|($next_char>>$j&1)<<0x1F)
				   ^($reg&1)*$poly_reflected;
	   }
	
	   //put in enough zeros at the end
	   for($i=0;$i<$zeros*8;$i++)
		   $reg=($reg>>1&0x7FFFFFFF)^($reg&1)*$poly_reflected;
	
	   //xor the register with 0xFFFFFFFF
	   return ~$reg;
	}
	

	/**
	 * Adds "file" to archive, chunks version
	 *
	 * @param  string	archive file name
	 * @param  string	file to compress
	 * @param  string	name of the file in the archive (contains the path)
	 * @param  integer	the current timestamp
	 * @param  integer	chunk size in bytes
	 *
	 * @access public
	 */
	function addFile($archiveFile, $tmpFile, $fileName, $time = 0, $chunkSize = 32000)
	{		
		// replace slashes
		$fileName = str_replace('\\', '/', $fileName);
		if (substr($fileName, 0, 2) == "./") {
			$fileName = substr($fileName, 2);
		}

		// compute crc32 (only for small files)
		$crcCreated = false;
		if (filesize($tmpFile) < 1000000) {
			$crcCreated = true;
			$crc = crc32(file_get_contents($tmpFile));
		}

		// open source file
		$sourceFileHandler = fopen($tmpFile, "r");
		
		// create and gzopen tmp compressed data file
		$tmpCompressedFileName = md5(rand(0,1000000)) . ".tmp";
		$compressedFileHandler = gzopen($tmpCompressedFileName, "w");
		
		for($i=0; $i < filesize($tmpFile); $i+=$chunkSize) {
			// get chunk string
			$tmpData = fread($sourceFileHandler, $chunkSize);

			// compress and write
			gzwrite($compressedFileHandler, $tmpData);
			
			// compute crc32 (only for big files)
			if (!$crcCreated) {
				if ($i==0) {
					$crc = $this->bitbybit_crc32($tmpData, true);
				} else {
					$crc = $this->bitbybit_crc32($tmpData, false);
				}
			}

			// update session var (count the bytes actually read, not the nominal chunk size)
			session_start();
			$_SESSION["processedSize"] += strlen($tmpData);
			session_write_close();
		}

		// close files
		fclose ($sourceFileHandler);
		gzclose($compressedFileHandler);
		
		// retrieve all compressed data
		$compressedFileHandler = fopen($tmpCompressedFileName, "r");		
		$compressedData = fread($compressedFileHandler, filesize($tmpCompressedFileName));		
		fclose ($compressedFileHandler);
		
		// delete tmp compressed file
		unlink($tmpCompressedFileName);
		
		// strip the 10-byte gzip header and the 8-byte gzip trailer (CRC32 + ISIZE) to keep only the raw deflate data
		$compressedData = substr($compressedData, 10, strlen($compressedData) - 18);
		
		// file header
		$hexdtime = pack('V', $this->unix2DosTime($time));	// DOS date/time packed little-endian

		$fileHeader		= "\x50\x4b\x03\x04";
		$fileHeader		.= "\x14\x00";						// ver needed to extract
		$fileHeader		.= "\x00\x00";						// gen purpose bit flag, default
		//$fileHeader	.= "\x00\x04";						// gen purpose bit flag, bit 3 switched for streaming of zip file
		$fileHeader		.= "\x08\x00";						// compression method
		$fileHeader		.= $hexdtime; 						// last mod time and date

		$c_len	 = strlen($compressedData); 				// compressed length
		$unc_len = filesize($tmpFile);						// uncompressed length

		$fileHeader 	.= pack('V', $crc);					// crc32
		//$fileHeader	.= pack('V', 0);					// 0 for streaming		
		$fileHeader		.= pack('V', $c_len);				// compressed filesize
		//$fileHeader	.= pack('V', 0);					// 0 for streaming		
		$fileHeader		.= pack('V', $unc_len);				// uncompressed filesize
		//$fileHeader	.= pack('V', 0);					// 0 for streaming		
		$fileHeader		.= pack('v', strlen($fileName));	// length of filename
		$fileHeader		.= pack('v', 0);					// extra field length
		$fileHeader		.= $fileName;

		// file footer
		// "data descriptor" segment (optional but necessary if archive is not served as file) (=file specific footer)
		$fileFooter  = pack('V', $crc);						// crc32
		$fileFooter .= pack('V', $c_len);					// compressed filesize
		$fileFooter .= pack('V', $unc_len);					// uncompressed filesize

		// concatenate all file data
		$compressedFile = $fileHeader . $compressedData . $fileFooter;
		
		// write file in archive
		$archiveHandler = fopen($archiveFile, "a+");
		fwrite ($archiveHandler, $compressedFile);
		fclose ($archiveHandler);
		
		$this->archiveStrlen += strlen($compressedFile);
		
		// central directory record
		$cdrec = "\x50\x4b\x01\x02";
		$cdrec .= "\x00\x00";					// version made by
		$cdrec .= "\x14\x00";					// version needed to extract
		$cdrec .= "\x00\x00";					// gen purpose bit flag
		$cdrec .= "\x08\x00";					// compression method
		$cdrec .= $hexdtime;					// last mod time & date
		$cdrec .= pack('V', $crc);				// crc32
		$cdrec .= pack('V', $c_len);			// compressed filesize
		$cdrec .= pack('V', $unc_len);			// uncompressed filesize
		$cdrec .= pack('v', strlen($fileName));	// length of filename
		$cdrec .= pack('v', 0 );				// extra field length
		$cdrec .= pack('v', 0 );				// file comment length
		$cdrec .= pack('v', 0 );				// disk number start
		$cdrec .= pack('v', 0 );				// internal file attributes
		$cdrec .= pack('V', 32 );				// external file attributes - 'archive' bit set

		$cdrec .= pack('V', $this -> old_offset ); // relative offset of local header
		//$this -> old_offset = $new_offset;
		$this -> old_offset += strlen($compressedFile);
		$this -> currentArchiveOffset += strlen($compressedFile);

		$cdrec .= $fileName;

		// optional extra field, file comment goes here
		// save to central directory
		$this -> ctrl_dir[] = $cdrec;
	}
	
	// add central directory and close file archive
	function closeArchive ($archiveFile) {
		$ctrldir = implode('', $this -> ctrl_dir);

		$globalFooter = $ctrldir .
			$this -> eof_ctrl_dir .
			pack('v', sizeof($this -> ctrl_dir)) .		// total # of entries "on this disk"
			pack('v', sizeof($this -> ctrl_dir)) .		// total # of entries overall
			pack('V', strlen($ctrldir)) .				// size of central dir
			pack('V', filesize($archiveFile)) .			// offset to start of central dir
			"\x00\x00";									// .zip file comment length
		
		$archiveHandler = fopen($archiveFile, "a+");
		fwrite ($archiveHandler, $globalFooter);
		fclose ($archiveHandler);

		return(sizeof($this -> ctrl_dir));
	}

} // end of the 'zipfile' class
?>

Posted: Wed Mar 22, 2006 1:10 pm
by Benjamin
There's a Unix utility called split, I believe, which could split your files into chunks.

Posted: Wed Mar 22, 2006 1:21 pm
by benkor
Thanks for the reply !

Unfortunately I'm working on a Windows box, and I want to read/compress/send the chunks directly from PHP to the client as a streamed file, so I think I need a 100% PHP solution... Any ideas?

Re: stream zip file in chunks... Is it possible? *NOW WITH CODE*

Posted: Fri Mar 24, 2006 12:24 am
by redmonkey
benkor wrote:So my question is : can I compress multiple files on-the-fly, chunk by chunk, and send them out as a single streamed zip file to the client ?
Yes.
benkor wrote: UPDATE : I added my code for you to check out :
Basically what I do now is :
get a file from ftp server > read it by chunks > gzwrite chunks in tmp file > read entire tmpfile in memory > add it into archive > close archive > send archive URL for download.
And what I want to do is :
get a file from ftp server > send file headers to client > read it by chunks > compress (or deflate) chunks in memory > send chunks > process next files > send global archive descriptor > voilà, the whole archive has been sent to client without ever having either a whole file in memory or waiting for a file to be completely compressed before sending it.

So what do you think ?
That's what the extended local header section is for. You can output each compressed chunk immediately after compressing it; in addition to updating/tracking the CRC, you also have to track the compressed size of the file. These values are sent in the extended header after the actual compressed data.
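To make that concrete, here is a minimal sketch of the streamed entry layout redmonkey describes: a local header with bit 3 of the general purpose flag set and zeroed CRC/sizes, followed (after the compressed data) by a data descriptor carrying the real values. Helper names are mine; the descriptor signature shown is the optional 0x08074b50 marker from the zip spec:

```php
<?php
// Streamed zip entry sketch: CRC and sizes are zero in the local header and
// delivered after the data in a descriptor. Helper names are illustrative.
function localHeaderStreamed($name, $dosTime4)
{
    $h  = "\x50\x4b\x03\x04";          // local file header signature
    $h .= "\x14\x00";                  // version needed to extract
    $h .= pack('v', 0x0008);           // gen purpose flag: bit 3 = streamed entry
    $h .= "\x08\x00";                  // compression method: deflate
    $h .= $dosTime4;                   // last mod time and date (4 bytes)
    $h .= pack('V', 0);                // crc32: not known yet
    $h .= pack('V', 0);                // compressed size: not known yet
    $h .= pack('V', 0);                // uncompressed size: not known yet
    $h .= pack('v', strlen($name));    // filename length
    $h .= pack('v', 0);                // extra field length
    return $h . $name;
}

function dataDescriptor($crc, $c_len, $unc_len)
{
    return "\x50\x4b\x07\x08"          // optional data descriptor signature (0x08074b50)
         . pack('V', $crc)
         . pack('V', $c_len)
         . pack('V', $unc_len);
}
?>
```

The fixed part of the local header is 30 bytes, so a 5-character name yields a 35-byte header; the descriptor with its signature is 16 bytes.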

Any reason why you can't have the actual servers use their OS's native tools/utils to create the archives prior to downloading? Although it is 'doable' within PHP, I wouldn't think it the most efficient solution.

Posted: Fri Mar 24, 2006 5:24 am
by benkor
Nice to hear it's possible! Thank you! :D

My idea was to compress and send each chunk as it is processed, instead of compressing the files OS-side. This way I can precisely track the status of the process and inform the client about it (especially when processing huge files > 100MB).
I reckon this would be less efficient than letting an OS app do all the work, but I don't know of a simple way to track the status of an external executable for this kind of process... And then my script would become OS-dependent too, which I'd like to avoid if possible, even if that means more processing time.

About the CRC and headers, you're absolutely right. I read about the zip file format in various places on the web and got this information too (changing the header flag to discard the header information and use the extended header at the end of the file instead). My main problem is that the resulting file is corrupted when compressing chunk by chunk, as opposed to a file compressed as a whole. So I guess I'm doing something wrong or not using the right compressor. I tried gzcompress and deflate...

When I compare the output in a hex editor, I see there is only a very small matching part at the beginning of the file. I think this may have something to do with the size of the compressed data blocks, which seems to be different each time, but I could be wrong... Any thoughts?
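A side note on the CRC tracking discussed here: the running CRC-32 can also be computed incrementally with PHP's hash extension (PHP 5.1+) instead of a hand-rolled bit-by-bit loop, since `crc32b` uses the same polynomial and reflection as PHP's crc32() and the value zip expects:

```php
<?php
// Track a CRC-32 across chunks with the hash extension; the result matches
// crc32() over the concatenated data.
$ctx   = hash_init("crc32b");
$whole = "";
foreach (array("first chunk ", "second chunk") as $chunk) {
    hash_update($ctx, $chunk);
    $whole .= $chunk;
}
$crc = hexdec(hash_final($ctx));   // hex digest -> integer
assert($crc === crc32($whole));    // same value as hashing everything at once
?>
```

(The integer comparison assumes a 64-bit PHP build, where crc32() returns a non-negative value.)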

Posted: Fri Mar 24, 2006 12:04 pm
by redmonkey
I'd suggest you test/debug in parts and with a subset of data files.

Try 'chunk deflating' standard ASCII text files and then re-inflating them; that makes it a lot easier to see what's going on than examining binary files in a hex editor. Perhaps gzdeflate would be a better way to go than using gzopen and gzwrite?

I'm not going to go through your code line by line but from a quick glance there are some issues.

You are defining the CRC and the compressed and uncompressed lengths in both the local and the extended header; this confuses some decompression utilities, so use one or the other.

The zip file you are creating is not in valid zip format: you have the extended header section without the extended header signature. Many decompression utils will ignore this, but from experience some will refuse to deal with the zip file (most notably StuffIt Expander).

There may be other issues; these are just what I see from a quick glance.

Posted: Fri Mar 24, 2006 12:39 pm
by benkor
Thanks for taking the time.

I found the missing extended header signature (0x08074b50); I will add it to my archive data. I did not test with StuffIt, only with WinRAR, the built-in extractor on WinXP, and the default Mac OS X extractor.

About defining some header info twice: I commented out the settings for streaming. The code is my working version, which adds a new file to the archive in one go (using the tmp file). I will set the values back to 0 (and set the right header flag bit, bit 3) when trying again to stream with gzdeflate.

I'll try again like you said with some really simple ASCII files and will keep you updated.

Posted: Mon Mar 27, 2006 10:57 am
by benkor
Hi

I tried gzdeflate as you suggested, but with no success... Re-inflating a chunk-deflated file returns only the first data block. I used the following code; maybe I'm missing something?

Code:

<?php
	$chunkSize = 25000;
	
	// open source file (binary mode matters on Windows)
	$sourceFileHandler = fopen("showtime.jpg", "rb");
	$compData = "";
	
	while (!feof($sourceFileHandler)) {
		// get chunk string
		$tmpData = fread($sourceFileHandler, $chunkSize);

		// compress
		$tmpCpData = gzdeflate($tmpData);

		// add to compressed data
		$compData .= $tmpCpData;
	}

	// close file
	fclose ($sourceFileHandler);
	
	header('Content-Type: image/jpeg');
	echo(gzinflate($compData));
?>
And about streaming the zip file: I tried putting the files uncompressed into the archive (header compression method flag set to 0) to test my headers / data descriptors / etc. It works without streaming the archive (GP flag 0, CRC and data lengths defined in the file header) but not when I try to define a streamed archive (GP flag changed, CRC & lengths set to 0 and then defined in the data descriptor)... Or maybe I'm still missing something there too :(

Any help will be greatly appreciated.

*EDIT*: I made more tests on this last part; it seems that even when the GP flag is modified, the header values are still used to decompress the files: they are not bypassed in favour of the data descriptor... Any ideas?
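For reference while debugging this: the streaming flag is bit 3 of the general purpose field, i.e. the value 0x0008. The field is stored little-endian, so the bytes on disk are `08 00`; a byte pair of `00 04`, by contrast, is 0x0400, which sets bit 10, a different flag entirely:

```php
<?php
// Bit 3 (value 0x0008) of the little-endian general purpose flag marks a
// streamed entry; "\x00\x04" decodes to 0x0400 (bit 10), not bit 3.
assert(pack('v', 0x0008) === "\x08\x00");       // correct streamed-entry flag bytes
$decoded = unpack('v', "\x00\x04");
assert($decoded[1] === 0x0400);                 // a different bit altogether
?>
```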

Posted: Tue Mar 28, 2006 8:31 am
by benkor
A little update: the streaming now works! Yay!
I was making a mistake when changing the general purpose flag: I wasn't setting the right bit, so the decompressors didn't see the file as a streamed one and still used the first header's values...

And about gzdeflating chunks and then trying to re-inflate the whole data file at once: I think I have to change the first data block's bit that indicates it's the last data block of the current stream (I remember reading something along those lines). I'll try different things and update this thread, in case it helps anyone.
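That reading matches the behaviour observed earlier in the thread: each gzdeflate() call emits a complete, self-terminated deflate stream (its last block carries the final-block bit), which is why a single gzinflate() over the concatenation only yields the first piece. A quick check that each piece is a valid stream on its own:

```php
<?php
// Each gzdeflate() output is a self-terminated raw deflate stream, so the
// pieces must be inflated separately (or produced with one shared deflate
// context, as with the incremental zlib API) to recover everything.
$a = gzdeflate("chunk-one-");
$b = gzdeflate("chunk-two");
assert(gzinflate($a) === "chunk-one-");
assert(gzinflate($a) . gzinflate($b) === "chunk-one-chunk-two");
?>
```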