
getting information on all files in a folder (12000+ files)

Posted: Tue Jan 19, 2010 7:33 pm
by buckit
here is the deal... I am working on an ecommerce site and I am building an image manager of sorts.

my online company uses 3 main distributors. each one has their product images online, named with their product ID.

the product images are all stored in a folder on the site (/images/productimages/). the path to the file is stored in the database.

what I am doing now:

1) I get a recordset of all products. I take the path to the image and do a file_exists() check to determine if the product has a broken image (or no image path in the database) or if it's fine. if it finds the image in the folder then it makes certain that it is a valid image file (getimagesize(), then check that it returns an array). at the end, the script outputs how many broken images it found.

2) next you can press a button to "download images from distributor". that button runs a script that creates an array of distributor IDs (each product has that ID in the database) and image URLs. it then does a foreach on that array. the foreach gets a recordset of all products from that distributor and cycles through it in a similar fashion to step 1 above. if it's a broken image then it tries to download it from the distributor... if it doesn't find it then it puts up a "product has no image" image.



my question... how do I make this faster?? is my logic way off? is there a better way of doing this? the productimages folder has over 12,000 images in it... it takes about 176 seconds just to see if there are any broken images or not.

just thought I would ask... am pretty new to PHP.

Re: getting information on all files in a folder (12000+ files)

Posted: Tue Jan 19, 2010 7:36 pm
by Eran
As far as I can tell your logic is not off. It would help to see some actual code, to check whether there are implementation optimizations that can be made.

Re: getting information on all files in a folder (12000+ files)

Posted: Tue Jan 19, 2010 7:42 pm
by buckit
this is the basic code that initially tells if there are any broken images:

Code:

$imageexist = 0;
$imagebroken = 0;
$imagepath = '../images/';
$sql_images = "SELECT products_image FROM products";
$results = $db->Execute($sql_images);
while (!$results->EOF) {
    $name = $results->fields['products_image'];
    // treat an empty path as broken; file_exists('../images/') alone
    // would return true because the directory itself exists
    if (!empty($name) && file_exists($imagepath . $name)) {
        $imageexist++;
    } else {
        $imagebroken++;
    }
    $results->MoveNext();
}
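As a side note (not from the thread): with 12,000+ files, one way to cut the per-file stat calls is to read the directory listing once with scandir() and check names against it in memory. This is a sketch with stubbed inputs; countBroken() and the sample data are hypothetical:

```php
<?php
// Hypothetical helper: count missing images by checking each stored name
// against a directory listing that was loaded once with scandir().
function countBroken(array $imageNames, array $dirListing)
{
    $existing = array_flip($dirListing); // set-style lookup table
    $ok = 0;
    $broken = 0;
    foreach ($imageNames as $name) {
        if ($name !== '' && isset($existing[$name])) {
            $ok++;
        } else {
            $broken++;
        }
    }
    return array($ok, $broken);
}

// In the real script the names would come from the recordset and the
// listing from scandir('../images/'); these values are stubs.
list($ok, $broken) = countBroken(
    array('a.jpg', 'missing.jpg', ''),
    array('.', '..', 'a.jpg', 'b.jpg')
);
// $ok is 1, $broken is 2
```

if the stored paths include a subfolder prefix (like 'ProductImages/'), scan that subfolder instead or strip the prefix before the lookup.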

here is the code that downloads images from distributor:

Code:

if ($_GET['a'] == "getimages") {

    $dist_array = array(
        "distributor1" => array("url" => "http://www.distributor1.com/images/products/",
                                "direct" => "30",
                                "drop"   => "31"),
        "distributor2" => array("url" => "http://www.distributor2.com/pimages/",
                                "direct" => "20",
                                "drop"   => "21"),
        "distributor3" => array("url" => "http://images.distributor3.net/prodpics/",
                                "direct" => "40",
                                "drop"   => "41")
    );

    $imagedownloaded = 0;
    $noimage = 0;
    $imagepath = "../images/";
    foreach ($dist_array as $dist => $val) {
        $sql_images = "SELECT products_image, products_model FROM products WHERE (product_distributor = " . $val['direct'] . " OR product_distributor = " . $val['drop'] . ")";
        $results = $db->Execute($sql_images);
        while (!$results->EOF) {
            $db_image = $results->fields['products_image'];
            if (empty($db_image) || !file_exists($imagepath . $db_image)) {
                if (empty($db_image)) {
                    // fall back to the model number when no path is stored,
                    // saving under ProductImages/ like the stored paths
                    $img_name = $results->fields['products_model'];
                    $db_image = 'ProductImages/' . $img_name;
                } else {
                    $img_name = str_replace('ProductImages/', '', $db_image);
                }

                $image_url = $val['url'] . $img_name;
                $img = getimagesize($image_url);
                if (!is_array($img)) {
                    $noimage++;
                } else {
                    file_put_contents($imagepath . $db_image, file_get_contents($image_url));
                    $imagedownloaded++;
                }
            }
            $results->MoveNext();
        }
    }
}

Re: getting information on all files in a folder (12000+ files)

Posted: Tue Jan 19, 2010 8:13 pm
by buckit
I think I have been looking at one part of this for too long... I completely forgot that I enabled this part to run on load! it goes through ALL files in the folder to determine if the x and y dimensions are equal... it reports how many of them are NOT equal and lets you run a script to auto-resize them so they are square.

I think this part is what's making it take so long to load... I previously had this part run on command rather than on load... forgot I did that :)

here is that code:

Code:

//find images that need resized
$file_count = 0;
if ($handle = opendir($path)) {
    while (($file = readdir($handle)) !== false) {
        if ($file === '.' || $file === '..') {
            continue; // skip the directory entries readdir() always returns
        }
        $imgpath = $path . "/" . $file;
        $image_info = getimagesize($imgpath);
        if (is_array($image_info)) {
            $fileimg = false; // stays false for mime types not handled below
            switch ($image_info['mime']) {
                case 'image/jpeg': $fileimg = @imagecreatefromjpeg($imgpath); break;
                case 'image/gif':  $fileimg = @imagecreatefromgif($imgpath);  break;
            }
            if ($fileimg) {
                if (imagesx($fileimg) != imagesy($fileimg)) {
                    $file_count++;
                }
                imagedestroy($fileimg); // free memory before the next file
            }
        }
    }
    closedir($handle);
}
 
is there a better way to do that?

Re: getting information on all files in a folder (12000+ files)

Posted: Tue Jan 19, 2010 8:20 pm
by Eran
Regarding the image downloading script, it doesn't look like there is much room to optimize here. The only major thing I can think of is downloading the images first and then running getimagesize() on the local copy, since I believe it has to download the image anyway to get information on it. As it stands, all provider images are downloaded twice (I could be wrong, though, and getimagesize() may somehow get its information without downloading the entire image).

As for your x-y comparison script, you should lose all of the '@' operators (error suppression). What those do in practice is turn off error reporting and then turn it back on - if you want to suppress errors for your entire script you should do it only once at the beginning. That being said, it would probably not have a major impact on the performance of the script. When dealing with so many files, the process is bound to be choked by hard-drive performance, which is much slower than CPU / memory.
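On the dimension check itself (an alternative not raised in the thread): getimagesize() already returns the width at index 0 and the height at index 1, so the GD calls can be skipped entirely when all you need is a comparison. isNonSquare() and countNonSquare() here are made-up helper names:

```php
<?php
// Hypothetical helper: decide from getimagesize() output whether an image
// needs resizing. $info is whatever getimagesize() returned.
function isNonSquare($info)
{
    // getimagesize() returns false for unreadable files;
    // index 0 is the width, index 1 the height
    return is_array($info) && $info[0] !== $info[1];
}

// Usage sketch against a folder; skips . and .. like any readdir loop should.
function countNonSquare($path)
{
    $count = 0;
    foreach (scandir($path) as $file) {
        if ($file === '.' || $file === '..') {
            continue;
        }
        if (isNonSquare(@getimagesize($path . '/' . $file))) {
            $count++;
        }
    }
    return $count;
}
```

this avoids decoding 12,000 full images into memory just to read two numbers, which should help regardless of the disk bottleneck.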

Re: getting information on all files in a folder (12000+ files)

Posted: Tue Jan 19, 2010 8:26 pm
by buckit
pytrin wrote:Regarding the image downloading script, it doesn't look like there is much room to optimize here. The only major thing I can think of is downloading the images first and then running getimagesize() on the local copy, since I believe it has to download the image anyway to get information on it. As it stands, all provider images are downloaded twice (I could be wrong, though, and getimagesize() may somehow get its information without downloading the entire image).
Thanks for the advice! I did it this way because file_put_contents() will create a blank file if no image exists at the given location. so if I download first and then do getimagesize() I will have to delete that file if it isn't an image (I know... not a big deal). is there a better way to tell if an image exists at a URL without downloading content? I don't suppose there would be.
pytrin wrote: As for your x-y comparison script, you should lose all of the '@' operators (error suppression). What those do in practice is turn off error reporting and then turn it back on - if you want to suppress errors for your entire script you should do it only once at the beginning. That being said, it would probably not have a major impact on the performance of the script. When dealing with so many files, the process is bound to be choked by hard-drive performance, which is much slower than CPU / memory.
you are 100% correct on the bottleneck! I just wanted to be sure I was using the best code to do this... everything I know about PHP I have learned from google... this is the first time I have ever asked another coder anything about PHP. (I think I have done pretty good with google so far! :) )

thanks for all your help! hopefully I can help someone else out on this board sometime! :)

Re: getting information on all files in a folder (12000+ files)

Posted: Tue Jan 19, 2010 8:40 pm
by Eran
is there a better way to tell if an image exists at a URL without downloading content? I don't suppose there would be.
You could use get_headers() and check the HTTP status (a 404 means a broken link, etc.).
This would also remove the need for the error suppression operator, since get_headers() does not throw an error on a missing link.
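A minimal sketch of the get_headers() idea; the helper names are made up, and the HEAD-request context is an optional extra so the server never sends the file body:

```php
<?php
// Returns true when an HTTP status line reports a 2xx response.
function statusLineOk($statusLine)
{
    return (bool) preg_match('#^HTTP/\S+\s+2\d\d#', $statusLine);
}

// Check whether an image exists at a URL without fetching its contents.
function remoteImageExists($url)
{
    // Ask for HEAD so the server responds with headers only.
    stream_context_set_default(array('http' => array('method' => 'HEAD')));
    $headers = @get_headers($url); // false if the request fails outright
    return $headers !== false && statusLineOk($headers[0]);
}
```

$headers[0] is the status line, e.g. "HTTP/1.1 200 OK" or "HTTP/1.1 404 Not Found", so a simple match on a 2xx code is enough to decide whether to bother downloading.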

Re: getting information on all files in a folder (12000+ files)

Posted: Tue Jan 19, 2010 8:49 pm
by buckit
pytrin wrote:
is there a better way to tell if an image exists at a URL without downloading content? I don't suppose there would be.
You could use get_headers() and check the HTTP status (a 404 means a broken link, etc.).
This would also remove the need for the error suppression operator, since get_headers() does not throw an error on a missing link.

the only reason I used the error suppression operator is because of possible corrupt images... I had one come up the other day. imagecreatefromjpeg() and imagesx()/imagesy() would throw an error on that file and stop the script. getimagesize() would not catch the bad image, as it was intact enough to return an array. so I suppress the errors and handle the bad file if those functions fail.

Re: getting information on all files in a folder (12000+ files)

Posted: Tue Jan 19, 2010 8:54 pm
by Eran
Error suppression is for display purposes only. You can't really suppress fatal errors (errors that would stop the script), just stop them from showing on screen.

Re: getting information on all files in a folder (12000+ files)

Posted: Tue Jan 19, 2010 9:00 pm
by buckit
pytrin wrote:Error suppression is for display purposes only. You can't really suppress fatal errors (errors that would stop the script), just stop them from showing on screen.

sorry... that's what I meant... mind is slowly melting tonight :) I only did it on that part because I actually want to know if there are other errors so I can correct the script if needed... in this case it just needs to delete the file and start over.