MASS Downloading...

PHP programming forum. Ask questions or help people with their PHP code. Don't understand a function? Need help implementing a class? Here is where to ask. Remember to do your homework!

Moderator: General Moderators

jclarkkent2003
Forum Contributor
Posts: 123
Joined: Sat Dec 04, 2004 9:14 pm

MASS Downloading...

Post by jclarkkent2003 »

Hey again all, good night, eh?

lol, alright, I basically need to DOWNLOAD ALL the pictures from a gallery on a website. Don't worry, it's legal; I'm not breaking any copyrights.

Example, the site's url is:

http://www.domain.com/cgi-bin/imageFolio.cgi

and they have 11,000 images in the gallery.

all the pics are in the
http://www.domain.com/pictures/ folder and they are named randomly. I don't want to download the tn_ thumbnail files.

I believe I COULD write this in PHP, but my server would probably crash from crawling their entire site.

Can I install PHP on the Windows XP Home SP2 computer I'm using right now? I use it for everything; is it OK to install PHP? Will it work?

Or do any of you guys know of a better option? Another language, or is there a script like WebReaper out there that actually works?

Thanks.
jclarkkent2003
Forum Contributor
Posts: 123
Joined: Sat Dec 04, 2004 9:14 pm

Post by jclarkkent2003 »

Actually, I know about ereg_replace ( http://us2.php.net/manual/en/function.ereg-replace.php ); I have used it before to turn plain text into links. But how can I CREATE A SPIDER that will crawl the domain:

http://www.domain.com/
and
http://domain.com/

so that it gets ALL the links and keeps going through the entire site, collecting an IMAGE LINK URL from the middle of each page and writing it to a file. I can write it to the file, but how do I collect all the URLs? There is only one image per page.

Example:
<img src="http://www.domain.com/pictures/Akane/XX20.jpg" border=0 alt="XX20.jpg">

is in the middle of the page and has A LOT of HTML around it. How can I follow each and every link on the site and collect ONLY those URLs from the middle of the page?
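One possible shape for the extraction step, as a minimal sketch rather than a definitive answer: the /pictures/ path and the sample <img> tag are taken from the examples above, and everything else is illustrative.

```php
<?php
// Sketch: pull the single gallery image URL out of one page's HTML,
// skipping tn_ thumbnails. Assumes exactly one /pictures/ image per page,
// as described above.
function extractImageUrl($html)
{
    if (preg_match('~<img[^>]+src="([^"]+/pictures/[^"]+)"~i', $html, $m)
        && strpos(basename($m[1]), 'tn_') !== 0) {
        return $m[1];
    }
    return false;
}

$page = '<html>... lots of HTML ...'
      . '<img src="http://www.domain.com/pictures/Akane/XX20.jpg" border=0 alt="XX20.jpg">'
      . '... more HTML ...</html>';
echo extractImageUrl($page); // http://www.domain.com/pictures/Akane/XX20.jpg
?>
```

The basename() check is what filters the tn_ thumbnails: a thumbnail like pictures/Akane/tn_XX20.jpg matches the pattern but fails the prefix test.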

Thanks. I feel like I'm talking to myself, it must be late... lol...
d3ad1ysp0rk
Forum Donator
Posts: 1661
Joined: Mon Oct 20, 2003 8:31 pm
Location: Maine, USA

Post by d3ad1ysp0rk »

If it's your site, just FTP to it and download everything that way.

If not, it's not very nice to waste others' bandwidth like that and take their images...
jclarkkent2003
Forum Contributor
Posts: 123
Joined: Sat Dec 04, 2004 9:14 pm

Post by jclarkkent2003 »

Thanks for the completely irrelevant post~!!!!!

Can anyone really help me?
User avatar
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

I'm sure we have many members who could help you... if you had explained why you need to crawl a website for those images. There are always concerns when someone asks about mass-emailing, web spiders (obviously you're not talking about a search engine here :D ), decoding encrypted JS files and all that stuff... if you know what I mean...
rehfeld
Forum Regular
Posts: 741
Joined: Mon Oct 18, 2004 8:14 pm

Post by rehfeld »

werd :wink:
User avatar
mudkicker
Forum Contributor
Posts: 479
Joined: Wed Jul 09, 2003 6:11 pm
Location: Istanbul, TR
Contact:

Post by mudkicker »

weirdan : 100% right.
Shendemiar
Forum Contributor
Posts: 404
Joined: Thu Jan 08, 2004 8:28 am

Post by Shendemiar »

Getting images from a complex CGI-driven site is only possible with some macro script that emulates mouse actions. Surprisingly, there is no such combination ready-made.

I suggest you study the AutoIt scripting language or something like that, to read the screen and simulate mouse actions.
Last edited by Shendemiar on Sun Dec 05, 2004 2:29 pm, edited 1 time in total.
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

timvw@foo: man wget
jclarkkent2003
Forum Contributor
Posts: 123
Joined: Sat Dec 04, 2004 9:14 pm

Post by jclarkkent2003 »

OK, one of the reasons is that it's a free clipart site, and I don't want to click through 11,000+ images right-clicking and downloading each one. It's a lot easier to sort through them on my own PC than to go through all their pages.

wget is a Linux command only, right? I need something in PHP, unless I can use wget from PHP. But I can just include() the page and sort through it with regular expressions, then save the image URL in the format I gave. I'm just not good at ereg expressions AT ALL. I can FIND the tag, but how do I EXTRACT the URL from what I find and fwrite it into a .txt file? Then I can download the .txt file and FlashGet all the images.

I basically just need to find out how to EXTRACT the image URL from ALL pages on the site. I don't know how to restrict the spider to crawling only that site, or how to restrict it to only adding images under http://www.domain.com/images/folder/. I don't care about filtering out the tn_ files right now, but that would be handy.

http://us2.php.net/manual/en/function.ereg.php

I appreciate all help I can get.

Thanks~!
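For what it's worth, the whole loop described above (crawl only one domain, grab the one image URL per page, fwrite it to a text file for FlashGet) might be sketched roughly like this. It is untested against any real site, domain.com is a placeholder, and the link handling is deliberately naive:

```php
<?php
// Rough sketch of a single-domain spider: fetch pages, record the one
// /pictures/ image URL per page into a text file, and follow only links
// that resolve to the same host.
function crawlForImages($startUrl, $host, $outFile, $maxPages = 50)
{
    $queue = array($startUrl);
    $seen  = array($startUrl => true);
    $out   = fopen($outFile, 'w');

    while ($queue && $maxPages-- > 0) {
        $url  = array_shift($queue);
        $html = @file_get_contents($url);
        if ($html === false) {
            continue;
        }

        // The one image per page: keep it unless it's a tn_ thumbnail.
        if (preg_match('~src="([^"]*/pictures/[^"]+)"~i', $html, $m)
            && strpos(basename($m[1]), 'tn_') !== 0) {
            fwrite($out, $m[1] . "\n");
        }

        // Queue every link on the page that stays on the same host.
        preg_match_all('~href="([^"]+)"~i', $html, $links);
        foreach ($links[1] as $link) {
            if (strpos($link, 'http') !== 0) {
                $link = 'http://' . $host . '/' . ltrim($link, '/');
            }
            if (parse_url($link, PHP_URL_HOST) === $host && !isset($seen[$link])) {
                $seen[$link] = true;
                $queue[]     = $link;
            }
        }
    }
    fclose($out);
}

// Hypothetical usage:
// crawlForImages('http://www.domain.com/cgi-bin/imageFolio.cgi', 'www.domain.com', 'image-urls.txt');
?>
```

The $seen array is what keeps the spider from looping forever on pages that link to each other, and the host check is what keeps it from wandering off the site.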
rehfeld
Forum Regular
Posts: 741
Joined: Mon Oct 18, 2004 8:14 pm

Post by rehfeld »

jclarkkent2003
Forum Contributor
Posts: 123
Joined: Sat Dec 04, 2004 9:14 pm

Post by jclarkkent2003 »

if (ereg("link=([^\&]{1,100})&image=([^\&]{1,100})", $arrayLinks[$k], $regs)) { echo "$regs[1] $regs[2] <br>"; }

OK, how do I make it so the & sign is escaped, or whatever needs to happen?

http://www.domain.com/cgi-bin/imageFoli ... g&img=&tt=

it should match the "link=Bla-Blah&image=clipart16.jpg" portion of the above url.

I tried escaping it with a \ before the &, but it didn't work. It works when I remove half of the statement; with only "link=([^\&]{1,100})" it outputs "Bla-Blah" like it should.

How do I fix?

Thanks.
jclarkkent2003
Forum Contributor
Posts: 123
Joined: Sat Dec 04, 2004 9:14 pm

Post by jclarkkent2003 »

&

has to be written as:

&amp;

Just found that out through trial and error. It works now, but is that the real way to do it?
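That lines up with what the raw HTML actually contains: in page source, the & between query parameters is usually encoded as &amp;, so the pattern has to look for the encoded form. A small sketch using preg_match instead, with a made-up URL following the earlier example; the (?:&amp;|&) alternation accepts either form of the separator:

```php
<?php
// In fetched HTML the "&" between query parameters is usually "&amp;",
// so match the separator as (?:&amp;|&) to allow either form.
$link = 'imageFolio.cgi?link=Bla-Blah&amp;image=clipart16.jpg&amp;img=&amp;tt=';
if (preg_match('~link=([^&]{1,100})(?:&amp;|&)image=([^&]{1,100})~', $link, $regs)) {
    echo $regs[1] . ' ' . $regs[2]; // Bla-Blah clipart16.jpg
}
?>
```

Note that [^&] already stops each capture at the first &, so no extra escaping of & is needed inside the character class with preg.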
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

jclarkkent2003 wrote:WGET is only for linux command right?
Not right.

jclarkkent2003 wrote: I basically just need to find out how I can EXTRACT the image URL from ALL pages in the site, and I don't know how to restrict the spider to only crawling that site, and how to restrict the spider to only adding images in the http://www.domain.com/images/folder/, I don't care about filtering the tn_ right now but that would be handy.
To get you started, here is a sample from someone who uses curl to emulate wget --mirror:

http://curl.haxx.se/programs/curlmirror.txt
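If you'd rather stay inside PHP, the curl extension can handle the fetching part. A minimal sketch, assuming the extension is installed; the function name is made up for illustration:

```php
<?php
// Minimal page fetcher using PHP's curl extension, as an in-PHP
// alternative to shelling out to wget.
function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $body = curl_exec($ch);
    curl_close($ch);
    return $body; // false on failure
}
?>
```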
User avatar
dull1554
Forum Regular
Posts: 680
Joined: Sat Nov 22, 2003 11:26 am
Location: 42:21:35.359N, 76:02:20.688W

Post by dull1554 »

If you know where all the images are, you could just write a script to crawl the directory and give you a list of all the images. From the manual's user notes:

// I use the function below on my site; it's good if you only want image files in the filename array.
function GetFileList($dirname, $extensoes = FALSE, $reverso = FALSE)
{
    // Extensions of the files you want to see in the array
    if (!$extensoes) {
        $extensoes = array("jpg", "png", "jpeg", "gif");
    }

    $files = array();
    $dir = opendir($dirname);
    while (false !== ($file = readdir($dir))) {
        // Keep the files whose extension is in the array
        for ($i = 0; $i < count($extensoes); $i++) {
            if (eregi("\." . $extensoes[$i] . "$", $file)) {
                $files[] = $file;
            }
        }
    }
    // Close the directory handle
    closedir($dir);
    // Order of the array
    if ($reverso) {
        rsort($files);
    } else {
        sort($files);
    }
    return $files;
}