MASS Downloading...
Posted: Sat Dec 04, 2004 9:53 pm
by jclarkkent2003
Hey again all, good night, heh?
lol, aight, I basically need to download ALL the pictures from a gallery on a website. Don't worry, it's legal; I'm not breaking any copyrights.
Example, the site's url is:
http://www.domain.com/cgi-bin/imageFolio.cgi
and they have 11,000 images in the gallery.
all the pics are in the
http://www.domain.com/pictures/ folder and they are named randomly. I don't want to download the tn_ thumbnail files.
I could write this in PHP, I believe, but my server would probably crash from going through their entire site.
Can I install PHP on my Windows XP Home SP2 computer that I'm using right now? I use it for everything; is it OK to install PHP? Will it work?
Or do any of you know of a better option? Another language, or a script like WebReaper out there that actually works?
Thanks.
Posted: Sat Dec 04, 2004 10:10 pm
by jclarkkent2003
Actually, I know about ereg_replace (http://us2.php.net/manual/en/function.ereg-replace.php); I have used it before to turn plain-text URLs into links. But how can I create a spider that will crawl the domain:
http://www.domain.com/
and
http://domain.com/
so that it gets ALL the links and keeps going through the entire site, collecting an image URL from the middle of each page and writing it to a file. I can write it to the file, but how do I collect all the URLs? There is only one image per page.
Example:
<img src="http://www.domain.com/pictures/Akane/XX20.jpg" border=0 alt="XX20.jpg">
is in the middle of the page and has a lot of HTML around it. How can I search each and every page on the site and collect only those URLs from the middle of the page?
Thanks. I feel like I'm talking to myself; it must be late... lol...
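A minimal sketch of pulling that URL out of one page, assuming the markup looks like the quoted snippet (preg_match is used here rather than ereg; the domain.com path and the tn_ prefix are taken from the posts above, so adjust them to the real site):

```php
<?php
// Hypothetical page HTML; in practice this comes from fetching the gallery page.
$html = '<td><img src="http://www.domain.com/pictures/Akane/XX20.jpg" border=0 alt="XX20.jpg"></td>';

// Capture the src of the one gallery image, skipping tn_ thumbnail files.
if (preg_match('#<img src="(http://www\.domain\.com/pictures/[^"]+)"#i', $html, $m)
    && strncmp(basename($m[1]), 'tn_', 3) != 0) {
    echo $m[1]; // the full image URL
}
?>
```

Using basename() keeps the tn_ test simple even when the images sit in subfolders like /pictures/Akane/.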
Posted: Sat Dec 04, 2004 11:11 pm
by d3ad1ysp0rk
If it's your site, just FTP to it and download everything that way.
If not, it's not very nice to waste others' bandwidth like that and take their images...
Posted: Sun Dec 05, 2004 12:27 am
by jclarkkent2003
Thanks for the completely irrelevant post!
Can anyone really help me?
Posted: Sun Dec 05, 2004 1:17 am
by Weirdan
I'm sure we have many members who could help you... if you had explained why you need to crawl a website for those images. There are always concerns when someone asks about mass-emailing, web spiders (obviously you're not talking about a search engine here), decoding encrypted JS files and all that stuff... if you know what I mean...
Posted: Sun Dec 05, 2004 2:07 am
by rehfeld
werd

Posted: Sun Dec 05, 2004 2:18 am
by mudkicker
weirdan : 100% right.
Posted: Sun Dec 05, 2004 2:43 am
by Shendemiar
Getting images from a complex CGI-driven site is only possible with some macro script that emulates mouse actions. Surprisingly, there is no such combination ready-made.
I suggest you study the AutoIt scripting language or something like that, to read the screen and simulate mouse actions.
Posted: Sun Dec 05, 2004 9:09 am
by timvw
timvw@foo: man wget
Posted: Sun Dec 05, 2004 11:59 am
by jclarkkent2003
OK, one of the reasons is that it's a free clipart site, and I don't want to click through 11,000+ images right-clicking and downloading each one; it's a lot easier to sort through them on my own PC than by going through all their pages.
wget is only a Linux command, right? I needed something in PHP, unless I can use wget from PHP. But I can just include() the page, sort through it with regular expressions, and save the image URL in the format I gave. I'm just not good at regular expressions at all: I can FIND the match, but how do I EXTRACT the URL from what I find and fwrite() it into a txt file? Then I can download the txt file and FlashGet all the images.
I basically just need to find out how I can extract the image URL from ALL pages on the site. I don't know how to restrict the spider to crawling only that site, or how to restrict it to only adding images in
http://www.domain.com/images/folder/. I don't care about filtering out the tn_ files right now, but that would be handy.
http://us2.php.net/manual/en/function.ereg.php
I appreciate all help I can get.
Thanks~!
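Putting those pieces together, here is a rough sketch of that kind of spider: it stays on one domain, grabs the single image URL per page, and fwrite()s it to a txt file. The domain, the output file name (images.txt), and the one-image-per-page markup are all placeholder assumptions from the posts above; it also assumes allow_url_fopen is enabled.

```php
<?php
// Pull the one gallery image URL out of a page's HTML (skips tn_ thumbnails).
function extract_image_url($html)
{
    if (preg_match('#<img src="(http://www\.domain\.com/pictures/[^"]+)"#i', $html, $m)
        && strncmp(basename($m[1]), 'tn_', 3) != 0) {
        return $m[1];
    }
    return false;
}

// Collect only links that stay on domain.com, so the spider never leaves the site.
function extract_site_links($html)
{
    preg_match_all('#<a href="(http://(?:www\.)?domain\.com/[^"]+)"#i', $html, $m);
    return $m[1];
}

// Breadth-first crawl: fetch each page once, log its image URL, queue its links.
function crawl($start, $outfile)
{
    $queue = array($start);
    $seen  = array();
    $out   = fopen($outfile, 'w');
    while (!empty($queue)) {
        $url = array_shift($queue);
        if (isset($seen[$url])) continue;
        $seen[$url] = true;
        $html = @file_get_contents($url);
        if ($html === false) continue;
        $img = extract_image_url($html);
        if ($img !== false) fwrite($out, $img . "\n");
        foreach (extract_site_links($html) as $link) {
            if (!isset($seen[$link])) $queue[] = $link;
        }
    }
    fclose($out);
}

// crawl('http://www.domain.com/cgi-bin/imageFolio.cgi', 'images.txt');
?>
```

The $seen array is what keeps the spider from looping forever on pages that link to each other.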
Posted: Sun Dec 05, 2004 12:02 pm
by rehfeld
Posted: Sun Dec 05, 2004 12:52 pm
by jclarkkent2003
if (ereg("link=([^\&]{1,100})&image=([^\&]{1,100})", $arrayLinks[$k], $regs)) { echo "$regs[1] $regs[2] <br>"; }
OK, how do I make it so the & sign is escaped, or whatever?
http://www.domain.com/cgi-bin/imageFoli ... g&img=&tt=
It should match the "link=Bla-Blah&image=clipart16.jpg" portion of the above URL.
I tried escaping it with a \ before the &, but it didn't work. It works when I remove half of the statement: with only "link=([^\&]{1,100})" it outputs "Bla-Blah" like it should.
How do I fix it?
Thanks.
Posted: Sun Dec 05, 2004 1:01 pm
by jclarkkent2003
The & has to be written as &amp; in the pattern, since that's how it appears in the page's HTML source.
Just found that out through trial and error; it works now, but is that the real way?
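Yes, that's the real reason: inside the page's HTML source, a & in a URL is stored as the entity &amp;, so the pattern has to match &amp; literally, or the string has to be entity-decoded first. A small sketch of the decode-first approach (preg_match shown here; the sample query string is made up from the post above):

```php
<?php
// The raw attribute value as it appears in the HTML source, with &amp; entities.
$link = 'link=Bla-Blah&amp;image=clipart16.jpg&amp;img=&amp;tt=';

// Decode the entities, then a plain & in the pattern matches as expected.
$decoded = html_entity_decode($link); // "link=Bla-Blah&image=clipart16.jpg&img=&tt="
if (preg_match('/link=([^&]{1,100})&image=([^&]{1,100})/', $decoded, $regs)) {
    echo $regs[1] . ' ' . $regs[2]; // prints "Bla-Blah clipart16.jpg"
}
?>
```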
Posted: Sun Dec 05, 2004 5:00 pm
by timvw
jclarkkent2003 wrote:WGET is only for linux command right?
Not right.
jclarkkent2003 wrote:
I basically just need to find out how I can EXTRACT the image URL from ALL pages in the site, and I don't know how to restrict the spider to only crawling that site, and how to restrict the spider to only adding images in the
http://www.domain.com/images/folder/, I don't care about filtering the tn_ right now but that would be handy.
To get you started, here is a sample from someone who uses curl to emulate wget --mirror:
http://curl.haxx.se/programs/curlmirror.txt
Posted: Sun Dec 05, 2004 5:07 pm
by dull1554
If you know where all the images are, couldn't you just write a script to crawl the directory and give you a list of all the images?
From the manual's user notes:
Code: Select all
I use the function below on my site; it's good if you want only image files in the filename array.
function GetFileList($dirname, $extensoes = FALSE, $reverso = FALSE)
{
    // Extensions of the files you want in the returned array
    if (!$extensoes)
        $extensoes = array("jpg", "png", "jpeg", "gif");

    $files = array();
    $dir = opendir($dirname);

    while (false !== ($file = readdir($dir)))
    {
        // Keep only the files whose extension is in the array
        for ($i = 0; $i < count($extensoes); $i++)
        {
            if (eregi("\." . $extensoes[$i] . "$", $file))
            {
                $files[] = $file;
            }
        }
    }

    // Close the handle
    closedir($dir);

    // Order of the array
    if ($reverso) {
        rsort($files);
    } else {
        sort($files);
    }
    return $files;
}