MASS Downloading...
Moderator: General Moderators
-
jclarkkent2003
- Forum Contributor
- Posts: 123
- Joined: Sat Dec 04, 2004 9:14 pm
Hey again all, good night, heh?
Alright, I basically need to DOWNLOAD ALL the pictures from a gallery on a website. Don't worry, it's legal; I'm not breaking any copyright.
Example: the site's URL is
http://www.domain.com/cgi-bin/imageFolio.cgi
and they have 11,000 images in the gallery.
All the pics are in the
http://www.domain.com/pictures/ folder, and they are named randomly. I don't want to download the tn_ thumbnail files.
I COULD write this in PHP, I believe, but my server would probably crash from going through their entire site.
Can I install PHP on my Windows XP Home SP2 computer that I am using right now? I use it for everything; is it OK to install PHP? Will it work?
Or do any of you guys know of a better option? Another language, or is there a script like WebReaper out there that actually works?
Thanks.
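For reference, the approach described here (fetch each gallery page, pull out the full-size image URLs, skip the tn_ thumbnails) is small enough to sketch. The thread is about PHP, but the same idea in Python looks roughly like this; domain.com, the /pictures/ folder, and the tn_ prefix come from the post, and everything else is illustrative:

```python
import re
import urllib.request

# Image URLs under /pictures/ on the example domain from the post.
IMG_RE = re.compile(r'src="(http://www\.domain\.com/pictures/[^"]+)"')

def image_urls(html):
    """Return full-size image URLs found in one gallery page,
    dropping any whose filename starts with the tn_ thumbnail prefix."""
    urls = IMG_RE.findall(html)
    return [u for u in urls if not u.rsplit("/", 1)[-1].startswith("tn_")]

def download(url, filename):
    # One request per image; a polite crawler would also sleep between requests.
    with urllib.request.urlopen(url) as resp, open(filename, "wb") as out:
        out.write(resp.read())
```

The collected URL list could then be handed to a bulk downloader (FlashGet or wget, as mentioned later in the thread) instead of calling `download` in a loop.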
-
jclarkkent2003
- Forum Contributor
- Posts: 123
- Joined: Sat Dec 04, 2004 9:14 pm
Actually, I know about ereg_replace ( http://us2.php.net/manual/en/function.ereg-replace.php ); I have used it before to turn plain-text URLs into links. But how can I CREATE A SPIDER that will crawl the domain:
http://www.domain.com/
and
http://domain.com/
so it gets ALL links and keeps going through the entire site, collecting the IMAGE URL in the middle of each page and writing it to a file? I can write it to the file, but how do I collect all the URLs? There is only one image per page.
Example:
<img src="http://www.domain.com/pictures/Akane/XX20.jpg" border=0 alt="XX20.jpg">
is in the middle of the page and has A LOT of HTML around it. How can I search EACH and every page on the site and collect ONLY those URLs in the middle of the page?
Thanks. I feel like I'm talking to myself; it must be late... lol...
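A sketch of that extraction step, in Python for brevity: the `<img>` tag format is taken verbatim from the post, and the code only has to find the one /pictures/ image per page plus the links to follow next. Both regexes are illustrative guesses, not tuned to the real site:

```python
import re

# The single full-size image per page, per the tag format shown in the post.
PIC_RE = re.compile(
    r'<img\s+src="(https?://(?:www\.)?domain\.com/pictures/[^"]+)"', re.I)
# Anchors to follow so the spider can reach every page.
LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.I)

def extract(html):
    """Return (image_url_or_None, list_of_link_hrefs) for one page."""
    m = PIC_RE.search(html)
    links = LINK_RE.findall(html)
    return (m.group(1) if m else None), links
```

However much HTML surrounds the tag, a search for the pattern ignores it; the spider only needs to visit each page once and record the single match.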
-
d3ad1ysp0rk
- Forum Donator
- Posts: 1661
- Joined: Mon Oct 20, 2003 8:31 pm
- Location: Maine, USA
-
jclarkkent2003
- Forum Contributor
- Posts: 123
- Joined: Sat Dec 04, 2004 9:14 pm
I'm sure we have many members who could help you... if you had explained why you need to crawl a website for those images. There are always concerns when someone asks about mass-emailing, web spiders (obviously you're not talking about a search engine here), decoding encrypted JS files and all that stuff... if you know what I mean...
-
Shendemiar
- Forum Contributor
- Posts: 404
- Joined: Thu Jan 08, 2004 8:28 am
Getting images from a complex CGI-driven site is only possible with some macro script that emulates mouse actions. Surprisingly, no such ready-made combination exists.
I suggest you study the AutoIt script language, or something like it, to read the screen and simulate mouse actions.
Last edited by Shendemiar on Sun Dec 05, 2004 2:29 pm, edited 1 time in total.
-
jclarkkent2003
- Forum Contributor
- Posts: 123
- Joined: Sat Dec 04, 2004 9:14 pm
OK, one of the reasons is that it's a free clipart site, and I don't want to click through 11,000+ images, right-clicking and downloading each one; it's also a lot easier to sort through them on my own PC than by going through all their pages.
WGET is a Linux-only command, right? I need something in PHP, unless I can use wget from PHP. But I can just include() the page and sort through it with ereg expressions to save the image URL in the format I gave. I'm just not good at ereg expressions AT ALL: I can FIND the tag, but how do I EXTRACT the URL from what I find and fwrite it into a txt file? Then I can download the txt file and FlashGet all the images.
I basically just need to find out how I can EXTRACT the image URL from ALL pages on the site. I also don't know how to restrict the spider to crawling only that site, or how to restrict it to only adding images under http://www.domain.com/images/folder/. I don't care about filtering the tn_ right now, but that would be handy.
http://us2.php.net/manual/en/function.ereg.php
I appreciate all the help I can get.
Thanks~!
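The two restrictions asked for here (stay on the one site, keep only gallery images) come down to two small predicate checks in whatever language the spider ends up written in. A Python sketch, where the host names and the tn_ prefix are taken from the thread and the rest is assumed:

```python
from urllib.parse import urljoin, urlparse

# Both forms of the host mentioned in the thread.
ALLOWED_HOSTS = {"www.domain.com", "domain.com"}

def same_site(base_url, href):
    """True if href (possibly relative) resolves to one of the allowed hosts,
    so the spider never wanders off to other domains."""
    host = urlparse(urljoin(base_url, href)).netloc
    return host in ALLOWED_HOSTS

def keep_image(url):
    """True for images under /pictures/ whose name lacks the tn_ prefix."""
    path = urlparse(url).path
    name = path.rsplit("/", 1)[-1]
    return path.startswith("/pictures/") and not name.startswith("tn_")
```

URLs that pass `keep_image` can simply be appended to a text file (`open("urls.txt", "a")`) and fed to a bulk downloader afterwards, as the post suggests.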
-
jclarkkent2003
- Forum Contributor
- Posts: 123
- Joined: Sat Dec 04, 2004 9:14 pm
if (ereg("link=([^\&]{1,100})&image=([^\&]{1,100})", $arrayLinks[$k], $regs)) { echo "$regs[1] $regs[2] <br>"; }
OK, how do I make it so the & sign is escaped, or whatever?
http://www.domain.com/cgi-bin/imageFoli ... g&img=&tt=
It should match the "link=Bla-Blah&image=clipart16.jpg" portion of the above URL.
I tried escaping it with a \ before the &, but that didn't work. It works when I remove half of the statement: with only "link=([^\&]{1,100})", it outputs "Bla-Blah" like it should.
How do I fix this?
Thanks.
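A guess, since the real page source isn't shown: & has no special meaning in a regex, so it needs no escaping at all. What usually breaks patterns like this is that HTML source often writes the separator between query parameters as &amp; rather than a bare &, so a pattern expecting the literal & never matches. A pattern that accepts either form, in Python syntax for illustration:

```python
import re

# Accept either a raw "&" or the HTML-encoded "&amp;" between the parameters.
PAT = re.compile(r'link=([^&]{1,100})(?:&amp;|&)image=([^&"]{1,100})')

def parse(query):
    """Return (link, image) captured from the query string, or None."""
    m = PAT.search(query)
    return m.groups() if m else None
```

POSIX ereg has no non-capturing `(?: ... )` group, so a direct port would use a plain group there and read the image name from `$regs[3]` instead of `$regs[2]`.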
-
jclarkkent2003
- Forum Contributor
- Posts: 123
- Joined: Sat Dec 04, 2004 9:14 pm
jclarkkent2003 wrote: WGET is only for linux command right?
Not right.
jclarkkent2003 wrote: I basically just need to find out how I can EXTRACT the image URL from ALL pages in the site, and I don't know how to restrict the spider to only crawling that site, and how to restrict the spider to only adding images in the http://www.domain.com/images/folder/, I don't care about filtering the tn_ right now but that would be handy.
To get you started, here is a sample where someone uses curl to emulate wget --mirror:
http://curl.haxx.se/programs/curlmirror.txt
- dull1554
- Forum Regular
- Posts: 680
- Joined: Sat Nov 22, 2003 11:26 am
- Location: 42:21:35.359N, 76:02:20.688W
If you know where all the images are, couldn't you just write a script to crawl the directory and give you a list of all the images?
From the manual's user notes:
Code:
I use the function below on my site; it is good if you want only image files in the filename array.

function GetFileList($dirname, $extensoes = FALSE, $reverso = FALSE)
{
    if (!$extensoes) // EXTENSIONS OF FILES YOU WANT TO SEE IN THE ARRAY
        $extensoes = array("jpg", "png", "jpeg", "gif");
    $files = array();
    $dir = opendir($dirname);
    while (false !== ($file = readdir($dir)))
    {
        // GET THE FILES ACCORDING TO THE EXTENSIONS IN THE ARRAY
        for ($i = 0; $i < count($extensoes); $i++)
        {
            if (eregi("\." . $extensoes[$i] . "$", $file))
            {
                $files[] = $file;
            }
        }
    }
    // CLOSE THE HANDLE
    closedir($dir);
    // ORDER OF THE ARRAY
    if ($reverso) {
        rsort($files);
    } else {
        sort($files);
    }
    return $files;
}
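For comparison, the same directory filter reads naturally in Python too; the default extension list mirrors the PHP version above, and the function name is made up:

```python
import os

def get_file_list(dirname, extensions=("jpg", "png", "jpeg", "gif"), reverse=False):
    """List files in dirname whose extension matches (case-insensitively,
    like eregi), sorted ascending or descending like the PHP version."""
    files = [f for f in os.listdir(dirname)
             if f.lower().rsplit(".", 1)[-1] in extensions]
    return sorted(files, reverse=reverse)
```

Note that either version only works if the script runs on the server hosting the images; a remote visitor would still need the spider approach discussed above.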