Open html file, download images

Posted: Mon Feb 18, 2013 5:51 am
by JKM
Hi there!

I want to open an HTML file (http://xx.se/album123.html), parse it, find all links that contain "/pix.php?source=", and download (wget-style) the URL that follows source=, for example:
"/pix.php?source=http://xx.se/albumfiles/Jfnb83njHfm2kJ.jpg"

- There may be multiple image links in the HTML file.

Re: Open html file, download images

Posted: Mon Feb 18, 2013 6:53 am
by s.dot
Is this legal behavior?

Anyway, you would open the file with file_get_contents() (among other ways; note that fetching a URL this way requires allow_url_fopen to be enabled).
Parse the file for links using a regular expression with preg_match_all().
Loop through the matched links and download each matched file, again using file_get_contents() or a similar function.
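
To make the matching step concrete, here is a rough sketch of just the link extraction, run against a sample string modelled on the link in the question. The function name extractSourceLinks() and the exact pattern are my own assumptions, not anything the site defines, so adjust them to the real markup.

```php
<?php

// Hypothetical helper: returns every URL that follows "pix.php?source=" in $html.
function extractSourceLinks(string $html): array
{
    // Capture everything after "source=" up to a quote, whitespace, or ">".
    $pattern = '/pix\.php\?source=([^"\'\s>]+)/i';

    // preg_match_all() returns 0 when nothing matches and false on error.
    if (!preg_match_all($pattern, $html, $matches)) {
        return [];
    }

    // $matches[1] holds the first capture group of every match.
    // Decode HTML entities such as &amp; in case the page escapes the URL.
    $links = array_map('html_entity_decode', $matches[1]);

    // Drop duplicates and reindex so the result is a plain list.
    return array_values(array_unique($links));
}

// Sample markup modelled on the link format quoted in the question.
$sample = '<a href="/pix.php?source=http://xx.se/albumfiles/Jfnb83njHfm2kJ.jpg">pic</a>';

print_r(extractSourceLinks($sample));
```

If the page percent-encodes the value after source=, the captured string would also need rawurldecode() before it can be fetched; the sketch assumes it is not encoded, as in the example link.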

What have you tried?

Re: Open html file, download images

Posted: Mon Feb 18, 2013 4:50 pm
by JKM
s.dot wrote: Is this legal behavior? [...] What have you tried?

I see why you might think it's illegal behaviour, but these are my own images that I want to download.

I haven't coded anything for almost two years, and I've always been terrible with RegEx, so I might need some help with that. :p (I just need href="pix.php?source=X")

Thanks :)

Re: Open html file, download images

Posted: Tue Feb 19, 2013 5:24 am
by s.dot
Well, some pseudocode might go a little bit like this:

<?php

// HTML file you want to open (fetching a URL requires allow_url_fopen)
$htmlFile = 'http://www.example.com/page.html';

$htmlFileContents = file_get_contents($htmlFile);

if ($htmlFileContents !== false)
{
    // echo $htmlFileContents; should show the source of the HTML file

    // Attempt to match links: capture everything after "pix.php?source="
    // up to the closing quote, whitespace, or ">"
    $pattern = '/pix\.php\?source=([^"\'\s>]+)/i';
    preg_match_all($pattern, $htmlFileContents, $matches, PREG_SET_ORDER);

    if (!empty($matches))
    {
        // print_r($matches); see what you have here
        foreach ($matches as $match)
        {
            // $match[1] will have the link (the first capture group)
            // use header() to download to client, or grab the file content to write to server
        }
    }
}
It would be something like that; this is the basic structure for what you want. The regular expression is untested and may need adjusting, and I don't know how you want to save the files.
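
If you want to write the files to the server, the body of the loop could look roughly like the sketch below. The helper name saveImage() and the target directory are assumptions for illustration only; it takes the file name from the URL path and stores the fetched bytes under that name.

```php
<?php

// Hypothetical helper: fetches $url and writes it into $dir.
// Returns the saved path, or null if the fetch or the write failed.
function saveImage(string $url, string $dir): ?string
{
    // Use only the file name from the URL path, so a crafted link
    // cannot write outside $dir.
    $name = basename((string) parse_url($url, PHP_URL_PATH));
    if ($name === '' || $name === '.' || $name === '..') {
        return null;
    }

    // Fetching http:// URLs this way requires allow_url_fopen to be enabled.
    $data = @file_get_contents($url);
    if ($data === false) {
        return null;
    }

    // Create the target directory on first use.
    if (!is_dir($dir) && !mkdir($dir, 0755, true)) {
        return null;
    }

    $target = rtrim($dir, '/') . '/' . $name;

    return file_put_contents($target, $data) === false ? null : $target;
}

// Inside a foreach over matches collected with PREG_SET_ORDER:
//     $saved = saveImage($match[1], __DIR__ . '/images');
//     echo $saved === null ? "failed: {$match[1]}\n" : "saved: $saved\n";
```

Two images with the same file name would overwrite each other here; if that can happen in your album, prefix the name with a counter or a hash of the URL.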