Saving a cached copy of a link

Posted: Wed Apr 25, 2007 2:35 pm
by mrhoopz
OK, I have a website where users post summary information about an article along with a link to the actual article. Anyone can then search these posts. Some of the links inevitably go dead, so I would like to save a copy of each linked article on the server. That way, if the link is dead, a user can click something like 'View Archived Content' and see the original article.

PDFs are no problem. HTML is. I tried using PHP to download the HTML file that the link points to, but then you don't get the images. What I'd like to do is something similar to what a browser does when you save a web page along with all of its images, but I obviously need to do it in PHP.

Any suggestions would be greatly appreciated, thanks!

Posted: Wed Apr 25, 2007 2:55 pm
by Burrito
You could use file_get_contents() to fetch the HTML, then parse the string and look for images. You can then use file_get_contents() again for each image and save them to your server.
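A minimal sketch of the fetch-and-save step, assuming PHP 7 or later and that you already have a list of absolute image URLs. The function name save_remote_files() and the md5-based file naming are made up for illustration, not anything built in:

```php
<?php
// Hypothetical helper: download each URL and store the result in $dir.
// Returns a map of source URL => local file path for the downloads that succeeded.
function save_remote_files(array $urls, string $dir): array
{
    if (!is_dir($dir) && !mkdir($dir, 0755, true)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }

    $saved = [];
    foreach ($urls as $url) {
        // file_get_contents() returns false on failure (dead link, timeout, ...).
        $data = @file_get_contents($url);
        if ($data === false) {
            continue;
        }

        // Name the file after a hash of the URL so two images never collide,
        // and keep the original extension when there is one.
        $ext  = pathinfo((string) parse_url($url, PHP_URL_PATH), PATHINFO_EXTENSION);
        $name = md5($url) . ($ext !== '' ? '.' . $ext : '');
        $path = rtrim($dir, '/') . '/' . $name;

        if (file_put_contents($path, $data) !== false) {
            $saved[$url] = $path;
        }
    }
    return $saved;
}
```

The page itself can be saved the same way by passing its URL in the list. Fetching http:// URLs with file_get_contents() needs allow_url_fopen enabled, and relative image paths would have to be turned into absolute URLs against the page address before being passed in.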

Keep in mind that if you're getting this stuff from an external site, you need to make sure you have permission to save the data (images) to your site, so as to avoid any copyright violations.

Posted: Wed Apr 25, 2007 3:40 pm
by mrhoopz
I'm aware of the copyright issues here, and viewing the cached content will only be available to users on my local intranet, and not to anyone outside of it.

Thanks for the tip, though. It looks like it should work, although I'm not sure of the best way to parse the string to look for images.

Posted: Wed Apr 25, 2007 3:43 pm
by Burrito
Use a regular expression to find each <img> tag and pull out the value of its src attribute.
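Something along these lines might do it, as a rough sketch assuming PHP 7 or later. The function name extract_img_srcs() is made up, and the pattern only handles src values wrapped in single or double quotes:

```php
<?php
// Hypothetical helper: return the src value of every <img> tag found in $html.
function extract_img_srcs(string $html): array
{
    // Case-insensitive match on <img ... src="..."> or src='...'.
    // Requiring whitespace before "src" skips attributes such as data-src.
    $pattern = '/<img\b[^>]*?\ssrc\s*=\s*(["\'])(.*?)\1/is';

    if (!preg_match_all($pattern, $html, $matches)) {
        return [];
    }

    // Decode entities such as &amp; and drop duplicate URLs.
    $srcs = array_map('html_entity_decode', $matches[2]);
    return array_values(array_unique($srcs));
}
```

For example, extract_img_srcs('<img alt="x" src="a.png">') gives back an array containing 'a.png'. A regex like this is fine for reasonably tidy markup. For messier pages, PHP's DOMDocument class with getElementsByTagName('img') is a sturdier alternative.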