determining file type of remote file

Posted: Wed Sep 24, 2003 8:07 am
by pootergeist
I've a reciprocal link testing class that evaluates all anchor tags on a page looking for a return link. If none is found on that page, it tests all the found hrefs that are on the domain in question, and so on down to a defined nesting depth.
This all works fine, with some code to handle <base> tags, mailto: links and framesets - it follows the correct pages and stops after a set number of pages or once the linkback is found.

I figured though that it would try to access things like jpegs, exes, zips etc if they were the href of an anchor tag.

My quick workaround was to build an array of bad extensions, array('.jpg','.exe','.zip') and so on, and test each one with strpos - if the extension appears anywhere in the remote URL, that URL is ignored.
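
One catch with a plain strpos test is that it matches anywhere in the URL, so '.zip' would also hit a host such as http://www.zippy.com/. A minimal sketch of a stricter check follows, assuming a PHP version where parse_url() accepts a component argument; the function name has_bad_extension is made up for illustration and is not part of the class described above.

```php
<?php
// Hypothetical helper: true if the URL's path ends in a blacklisted extension.
function has_bad_extension($url, $bad = array('jpg', 'exe', 'zip'))
{
    // Look only at the path, so the host name and query string cannot match.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return false; // no path at all, e.g. http://www.example.com
    }
    // Compare the real extension, case-insensitively.
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return in_array($ext, $bad, true);
}
```

With this, has_bad_extension('http://www.zippy.com/') is false, while a link to /files/photo.JPG is still skipped.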

Is there a better way? Perhaps by getting back a file type such as text/html or text/xml, to force the script to only scan web pages? Can the type of a remote file be tested, and is it dependent upon the servers involved?

Posted: Wed Sep 24, 2003 12:42 pm
by Albright
The "bad extensions" thing is a start, but maybe it would be more efficient to do a "good extensions" array ('.html','.htm','.php', etc). Of course, this will fail with links that don't have a filename ( http://www.spam.com or http://www.google.com/ ) so you'll have to include workarounds in that case.
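
A rough sketch of the "good extensions" idea, including the no-filename case, might look like this. The function name looks_like_page is hypothetical, and treating extensionless paths such as /about as pages is an assumption rather than anything settled in this thread.

```php
<?php
// Hypothetical helper: true if the URL looks like a page worth fetching.
function looks_like_page($url, $good = array('html', 'htm', 'php'))
{
    $path = parse_url($url, PHP_URL_PATH);
    // No path, or a path ending in "/", means the server picks a default document.
    if (!is_string($path) || $path === '' || substr($path, -1) === '/') {
        return true;
    }
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // Assumption: an extensionless name such as /about is treated as a page too.
    return $ext === '' || in_array($ext, $good, true);
}
```

Both http://www.spam.com and http://www.google.com/ pass this check, while a link ending in .exe does not.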

As for determining the filetype of a remote file, I don't think that's possible from the URL alone. Your best bet might be to download the remote file to the local machine and then run the requisite tests on it.
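
For what it's worth, an HTTP response normally carries a Content-Type header, so one option short of downloading the whole file is to send a HEAD request and read that header. The sketch below assumes PHP 7.1 or later (for the context argument of get_headers()) and allow_url_fopen being enabled; both function names are made up for illustration. The value is whatever the remote server chooses to send, so it does depend on the servers involved.

```php
<?php
// Hypothetical helper: pull the bare media type out of a Content-Type value,
// e.g. "text/html; charset=UTF-8" becomes "text/html".
function media_type($contentType)
{
    $parts = explode(';', $contentType);
    return strtolower(trim($parts[0]));
}

// Hypothetical helper: ask the server for headers only and report the media type.
// Returns null if the request fails or no Content-Type header is sent.
function remote_media_type($url)
{
    $context = stream_context_create(array(
        'http' => array('method' => 'HEAD', 'timeout' => 5),
    ));
    $headers = @get_headers($url, true, $context);
    if ($headers === false) {
        return null;
    }
    // Header names are case-insensitive, so normalise the keys.
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (!isset($headers['content-type'])) {
        return null;
    }
    $value = $headers['content-type'];
    // After redirects the header can appear once per response; keep the last.
    if (is_array($value)) {
        $value = end($value);
    }
    return media_type($value);
}
```

A crawler could then follow a link only when remote_media_type($href) returns 'text/html', and fall back to an extension check when it returns null, since some servers reject HEAD requests or send a misleading type.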

Posted: Thu Sep 25, 2003 7:38 am
by pootergeist
I did weigh up the options of testing for good or testing for bad - figured that with all the possible language extensions (.asp, .cfm, .c, .jsp, etc.) both lists would still grow quite large. Cheers for replying anyway.