
getting absolute addresses: am I oversimplifying?

Posted: Tue Jun 02, 2009 3:38 pm
by kuksenkate
Ok, so:

I can use cURL to get the contents of a page.
I can use regexps to get a list of links.

For an arbitrary page, what is the most general way to resolve relative links into absolute ones? For example, going from "about.html" to "www.site.com/about.html" when the page I am currently reading is "www.site.com/index.html".

I care more about generalization and accuracy than efficiency.
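
To make the question concrete, here is a rough sketch in Python (the thread does not name a language, so that choice is an assumption). It leans on the standard library's urllib.parse.urljoin, which implements the generic relative-reference resolution rules. The helper name absolutize and the example URLs are made up for illustration, and the sketch assumes the page URL carries a scheme such as "http://".

```python
from urllib.parse import urljoin


def absolutize(page_url, links):
    """Resolve each link against the URL of the page it was found on.

    `absolutize` is a hypothetical helper; `page_url` is assumed to be
    a full URL including its scheme.
    """
    return [urljoin(page_url, link) for link in links]


# Hypothetical page URL and links, chosen to cover the common cases.
page = "http://www.site.com/dir/index.html"
links = [
    "about.html",           # relative to the current directory
    "../up.html",           # climbs one directory
    "/root.html",           # relative to the site root
    "//cdn.site.com/a.js",  # scheme-relative: keeps the page's scheme
    "http://other.com/x",   # already absolute: returned unchanged
]
resolved = absolutize(page, links)

for link, url in zip(links, resolved):
    print(link, "->", url)
```

Letting a library do the resolution favours accuracy over hand-rolled string concatenation, since cases like "../" and scheme-relative links are easy to get wrong.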

Thank you kindly for the help,
alex.

Re: Help writing site crawler

Posted: Tue Jun 02, 2009 4:02 pm
by kuksenkate
Ok, another way of phrasing it:

Am I oversimplifying by saying that the base for relative URLs is either the URL of the page I am currently reading, or, if the page has one, the address specified by a <base href=....> tag?
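
The two-case rule can be sketched like this, again in Python with only the standard library. The names LinkCollector and absolute_links are hypothetical, the sketch uses html.parser rather than regexps to pull out the tags, and it only looks at <a href> links. Two details that the simple rule glosses over are handled as assumptions in the comments: the base href may itself be relative, and only the first <base> tag is taken into account. The page URL passed in should be the one actually fetched, that is, the final URL after any redirects.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects <a href> values and the first <base href> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is None:
            return
        if tag == "base":
            # Only the first <base href> is used; later ones are ignored.
            if self.base_href is None:
                self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)


def absolute_links(page_url, html):
    """Return absolute URLs for every <a href> found in `html`."""
    parser = LinkCollector()
    parser.feed(html)
    # Case 1: no <base> tag, so the page's own URL is the base.
    # Case 2: a <base href> exists; it may itself be relative, so it is
    # resolved against the page URL first.
    if parser.base_href:
        base = urljoin(page_url, parser.base_href)
    else:
        base = page_url
    return [urljoin(base, href) for href in parser.hrefs]


# Hypothetical documents for illustration.
plain = '<html><body><a href="about.html">About</a></body></html>'
with_base = (
    '<html><head><base href="http://cdn.site.com/docs/"></head>'
    '<body><a href="about.html">About</a><a href="/x">X</a></body></html>'
)

print(absolute_links("http://www.site.com/index.html", plain))
print(absolute_links("http://www.site.com/index.html", with_base))
```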

Thank you.