Ok, so:
I can use cURL to get the contents of a page.
I can use regexps to get a list of links.
For an arbitrary page, what is the most general way to resolve relative links into absolute ones? For example, going from "about.html" to "www.site.com/about.html" when I am currently reading "www.site.com/index.html".
I care more about generalization and accuracy than efficiency.
Thank you kindly for help,
alex.
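For illustration, the resolution asked about above can be sketched with Python's standard `urllib.parse.urljoin`, which applies the usual reference-resolution rules rather than string concatenation. The thread does not name a language, so Python is a swapped-in choice here, and the `http://` scheme plus the `docs/` and `cdn.example` URLs are assumptions made up for the example (the question's URLs have no scheme, and `urljoin` needs one on the base).

```python
from urllib.parse import urljoin

# Assumed URL of the page being read; the scheme is added for the example.
page_url = "http://www.site.com/index.html"

def resolve(base_url, href):
    """Turn a link found on the page at base_url into an absolute URL."""
    return urljoin(base_url, href)

# Plain relative link, as in the question.
print(resolve(page_url, "about.html"))        # http://www.site.com/about.html
# Root-relative link: replaces the whole path.
print(resolve(page_url, "/contact.html"))     # http://www.site.com/contact.html
# Already absolute: returned unchanged.
print(resolve(page_url, "http://other.example/x"))
# Scheme-relative link: inherits only the scheme of the base.
print(resolve(page_url, "//cdn.example/lib.js"))
# Parent-directory link from a page in a subdirectory (hypothetical path).
print(resolve("http://www.site.com/docs/index.html", "../about.html"))
```

The point of the sketch is that "prepend the site name" covers only the first case; root-relative, scheme-relative, and `../` links each resolve differently.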
getting absolute addresses: am I oversimplifying?
Moderator: General Moderators
-
kuksenkate
- Forum Newbie
- Posts: 5
- Joined: Mon Jun 01, 2009 12:44 pm
Last edited by kuksenkate on Tue Jun 02, 2009 4:07 pm, edited 1 time in total.
-
kuksenkate
- Forum Newbie
- Posts: 5
- Joined: Mon Jun 01, 2009 12:44 pm
Re: Help writing site crawler
Ok, another way of phrasing it:
Am I oversimplifying by saying that either the base of the relative URLs is the same as the page I am currently reading, OR there is a base address specified with the <base href=....> tag?
Thank you.
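The two cases in the question can be sketched as follows, assuming Python and its standard `html.parser` in place of regexps. The class `LinkCollector` and function `absolute_links` are hypothetical names for this example, not part of any library. One hedged addition to the rule as stated: `page_url` should probably be the URL the page was actually fetched from (after any redirects), and a `<base href>` value may itself be relative, so the sketch resolves it against the page URL first.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the first <base href> and every <a href> in a page."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            # Only the first <base href> is used in this sketch.
            self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)

def absolute_links(page_url, html):
    """Return absolute URLs for all <a href> links found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    # Case 1: no <base> tag, so the page URL is the base.
    # Case 2: a <base href> exists; resolve it against the page URL first.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return [urljoin(base, href) for href in parser.hrefs]

# Made-up page content for the example.
html = ('<html><head><base href="http://www.site.com/docs/"></head>'
        '<body><a href="about.html">About</a><a href="/top.html">Top</a></body></html>')
print(absolute_links("http://www.site.com/index.html", html))
```

With the `<base>` tag present, "about.html" resolves under `/docs/`, while the root-relative "/top.html" still resolves against the host.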