URL format issue
Posted: Thu May 31, 2007 5:46 pm
by pedrotuga
I am developing my own web spider and I got stuck pretty much at the beginning.
This might be more of an HTTP question than a PHP one.
Here is the problem:
In order to follow relative links like
href="somepage.html"
I need to know whether the URL I downloaded the page from refers to a directory or a file.
For example, both of the following URLs would return the same page:
http://mydomain.com/path
http://mydomain.com/path/
If I find links like the one I mentioned above, they would only take me to the right destination in one of the two cases.
How do I get around this problem?
In other words: how do I check the path of a given URL?
Thanks
Posted: Thu May 31, 2007 6:37 pm
by superdezign
parse_url() may be of interest.
Posted: Fri Jun 01, 2007 6:53 am
by pedrotuga
That function is practical, but the problem I mentioned above still exists anyway.
Posted: Fri Jun 01, 2007 7:14 am
by superdezign
Then maybe you should simplify what you're after with some code. Sometimes, it's easier to understand than your explanation.
Posted: Fri Jun 01, 2007 12:11 pm
by pedrotuga
superdezign wrote:Then maybe you should simplify what you're after with some code. Sometimes, it's easier to understand than your explanation.
If I don't know how to approach the problem, what code would I post?
OK, I can show how the problem came up:
Code:
$url = "http://example.com/path";
$url_pieces = parse_url($url);
$page = file_get_contents($url);
// here goes some regex and stuff to fetch the links into $links
// and some other checking
// now here comes the trouble when I try to rebuild relative links
foreach ($links as $link) {
    if ($link[0] == "/") {
        $link = "http://" . $url_pieces["domain"] . $link;
    } elseif (preg_match("#^http://#", $link) == 1) {
        // well... nothing to do in this case
    } else {
        // here is where I have trouble...
        // how do I know if I should...
        $link = "http://" . $url_pieces["domain"] . $url_pieces["path"] . "/" . $link;
        // or...
        $link = "http://" . $url_pieces["domain"] . $url_pieces["path"] . $link;
        // note that in one case I add a slash and in the other I don't...
    }
}
Related to this...
When my browser downloads a page like:
http://domain.com/path
how does it know when to add a slash at the end and when not to?
Posted: Fri Jun 01, 2007 12:37 pm
by superdezign
I'm not sure, but I don't think your browser adds the slash. I think the server you're connecting to does it. It knows what is and isn't a directory and what does or doesn't have an index file.
Well, firstly, you aren't using parse_url() correctly. There is no 'domain' index. Also, I'm not sure if file_get_contents() will return a file if you just give it a URL without a filename attached to it (however, looking at your code, you'd know better than I do if file_get_contents() worked).
If file_get_contents() doesn't work, it returns false. So, then you'd try different, common values (unless there's a way to determine the index file through PHP) such as index.html, index.php, index.cfm, etc. If it still doesn't work, it's almost safe to assume that the file doesn't exist.
As for using parse_url, you should read the documentation. It gives you the indexes 'scheme', 'host', 'user', 'pass', 'path', 'query', and 'fragment' from a URL, and returns false if the URL is invalid.
You've pretty much got everything that you need aside from access to the opposing servers.
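For reference, a minimal sketch of what parse_url() hands back. The URLs are made up for illustration; the point is only which keys exist and which don't.

```php
<?php
// Example URLs are made up for illustration.
$parts = parse_url('http://example.com/path/page.html?x=1#top');

// Only the components actually present in the URL are returned.
echo $parts['scheme'], "\n";   // http
echo $parts['host'], "\n";     // example.com
echo $parts['path'], "\n";     // /path/page.html
echo $parts['query'], "\n";    // x=1
echo $parts['fragment'], "\n"; // top

// There is no 'domain' key; the host name lives under 'host'.
var_dump(isset($parts['domain'])); // bool(false)

// A URL with nothing after the host has no 'path' key at all.
$bare = parse_url('http://example.com');
var_dump(isset($bare['path'])); // bool(false)
```

Note that the 'path' value, when present, already starts with a slash.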
Posted: Fri Jun 01, 2007 1:29 pm
by pedrotuga
I didn't test that code; I just wrote it to illustrate my problem. The correct index name would be 'host', not 'domain'. Sorry about that.
file_get_contents() will work in this case. If the argument is a URL, it returns the body of the HTTP response, without the headers of course, as a string.
OK... maybe somebody who is familiar with the HTTP protocol could help here.
Does the server answer the same way when it is asked for:
http://mydomain.com/path
and
http://mydomain.com/path/
?
If so, how does my browser know how to handle relative links?
superdezign, as for the server adding a slash... I am not sure about this, but as far as I know the server only replies to requests, and a reply consists of the response headers and the data, so I don't see how the server would add a slash. But this is just me thinking; as I said, I haven't mastered the protocol.
Posted: Fri Jun 01, 2007 1:39 pm
by superdezign
As in, when you go to the server and request http://mydomain.com/path, it will send you to http://mydomain.com/path/, which sends you to http://mydomain.com/path/index.php.
Of course, I'm not positive, but I'm pretty sure that the request is handled by the server, not the browser. Otherwise, servers couldn't serve http://mydomain.com/images/photo.jpg as http://mydomain.com/images/photo (which a lot of them do).
The browser doesn't have information on the file structure of the server, the server does.
Posted: Fri Jun 01, 2007 1:56 pm
by pedrotuga
superdezign, please don't take this as an offence, but I think you are more confused than me. The server can't "send you" anywhere; it only replies to HTTP requests. If you request http://mydomain.com/path, and there is a folder with that name, it will send you the first index file it finds from the list defined in its configuration file (httpd.conf, for example). The server itself can't change the address you request. At most, it can send a redirect response and let the client decide what to do with it.
Please, let's stay focused on my original question... given a URL like
http://domain.com/path , how do I find out if 'path' is a file or a folder?
Posted: Fri Jun 01, 2007 2:03 pm
by superdezign
Well, if file_get_contents() works on a path that ends in a folder, why does it matter?
If file_get_contents() didn't work on folders, then that'd be your solution. I'd suggest regex, but I'm not entirely sure that folders can't have periods in the middle of them.
However, if you'd like to limit the types of files you check (*.php, *.html, *.aspx, etc.), then you could easily regex those values.
Posted: Fri Jun 01, 2007 2:14 pm
by pedrotuga
superdezign wrote:Well, if file_get_contents() works on a path that ends in a folder, why does it matter?
It matters because I need to follow links. When they are relative links I need to rebuild them, and to do that I need to know the path. To know the path, I need to know whether the last piece of the URL refers to a folder or a file.
If file_get_contents() didn't work on folders, then that'd be your solution. I'd suggest regex, but I'm not entirely sure that folders can't have periods in the middle of them.
As far as I know, they very well can. Otherwise that would be a solution. And we have to consider that a lot of sites use mod_rewrite.
However, if you'd like to limit the types of files you check (*.php, *.html, *.aspx, etc.), then you could easily regex those values.
That is another common mistake. The content type in the HTTP protocol is not defined by the extension of the file.
Does the server specify in the response headers whether the request hit a directory?
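On rebuilding relative links: the URI specification (RFC 3986) resolves a relative reference from the text of the base URL alone. Everything after the last "/" in the base path is dropped and the link is appended, so the client never needs to know whether the last segment is a file or a folder. The base should be the URL the page was finally served from, after any redirects. Below is a simplified sketch of that rule; resolveLink() is a hypothetical helper, not a built-in, and it deliberately ignores "../", "./", ports, and query-only or "//host" links.

```php
<?php
/**
 * Resolve $link against $baseUrl by keeping the base path up to its
 * last "/" and appending the link (simplified RFC 3986 merge rule).
 * Hypothetical helper for illustration; no dot-segment handling.
 */
function resolveLink(string $baseUrl, string $link): string
{
    // Already absolute (has a scheme): nothing to do.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link) === 1) {
        return $link;
    }

    $base = parse_url($baseUrl);
    $root = $base['scheme'] . '://' . $base['host'];

    // Root-relative link: keep only the scheme and host of the base.
    if ($link !== '' && $link[0] === '/') {
        return $root . $link;
    }

    // A base URL with no path is treated as the root directory.
    $path = isset($base['path']) ? $base['path'] : '/';

    // Keep the base path up to and including its last slash.
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $root . $dir . $link;
}

echo resolveLink('http://mydomain.com/path', 'somepage.html'), "\n";
// http://mydomain.com/somepage.html
echo resolveLink('http://mydomain.com/path/', 'somepage.html'), "\n";
// http://mydomain.com/path/somepage.html
```

The two calls show why the trailing slash matters: the same link resolves to different pages depending on which of the two base URLs the page was served from.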
Posted: Fri Jun 01, 2007 2:50 pm
by superdezign
I tried messing around with the headers and no, the server doesn't give any clue to whether or not you're in a directory or on a file. Makes me wonder how accurate the other spiders on the web actually are....
I think I've got a pretty solid solution though... Mixing parse_url() with pathinfo().
Code:
$urlParts = parse_url($url);
if (!isset($urlParts['path']))
{
    // You're in a directory (the main one)
    $relativeDir = $url;
}
else
{
    $pathParts = pathinfo($urlParts['path']);
    if (!isset($pathParts['basename']))
    {
        // You're in a directory
        $relativeDir = $url;
    }
    else
    {
        $relativeDir = $urlParts['scheme'] . '://' . $urlParts['host'] . $pathParts['dirname'];
    }
}
echo $relativeDir;
Try it out on a few URLs and see what you get.
Edit: Not quite. pathinfo() isn't exactly magical, but at least I think we're making progress.
Posted: Fri Jun 01, 2007 3:05 pm
by superdezign
Okay, since it was frustrating me, I played around with it some more.
Code:
$pathParts = pathinfo($url);
if (!isset($pathParts['basename']))
{
    // You're in a directory
    $relativeDir = $url;
}
else if (!isset($pathParts['extension']))
{
    $relativeDir = $pathParts['dirname'] . '/' . $pathParts['basename'];
}
else
{
    $relativeDir = $pathParts['dirname'];
}
echo $relativeDir;
This uses only pathinfo(). It's pretty solid, except when people have files named without extensions. I guess you'll have to come up with a way to combat that.
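To make the weak spot concrete, here is a small sketch of what pathinfo() reports for two made-up URLs: a page with no extension and a directory with a dot in its name. Both URLs are assumptions chosen for illustration.

```php
<?php
// Made-up URLs for illustration.
$page = pathinfo('http://example.com/wiki/Article'); // a page with no extension
$dir  = pathinfo('http://example.com/v1.2/');        // a directory with a dot in its name

// No 'extension' key, so the heuristic would take this page for a directory.
var_dump(isset($page['extension'])); // bool(false)

// An 'extension' is reported, so the heuristic would take this directory for a file.
echo $dir['extension'], "\n"; // 2
```

So the extension test can misfire in both directions, not only for files without extensions.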
Posted: Fri Jun 01, 2007 3:21 pm
by pedrotuga
OK, I will try a couple of URLs...
At this point I am also wondering how accurate the crawlers out there are.
I am still skeptical; there should be a way to find out.
Posted: Fri Jun 01, 2007 3:52 pm
by pedrotuga
OK... I tried it.
There must be a way.
How the heck do I know if the URL refers to a directory or to a file?
Like... if I request a directory on my site, my browser quickly adds a slash at the end.
If I request, for example,
http://en.wikipedia.org/wiki/Article
it doesn't. Why?
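One likely explanation, offered as a sketch rather than a definitive answer: web servers such as Apache typically answer a request for a directory without the trailing slash with a 3xx redirect whose Location header carries the slash version, and the browser simply follows it. A URL like the Wikipedia one is served directly, so nothing changes. The helper below, redirectTarget(), is hypothetical; it inspects a list of raw header lines, the shape that get_headers($url) returns, and the two header arrays are written by hand for illustration.

```php
<?php
/**
 * Return the Location value if the first status line is a 3xx redirect,
 * or null otherwise. $headers is a list of raw header lines.
 * Hypothetical helper for illustration.
 */
function redirectTarget(array $headers): ?string
{
    // The first line is the status line, e.g. "HTTP/1.1 301 Moved Permanently".
    if (!isset($headers[0]) || preg_match('#^HTTP/\S+\s+3\d\d#', $headers[0]) !== 1) {
        return null;
    }
    foreach ($headers as $line) {
        // Header names are case-insensitive.
        if (stripos($line, 'Location:') === 0) {
            return trim(substr($line, strlen('Location:')));
        }
    }
    return null;
}

// Hand-written responses, assumed for illustration.
$directory = [
    'HTTP/1.1 301 Moved Permanently',
    'Location: http://mydomain.com/path/',
];
$page = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html',
];

echo redirectTarget($directory), "\n"; // http://mydomain.com/path/
var_dump(redirectTarget($page));       // NULL
```

A spider can use the redirect target, when there is one, as the base URL for the relative links on the page it ends up downloading.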