I'm writing a simple parser and everything is ready, but there is one thing I cannot understand.
Let's say I have two types of URLs:
1. http://somesite.com/somefolder
2. http://somesite.com/somefile.php
If you type the first URL, which leads to a folder, without the final slash, the browser WILL ADD the trailing slash automatically. How does it know that it should add it?
But if you type the second URL, which leads to a file, the browser doesn't add a trailing slash. How does it know not to?
So in the browser the two URLs will end up looking like this:
1. http://somesite.com/somefolder/
2. http://somesite.com/somefile.php
I thought that browsers look at the extension, e.g. .php. But let's swap things around: add ".php" to the folder's name and remove the extension from the file name. Will browsers still add the trailing slash correctly?
1. http://somesite.com/somefolder.with.fake.ext.php
2. http://somesite.com/some.file.with.no.ext
The answer: yes, they will! Why? How do they know? Does anyone know?
I really need to understand this because when I parse URLs in my own script I work with the contents of downloaded pages, so all the links inside them need to be turned into proper full URLs. That's why it matters whether there is a trailing slash or not.
If anyone understands what I have described here, please help. Thanks!
How IE and FF understand URLs?
Re: How IE and FF understand URLs?
It is not the browser that adds the slash to the URL; it is the web server. Since Apache is the most commonly used server on the web, its documentation may shed some light on why and how it does it:
http://httpd.apache.org/docs/1.3/mod/mod_dir.html
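In a nutshell, when you ask for a URL that does NOT end in a slash and there is no matching document to serve, Apache will look for a directory with the same name as the file you requested and, if it finds one, redirect the client to the same URL with a trailing slash. The browser simply follows that redirect. Therefore: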
http://server.com/dir
Becomes
http://server.com/dir/
At which point, if "dir" exists, Apache will likely serve the directory index for /dir/ (/dir/index.html, if so configured and it exists).
Hope that helps.
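To make the decision concrete, here is a rough sketch of the kind of check a server in the style of mod_dir performs. It is not Apache's actual code: the function name describeResponse, the document root argument and the returned strings are all made up for illustration. The point is that the decision depends on what exists on disk, not on the extension in the URL.

```php
<?php
// Hypothetical helper (not Apache's implementation): describe how a server
// behaving like mod_dir might answer a request for $path under $docRoot.
function describeResponse(string $docRoot, string $path): string
{
    $target = rtrim($docRoot, '/') . $path;

    if (is_file($target)) {
        // A matching document exists, so it is served as is.
        return '200 file';
    }

    if (is_dir($target)) {
        if (substr($path, -1) !== '/') {
            // A directory was requested without the slash: redirect to add it.
            return '301 Location: ' . $path . '/';
        }
        // With the slash, the directory index is served (if so configured).
        return '200 directory index';
    }

    return '404';
}
```

Under this sketch, a directory named somefolder.with.fake.ext.php still gets the redirect and a file named some.file.with.no.ext is still served directly, which matches what you saw in the browser.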
Re: How IE and FF understand URLs?
Oh, you are right!
Let's say I download the page in PHP using
$handle = fopen("http://www.example.com/somefolder", "r");
Is there any way to tell whether it was redirected to the URL with the trailing slash or not?
It is important to know because when I start parsing the contents of the page, the trailing slash matters. Thanks.
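To illustrate why the trailing slash matters when turning links into full URLs, here is a deliberately simplified resolver. The function name resolveLink is made up for this example, and it ignores "..", query strings and fragments, which a complete RFC 3986 implementation would have to handle.

```php
<?php
// Simplified, illustrative resolver: turns a link found in a page into a
// full URL, relative to the URL the page was actually served from.
function resolveLink(string $baseUrl, string $link): string
{
    // Links that already carry a scheme are absolute; keep them unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }

    $parts  = parse_url($baseUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Root-relative links replace the whole path.
    if ($link !== '' && $link[0] === '/') {
        return $origin . $link;
    }

    // Otherwise keep the base path up to and including its last slash.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $link;
}
```

With this sketch, the link page.html resolves to http://www.example.com/page.html against the base http://www.example.com/somefolder, but to http://www.example.com/somefolder/page.html against http://www.example.com/somefolder/. So the base has to be the URL after the redirect, not the one originally requested.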
Re: How IE and FF understand URLs?
I'm not sure whether you can get the headers with fopen (you may not be able to), but if you use an HTTP library such as cURL (which you probably should anyway for any heavy lifting with HTTP requests), you can look at the headers of the response. A response code in the 30x range (for example 301 or 302) means a redirection.
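As a minimal sketch of the cURL approach, assuming the cURL extension is installed: the function names fetchFollowingRedirects and slashWasAdded are made up for this example. Instead of inspecting each 30x response by hand, it lets cURL follow the redirects and then asks where the request ended up.

```php
<?php
// Fetch a URL, follow any redirects, and report the final URL.
function fetchFollowingRedirects(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 30x responses
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);         // arbitrary limit for this sketch

    $body = curl_exec($ch);                         // false on failure

    return [
        'body'           => $body,
        'status'         => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'effective_url'  => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'redirect_count' => curl_getinfo($ch, CURLINFO_REDIRECT_COUNT),
    ];
}

// True when the final URL is the requested one plus a trailing slash.
function slashWasAdded(string $requested, string $effective): bool
{
    return $effective === $requested . '/';
}
```

A call such as fetchFollowingRedirects("http://www.example.com/somefolder") would give you effective_url, which is the URL to use as the base when resolving links in the page. As far as I know, if you stay with fopen, the HTTP wrapper also exposes the response headers through stream_get_meta_data($handle)['wrapper_data'], but I would check that against the manual before relying on it.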