url format issue

pedrotuga
Forum Contributor
Posts: 249
Joined: Tue Dec 13, 2005 11:08 pm

Post by pedrotuga »

I am developing my own web spider and I got stuck pretty much at the beginning.

This might be more of an HTTP question than a PHP one.

Here is the problem: in order to follow relative links like

href="somepage.html"

I need to know whether the URL I downloaded the page from refers to a directory or a file.
For example, both of the following URLs would return the same page:
http://mydomain.com/path
http://mydomain.com/path/

If I find links like the one I mentioned above, they would only take me to the right destination in one of the two cases.
How do I get around this problem?

In other words: how do I check the path of a given URL?

Thanks
superdezign
DevNet Master
Posts: 4135
Joined: Sat Jan 20, 2007 11:06 pm

Post by superdezign »

parse_url() may be of interest.

Post by pedrotuga »

superdezign wrote: parse_url() may be of interest.
That function is handy, but the problem I mentioned above still exists.

Post by superdezign »

Then maybe you should simplify what you're after with some code. Sometimes, it's easier to understand than your explanation.

Post by pedrotuga »

superdezign wrote: Then maybe you should simplify what you're after with some code. Sometimes, it's easier to understand than your explanation.
If I don't know how to approach the problem, what code would I post?
OK, I can show how the problem came up:

Code: Select all

$url = "http://example.com/path";
$url_pieces = parse_url($url);
$page = file_get_contents($url);

// here goes some regex and stuff to fetch the links into $links,
// and some other checking

// now here comes the trouble, when I try to rebuild relative links
foreach ($links as &$link) {
    if ($link[0] == "/") {
        // root-relative link
        $link = "http://" . $url_pieces["domain"] . $link;
    } elseif (preg_match("#^http://#", $link) == 1) {
        // well... nothing to do in this case
    } else {
        // here is where I have trouble...
        // how do I know whether I should do this...
        $link = "http://" . $url_pieces["domain"] . $url_pieces["path"] . "/" . $link;
        // or this...
        // $link = "http://" . $url_pieces["domain"] . $url_pieces["path"] . $link;
        // note that in one case I add a slash and in the other I don't
    }
}
unset($link); // break the reference left over from the loop

Related to this...
When my browser downloads a page like:
http://domain.com/path
how does it know when to add a slash at the end and when not to?
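
For reference, a minimal sketch of how the two cases in the else branch can be told apart from the text of the URL alone. It assumes the rule from RFC 3986 (section 5.2.3) that a relative reference replaces everything after the last slash of the base URL's path, which is what browsers apply to the address the page was finally served from. The function name resolve_link() is hypothetical, and the sketch ignores "../" segments, "//host" links, query strings, and fragments.

```php
<?php
// Hypothetical helper: resolve a link against the URL of the page it was
// found on, using only the text of that URL (simplified RFC 3986 merge).
function resolve_link(string $base, string $link): string
{
    // Absolute links are returned unchanged.
    if (preg_match('#^https?://#i', $link) === 1) {
        return $link;
    }

    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    // Root-relative links replace the whole path.
    if ($link !== '' && $link[0] === '/') {
        return $root . $link;
    }

    // Otherwise keep the base path up to and including its last slash.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $root . $dir . $link;
}

echo resolve_link('http://example.com/path', 'somepage.html'), "\n";
// http://example.com/somepage.html
echo resolve_link('http://example.com/path/', 'somepage.html'), "\n";
// http://example.com/path/somepage.html
```

Under this rule, http://example.com/path and http://example.com/path/ resolve the same link differently, so a spider would need to use the final URL after any redirects as the base, not necessarily the URL it first requested.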

Post by superdezign »

I'm not sure, but I don't think your browser adds the slash. I think the server you're connecting to does it. It knows what is and isn't a directory and what does or doesn't have an index file.

Well, firstly, you aren't using parse_url() correctly. There is no 'domain' index. Also, I'm not sure whether file_get_contents() will return a file if it is just given a URL without a filename attached (however, looking at your code, you'd know better than I do whether file_get_contents() worked).

If file_get_contents() doesn't work, it returns false. So then you'd try different, common values (unless there's a way to determine the index file through PHP) such as index.html, index.php, index.cfm, etc. If it still doesn't work, it's almost safe to assume that the file doesn't exist.

As for using parse_url(), you should read the documentation. It gives you the indexes 'scheme', 'host', 'port', 'user', 'pass', 'path', 'query', and 'fragment' from a URL, and returns false if the URL is seriously malformed.

You've pretty much got everything that you need aside from access to the remote servers.
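
As a quick illustration of those indexes, here is a small sketch that prints what parse_url() returns. The URLs are made up for the example; the point is that only the components actually present in the URL appear in the array, so 'path' may be missing entirely.

```php
<?php
// Example URL invented for illustration; it contains every component.
$parts = parse_url('http://user:secret@example.com:8080/path/page.php?id=1#top');

foreach ($parts as $key => $value) {
    echo $key, ' => ', $value, "\n";
}

// Components missing from the URL are simply absent from the array,
// so check with isset() before using them.
$bare = parse_url('http://example.com');
var_dump(isset($bare['host'])); // bool(true)
var_dump(isset($bare['path'])); // bool(false)
```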

Post by pedrotuga »

I didn't test that code; I just wrote it to illustrate my problem. The correct index name would be 'host', not 'domain'. Sorry about that.

file_get_contents() will work in this case. If the argument is a URL, it returns the body of the HTTP response, without the headers of course, as a string.

OK... maybe somebody who is familiar with the HTTP protocol could help here.
Does the server answer the same way when asked for
http://mydomain.com/path
and
http://mydomain.com/path/
?

If so, how does my browser know how to handle relative links?

superdezign, as for the server adding a slash... I am not sure about this, but as far as I know the server only replies to requests, and its answers consist of the response headers and the data, so I don't see how the server would add a slash. But this is just me thinking; as I said, I haven't mastered the protocol.

Post by superdezign »

As in, when you go to the server and request http://mydomain.com/path, it will send you to http://mydomain.com/path/, which sends you to http://mydomain.com/path/index.php.

Of course, I'm not positive, but I'm pretty sure that the request is handled by the server, not the browser. Otherwise, servers couldn't make http://mydomain.com/images/photo.jpg counted as http://mydomain.com/images/photo (which a lot of them do).

The browser doesn't have information on the file structure of the server, the server does.

Post by pedrotuga »

superdezign wrote: As in, when you go to the server and request http://mydomain.com/path, it will send you to http://mydomain.com/path/, which sends you to http://mydomain.com/path/index.php.

Of course, I'm not positive, but I'm pretty sure that the request is handled by the server, not the browser. Otherwise, servers couldn't make http://mydomain.com/images/photo.jpg counted as http://mydomain.com/images/photo (which a lot of them do).

The browser doesn't have information on the file structure of the server, the server does.

superdezign, please don't take offence, but I think you are more confused than I am. The server can't "send you" anywhere; it only replies to HTTP requests. If you request http://mydomain.com/path and there is a folder with that name, it will send you the first index file it finds in the list defined in its configuration file, httpd.conf for example. The server itself can't change the address you request. At most it can send a redirect response and let the client decide what to do with it.
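
To make the redirect case concrete, here is a rough sketch of how a client could spot one. It assumes the header lines come in the list-of-strings format that get_headers($url) returns; the function name redirect_target() and the sample header lines are made up for illustration. Some servers (Apache with mod_dir, for example) commonly answer a request for a directory without the trailing slash with exactly this kind of redirect.

```php
<?php
// Hypothetical helper: given response header lines (the format returned by
// get_headers($url)), return the first Location target, or null if none.
function redirect_target(array $headerLines): ?string
{
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, hence stripos().
        if (stripos($line, 'Location:') === 0) {
            return trim(substr($line, strlen('Location:')));
        }
    }
    return null;
}

// Sample header lines, invented for this example.
$headers = [
    'HTTP/1.1 301 Moved Permanently',
    'Location: http://example.com/path/',
    'Content-Type: text/html',
];

echo redirect_target($headers), "\n";
// http://example.com/path/
```

If a redirect target is present, that is the address the page was actually served from, and it is the one to treat as the base when rebuilding relative links. Note that file_get_contents() follows redirects by default, so the requested URL and the served URL can differ without the script noticing.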

Please, let's stay focused on my original question... given a URL like http://domain.com/path, how do I find out whether 'path' is a file or a folder?

Post by superdezign »

Well, if file_get_contents() works on a path that ends in a folder, why does it matter?

If file_get_contents() didn't work on folders, then that'd be your solution. I'd suggest regex, but I'm not entirely sure that folders can't have periods in the middle of them.


However, if you'd like to limit the types of files you check (*.php, *.html, *.aspx, etc.), then you could easily regex those values.

Post by pedrotuga »

superdezign wrote: Well, if file_get_contents() works on a path that ends in a folder, why does it matter?
It matters because I need to follow links, and when they are relative links I need to rebuild them. To do that I need to know the path, and to know the path I need to know whether the last piece of the URL refers to a folder or a file.

superdezign wrote: If file_get_contents() didn't work on folders, then that'd be your solution. I'd suggest regex, but I'm not entirely sure that folders can't have periods in the middle of them.
As far as I know they very well can; otherwise that would be a solution. And we have to consider that a lot of sites use mod_rewrite.

superdezign wrote: However, if you'd like to limit the types of files you check (*.php, *.html, *.aspx, etc.), then you could easily regex those values.
That is another common mistake. The content type in the HTTP protocol is not defined by the extension of the file.

Does the server specify in the response headers whether the request hit a directory?

Post by superdezign »

I tried messing around with the headers and no, the server doesn't give any clue to whether or not you're in a directory or on a file. Makes me wonder how accurate the other spiders on the web actually are....

I think I've got a pretty solid solution though... Mixing parse_url() with pathinfo().

Code: Select all

$urlParts = parse_url($url);
if(!isset($urlParts['path']))
{
    // You're in a directory (the main one)
    $relativeDir = $url;
}
else
{
    $pathParts = pathinfo($urlParts['path']);
   
    if(!isset($pathParts['basename']))
    {
        // You're in a directory
        $relativeDir = $url;
    }
    else
    {
        $relativeDir = $urlParts['scheme'] . '://' . $urlParts['host'] . $pathParts['dirname'];
    }
}

echo $relativeDir;
Try it out on a few URLs and see what you get.

Edit: Not quite. pathinfo() isn't exactly magical, but at least I think we're making progress.

Post by superdezign »

Okay, since it was frustrating me, I played around with it some more.

Code: Select all

$pathParts = pathinfo($url);

if(!isset($pathParts['basename']))
{
	// You're in a directory
	$relativeDir = $url;
}
else if(!isset($pathParts['extension']))
{
	$relativeDir = $pathParts['dirname'] . '/' . $pathParts['basename'];
}
else
{
	$relativeDir = $pathParts['dirname'];
}

echo $relativeDir;
This uses only pathinfo(). It's pretty solid, except when people have files named without extensions. I guess you'll have to come up with a way to combat that.
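
One more limitation worth checking, shown as a small sketch with made-up URLs: pathinfo() treats its argument as a filesystem-style path and drops a trailing slash, so it reports the same parts for a URL with and without one.

```php
<?php
// Compare what pathinfo() reports for a URL with and without a trailing slash.
$withSlash    = pathinfo('http://example.com/path/');
$withoutSlash = pathinfo('http://example.com/path');

print_r($withSlash);
print_r($withoutSlash);

// The trailing slash is dropped, so both arrays come out identical.
var_dump($withSlash == $withoutSlash);
```

Since the trailing slash is the only hint the URL text itself carries, any approach built on pathinfo() alone loses it.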

Post by pedrotuga »

OK, I will try a couple of URLs...

At this point I am also wondering how accurate the crawlers out there are.

I am still skeptical; there should be a way to find out.

Post by pedrotuga »

OK... I tried it.

There must be a way.

How the heck do I know if the URL refers to a directory or to a file?

Like... if I request a directory on my site, my browser quickly adds a slash at the end.
If I request, for example, http://en.wikipedia.org/wiki/Article
it doesn't. Why?