building a URL string from parts

pgolovko
Forum Commoner
Posts: 38
Joined: Sun Sep 17, 2006 9:13 am

Post by pgolovko »

Code:

function GetUrls($url)
   {
     $info = @parse_url($url); // parse the url
     
     $html = $this->temp_everything; // gets what was sent back
     if (!$html) // check it's not false
     {
       return false; // if it is return false
     }
     
     $pieces = preg_split ("/(\r\n\r\n|\r\r|\n\n)/", $html, 2); // split the HTML from the headers
     $html = $pieces[1]; // save the HTML
     unset($pieces); // unset everything else
     
     // find all the urls
     preg_match_all("|href\=\"?'?`?([[]:?=&@/;._-]+)\"?'?`?|i", $html, &$matches);
     
     $links = array(); // make an array to store them in
     $ret = $matches[1];
     for($i=0;isset($ret[$i]);$i++)
     {
       // if it starts with http:// save it without editing
       if(preg_match("|^http://(.*)|i",$ret[$i]))
       {
         $links[] = $ret[$i];
       }
       
       // if it matches ../place.html
       elseif(preg_match("|^../(.*)|i",$ret[ $i]))
       {
         $links[] = 'http://'.$info["host"].''.$info["path"].''.substr($ret[$i], 3);
       }
       
       // if it matches ./place.html
       elseif(preg_match("|^./(.*)|i",$ret[ $i]))
       {
         $links[] = 'http://'.$info["host"].''.$info["path"].''.substr($ret[$i], 2);
       }
       
       // if it matches /place.html
       elseif(preg_match("|^/(.*)|i",$ret[ $i]))
       {
         $links[] = 'http://'.$info["host"].''.$ret[$i];
       }
       
       // if it matches place.html
       elseif(preg_match("|^(.*)|i",$ret[ $i]))
       {
         $links[] = 'http://'.$info["host"].''.$info["path"].''.$ret[$i];
       }
       
       // if it matches mailto:
       elseif(preg_match("/^mailto:(.*)/i",$ret[$i]))
       {
         // could save email addresses here
       }
     }
     
     return $links ; // return the array of links
   }
The links could be anything from ./file.html to ../../directory/sub/file.html.
I need to reconstruct them into a full URL such as http://www.server.com/directory/sub/file.html.
The main problem in reconstructing the URL is with the following two cases:

Code:

// if it matches ../place.html 
       elseif(preg_match("|^../(.*)|i",$ret[ $i])) 
       { 
         $links[] = 'http://'.$info["host"].''.$info["path"].''.substr($ret[$i], 3); 
       } 
        
       // if it matches ./place.html 
       elseif(preg_match("|^./(.*)|i",$ret[ $i])) 
       { 
         $links[] = 'http://'.$info["host"].''.$info["path"].''.substr($ret[$i], 2); 
       }
I know I'm not counting the dots correctly. Can anyone show me how to work this out properly?
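One way to "count the dots" is to stop matching prefixes one case at a time: join the link to the base directory, then collapse the . and .. segments in the result. Below is a minimal sketch of that idea, loosely following the "remove dot segments" step described in RFC 3986. The name removeDotSegments is a hypothetical helper, not part of the code above, and it assumes the path passed in is already absolute (starts with /).

```php
<?php
// Hypothetical helper (not from the original post): collapse "." and ".."
// segments in an absolute URL path.
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $out = [];

    foreach ($segments as $segment) {
        if ($segment === '.') {
            continue;               // "." means "this directory": drop it
        }
        if ($segment === '..') {
            // ".." means "parent directory": drop the previous segment,
            // but never the leading empty segment that represents the root
            if (count($out) > 1) {
                array_pop($out);
            }
            continue;
        }
        $out[] = $segment;
    }

    // A path ending in "." or ".." refers to a directory: keep a trailing slash
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = '';
    }

    return implode('/', $out);
}

// Base directory plus relative link, then normalise
echo 'http://www.server.com'
    . removeDotSegments('/directory/sub/' . '../../directory/sub/file.html');
// prints http://www.server.com/directory/sub/file.html
```

With this approach the ./ and ../ branches collapse into one: any link that does not start with a scheme or a / is appended to the base directory and normalised, however many ../ steps it contains.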
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

The forum may have removed (can't explain why) some pieces of your regex, most likely a POSIX-style character class such as [:space:].
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

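If you want to match literal dots, add a backslash in front of them. An unescaped dot matches any single character, so |^../(.*)| also matches links such as ab/place.html.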
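A small sketch of the difference, using a made-up link value for illustration. The patterns are in single quotes so PHP passes the backslashes through to the regex engine unchanged.

```php
<?php
$link = 'ab/place.html'; // illustrative value: NOT a "../" link

// Unescaped: each "." matches any character, so "ab/" is accepted
$loose = preg_match('|^../(.*)|', $link);

// Escaped: only a literal "../" prefix matches
$strict = preg_match('|^\.\./(.*)|', $link);

var_dump($loose);  // int(1)
var_dump($strict); // int(0)
```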
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

feyd wrote: The forum may have removed (can't explain why)
Why do you think so?
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Typically, when I want to do something like this, I use realpath() to get the full filesystem path to the file, make sure its leading portion matches the document root, and then simply replace the document root with the transport scheme, domain and port as per normal.
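A rough sketch of that realpath() approach follows. The function name pathToUrl and its parameters are hypothetical, chosen only for illustration. Note that realpath() resolves paths on the local filesystem and returns false for files that do not exist, so this only applies when the linked files live on the same machine as the script, not to links scraped from a remote page.

```php
<?php
// Hypothetical helper: map a local file under $docRoot to a URL under $baseUrl.
// Returns null if the file is missing or lies outside the document root.
function pathToUrl(string $file, string $docRoot, string $baseUrl): ?string
{
    $real = realpath($file);     // resolves "." and ".." and symlinks
    $root = realpath($docRoot);
    if ($real === false || $root === false) {
        return null;
    }

    $root = rtrim($root, DIRECTORY_SEPARATOR);
    $prefix = $root . DIRECTORY_SEPARATOR;

    // The leading portion of the resolved path must be the document root
    if (strncmp($real, $prefix, strlen($prefix)) !== 0) {
        return null;
    }

    // Swap the document root for the scheme and host, using URL separators
    $relative = substr($real, strlen($root));
    return rtrim($baseUrl, '/') . str_replace(DIRECTORY_SEPARATOR, '/', $relative);
}
```

For example, with a document root of /var/www (an assumed path), pathToUrl('/var/www/sub/../sub/file.html', '/var/www', 'http://www.server.com') would give http://www.server.com/sub/file.html, provided that file exists.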

I find the following an odd way to express the quoting structure.

Code:

href\=\"?'?`?
It effectively would allow href="`, but not href=`" to be a legal reference.
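One possible way to tighten the quoting is to capture the opening quote and require the same character to close the value with a backreference. This is only a sketch: the sample HTML and the character class for the URL are assumptions for illustration, not the original poster's pattern.

```php
<?php
// Illustrative input: double-quoted, single-quoted and unquoted href values
$html = '<a href="a.html">A</a> <a href=\'b.html\'>B</a> <a href=c.html>C</a>';

// Group 1 captures the optional opening quote; \1 demands the same closing
// quote, so mismatched pairs such as href="x.html' are rejected.
$pattern = '~href\s*=\s*(["\'`]?)([^"\'`\s>]+)\1~i';

// $matches is passed without "&": the function already takes it by reference,
// and call-time pass-by-reference (&$matches) is an error since PHP 5.4.
preg_match_all($pattern, $html, $matches);

$urls = $matches[2]; // ['a.html', 'b.html', 'c.html']
```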
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Weirdan wrote: Why do you think so?
I've seen it happen before, and

Code:

preg_match_all("|href\="?'?`?([[]:?=&@/;._-]+)"?'?`?|i", $html, &$matches);
looks a bit odd, although I've used similar syntax structures before.
pgolovko
Forum Commoner
Posts: 38
Joined: Sun Sep 17, 2006 9:13 am

Post by pgolovko »

If, for example, my $url is http://server.com/dir/sub/, then the following will reconstruct the ./place.ptml links correctly:

Code:

// if it matches ./place.html
       elseif(preg_match("|^./(.*)|i",$ret[ $i]))
       {
         $links[] = 'http://'.$info["host"].''.$info["path"].''.substr($ret[$i], 2);
       }
Though if the $url is http://server.com/dir/sub/index.html then the above code produces an incorrect link: http://server.com/dir/sub/index.htmlplace.ptml

See my problem?
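The underlying issue is that $info["path"] may end in a file name rather than a directory. A minimal sketch of one fix is to keep only the part of the path up to and including its last slash before appending the link. The helper name baseDirectory is hypothetical, and it assumes that a final segment without a trailing slash is a file, which is how relative references are usually resolved.

```php
<?php
// Hypothetical helper: directory part of a URL's path, always ending in "/".
function baseDirectory(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return '/';                 // no path at all: use the site root
    }

    $lastSlash = strrpos($path, '/');
    if ($lastSlash === false) {
        return '/';                 // no slash in the path: fall back to root
    }

    // Keep everything up to and including the last slash
    return substr($path, 0, $lastSlash + 1);
}

echo baseDirectory('http://server.com/dir/sub/index.html') . 'place.ptml';
// prints /dir/sub/place.ptml
```

Both http://server.com/dir/sub/ and http://server.com/dir/sub/index.html then yield /dir/sub/ as the base, so the link is appended in the right place either way.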
pgolovko
Forum Commoner
Posts: 38
Joined: Sun Sep 17, 2006 9:13 am

Post by pgolovko »

I found a similar problem and solution written in VB: http://www.vbforums.com/showthread.php?t=414220
This is exactly the problem I have with the above code.