building a URL string from parts

Posted: Sat Oct 21, 2006 2:52 pm
by pgolovko

Code: Select all

function GetUrls($url)
   {
     $info = @parse_url($url); // parse the url
     
     $html = $this->temp_everything; // gets what was sent back
     if (!$html) // check it's not false
     {
       return false; // if it is return false
     }
     
     $pieces = preg_split ("/(\r\n\r\n|\r\r|\n\n)/", $html, 2); // split the HTML from the headers
     $html = $pieces[1]; // save the HTML
     unset($pieces); // unset everything else
     
     // find all the urls
     preg_match_all("|href\=\"?'?`?([[]:?=&@/;._-]+)\"?'?`?|i", $html, &$matches);
     
     $links = array(); // make an array to store them in
     $ret = $matches[1];
     for($i=0;isset($ret[$i]);$i++)
     {
       // if it starts with http:// save it without editing
       if(preg_match("|^http://(.*)|i",$ret[$i]))
       {
         $links[] = $ret[$i];
       }
       
       // if it matches ../place.html
       elseif(preg_match("|^../(.*)|i",$ret[ $i]))
       {
         $links[] = 'http://'.$info["host"].''.$info["path"].''.substr($ret[$i], 3);
       }
       
       // if it matches ./place.html
       elseif(preg_match("|^./(.*)|i",$ret[ $i]))
       {
         $links[] = 'http://'.$info["host"].''.$info["path"].''.substr($ret[$i], 2);
       }
       
       // if it matches /place.html
       elseif(preg_match("|^/(.*)|i",$ret[ $i]))
       {
         $links[] = 'http://'.$info["host"].''.$ret[$i];
       }
       
       // if it matches place.html
       elseif(preg_match("|^(.*)|i",$ret[ $i]))
       {
         $links[] = 'http://'.$info["host"].''.$info["path"].''.$ret[$i];
       }
       
       // if it matches mailto:
       elseif(preg_match("/^mailto:(.*)/i",$ret[$i]))
       {
         // could save email addresses here
       }
     }
     
     return $links ; // return the array of links
   }
The links could be anything from ./file.html to ../../directory/sub/file.html.
I need to reconstruct them into a full URL such as http://www.server.com/directory/sub/file.html.
The main problem is with the following two branches:

Code: Select all

// if it matches ../place.html 
       elseif(preg_match("|^../(.*)|i",$ret[ $i])) 
       { 
         $links[] = 'http://'.$info["host"].''.$info["path"].''.substr($ret[$i], 3); 
       } 
        
       // if it matches ./place.html 
       elseif(preg_match("|^./(.*)|i",$ret[ $i])) 
       { 
         $links[] = 'http://'.$info["host"].''.$info["path"].''.substr($ret[$i], 2); 
       }
I know I'm not counting the dots correctly. Can anyone show me how to work this out properly?
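
One way to avoid counting dots by hand is to join the base directory and the link first, then walk the path segment by segment. The following is a minimal sketch under that assumption; collapseDotSegments() is a hypothetical helper name, not part of the code above, and it only deals with the path portion of the URL.

```php
<?php
// Hypothetical helper (not from the original post): collapse "." and ".."
// segments in an absolute URL path such as "/dir/sub/../../file.html".
function collapseDotSegments($path)
{
    $segments = explode('/', $path);
    $last = end($segments);
    $out = array();

    foreach ($segments as $segment) {
        if ($segment === '..') {
            array_pop($out);      // step up one level; does nothing at the root
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;    // ordinary segment, keep it
        }
    }

    $collapsed = '/' . implode('/', $out);

    // Keep a trailing slash when the input pointed at a directory.
    if (!empty($out) && in_array($last, array('', '.', '..'), true)) {
        $collapsed .= '/';
    }
    return $collapsed;
}
```

With a base directory of /dir/sub/ and a link of ../../directory/sub/file.html, calling collapseDotSegments('/dir/sub/' . $link) gives /directory/sub/file.html, which can then be appended to 'http://' . $info["host"].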

Posted: Sat Oct 21, 2006 2:54 pm
by feyd
The forum may have removed (can't explain why) some pieces of your regex, most likely a POSIX-style character class like :space:

Posted: Sat Oct 21, 2006 2:56 pm
by Weirdan
If you want to match literal dots, put a backslash in front of each one (\.); an unescaped . matches any character.
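
To illustrate the difference, here is a small sketch comparing the unescaped pattern from the original code with an escaped version. The variable names and sample strings are made up for the example.

```php
<?php
$unescaped = '|^../|';    // "." matches any character, so this means "any two characters, then a slash"
$escaped   = '|^\.\./|';  // "\." matches only a literal dot

// A link that does NOT start with "../" still matches the unescaped pattern.
$falsePositive = preg_match($unescaped, 'ab/place.html');

// The escaped pattern rejects it and accepts only a real "../" prefix.
$rejected = preg_match($escaped, 'ab/place.html');
$accepted = preg_match($escaped, '../place.html');
```

Here $falsePositive and $accepted are 1, while $rejected is 0.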

Posted: Sat Oct 21, 2006 2:57 pm
by Weirdan
feyd wrote:The forum may have removed (can't explain why)
Why do you think so?

Posted: Sat Oct 21, 2006 3:00 pm
by feyd
Typically, when I want to do something like this, I use realpath() to get the full system path to the file, then make sure the leading part matches the document root, and then simply replace the document root with the transport scheme, domain, and port.
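
A rough sketch of that idea follows. It assumes the files live on the local filesystem, so it would not apply to links scraped from a remote page. pathToUrl() and its parameters are hypothetical names chosen for the illustration.

```php
<?php
// Hypothetical helper: turn a file path under $docRoot into a URL that
// starts with $baseUrl (scheme, domain and optional port).
function pathToUrl($file, $docRoot, $baseUrl)
{
    $real = realpath($file);      // resolves "." and ".." and symlinks
    $root = realpath($docRoot);
    if ($real === false || $root === false) {
        return false;             // the file or the root does not exist
    }

    // Make sure the leading edge of the path is the document root.
    $root = rtrim($root, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;
    if (strncmp($real, $root, strlen($root)) !== 0) {
        return false;             // the file is outside the document root
    }

    // Swap the document root for the base URL.
    $relative = substr($real, strlen($root));
    return rtrim($baseUrl, '/') . '/' . str_replace(DIRECTORY_SEPARATOR, '/', $relative);
}
```

For example, with a document root of /var/www (an assumed path), pathToUrl('/var/www/dir/other/../sub/file.html', '/var/www', 'http://www.server.com') would give http://www.server.com/dir/sub/file.html, provided the file exists.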

I find the following an odd way to express the quoting structure.

Code: Select all

href\=\"?'?`?
It effectively would allow href="`, but not href=`" to be a legal reference.
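
One possible alternative, shown only as a sketch, is to capture the opening quote and require the same character to close the value with a backreference. The pattern and the sample HTML below are illustrative and are not the original regex.

```php
<?php
// (["'`]?) captures an optional opening quote; \1 requires the same
// character (or nothing, for unquoted values) at the end of the value.
$pattern = '/href\s*=\s*(["\'`]?)([^"\'`\s>]+)\1/i';

$html = '<a href="a.html">A</a> <a href=\'b.html\'>B</a> <a href=c.html>C</a>';

preg_match_all($pattern, $html, $matches);
$urls = $matches[2];  // the second group holds the link targets
```

With this sample input, $urls contains a.html, b.html and c.html, and a mixed opener such as href="` does not match at all.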

Posted: Sat Oct 21, 2006 3:02 pm
by feyd
Weirdan wrote:Why do you think so?
I've seen it happen before, and

Code: Select all

preg_match_all("|href\="?'?`?([[]:?=&@/;._-]+)"?'?`?|i", $html, &$matches);
looks a bit odd, although I've used similar syntax structures before.

Posted: Sat Oct 21, 2006 3:07 pm
by pgolovko
If, for example, my $url is http://server.com/dir/sub/ then the following reconstructs ./place.html links correctly:

Code: Select all

// if it matches ./place.html
       elseif(preg_match("|^./(.*)|i",$ret[ $i]))
       {
         $links[] = 'http://'.$info["host"].''.$info["path"].''.substr($ret[$i], 2);
       }
But if the $url is http://server.com/dir/sub/index.html, the same code produces an incorrect link: http://server.com/dir/sub/index.htmlplace.html

See my problem?
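
A possible way around this is to cut the path back to its last slash before appending the link, so that a trailing file name such as index.html is dropped. The sketch below assumes that approach; baseDirectory() is a hypothetical helper, and it hard-codes http:// the same way the code above does.

```php
<?php
// Hypothetical helper: return the scheme, host and directory part of a URL,
// always ending in a slash, with any trailing file name removed.
function baseDirectory($url)
{
    $info = parse_url($url);
    if ($info === false || !isset($info['host'])) {
        return false;  // malformed URL or no host to build on
    }

    $path = isset($info['path']) ? $info['path'] : '/';

    // Keep everything up to and including the last slash.
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return 'http://' . $info['host'] . $dir;
}
```

Both http://server.com/dir/sub/ and http://server.com/dir/sub/index.html then yield http://server.com/dir/sub/, so baseDirectory($url) . substr($ret[$i], 2) produces http://server.com/dir/sub/place.html in either case. Note that a URL like http://server.com/dir/sub with no trailing slash is treated as pointing at a file named sub, which is also how browsers resolve relative links.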

Posted: Sat Oct 21, 2006 3:26 pm
by pgolovko
I found a similar problem and solution written in VB: http://www.vbforums.com/showthread.php?t=414220
This is exactly the problem I have with the above code.