PHP Developers Network

A community of PHP developers offering assistance, advice, discussion, and friendship.
 

All times are UTC - 5 hours




 Post subject: Regular Expression Help
Posted: Fri Jan 21, 2011 5:32 pm
Forum Commoner

Joined: Fri May 29, 2009 2:51 pm
Posts: 37
Hi there,

I am trying to extract the root domain from each URL in a list.

I am having trouble writing a regular expression to do this.

Here is my code so far. Could somebody please correct my regular expression?


$array = array('http://www.example.com/',
'http://www.example.com/a/a.com',
'http://www.example.com/a.com',
'http://example.com/',
'http://example.com/a/a.com',
'http://example.com/a.com',
'https://www.example.com/',
'https://www.example.com/a/a.com',
'https://www.example.com/a.com',
'https://example.com/',
'https://example.com/a/a.com',
'https://example.com/a.com',
'http://www.a.example.com/',
'http://www.a.example.com/a/a.com',
'http://www.a.example.com/a.com',
'http://a.example.com/',
'http://a.example.com/a/a.com',
'http://a.example.com/a.com',
'https://www.a.example.com/',
'https://www.a.example.com/a/a.com',
'https://www.a.example.com/a.com',
'https://a.example.com/',
'https://a.example.com/a/a.com',
'https://a.example.com/a.com');

// Check links from the same root URL
for ($x = 0; $x < count($array); $x++) {
    // Extract root domain
    if (preg_match('/[http|https]+:\/\/[|www\.]*(.*)\//', $array[$x], $root_url)) {
        echo $root_url[1].'<br>';
    }
}


The results should contain only:

example.com
a.example.com

Thanks for your time!


Posted: Fri Jan 21, 2011 5:40 pm
Site Admin

Joined: Tue Dec 23, 2003 3:10 am
Posts: 11470
Location: Toronto
I would prefer to create a callback that extracts the domains with parse_url(). Built-in functions are more reliable than a regex, plus I'm just not a regex guru :D

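function extractDomain($url) {
   // parse_url() returns false for seriously malformed URLs and omits 'host' when there is none
   $parts = parse_url($url);
   if ($parts !== false && isset($parts['host'])) {
      return $parts['host'];
   }
   return false;
}

// $domainlist: a string holding one URL per line
$domains = array_map('extractDomain', explode(PHP_EOL, $domainlist));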


Or, if you're crazy and like one-liners:

$domains = array_map(function ($a) { $b = parse_url($a); return ($b !== false && isset($b['host'])) ? $b['host'] : false; }, explode(PHP_EOL, $domainlist));
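
Note that parse_url() returns the host with any "www." still attached, so it does not by itself give the two results asked for in the first post. A minimal sketch of one way to get there, assuming that "root domain" simply means the host minus a leading "www." (rootHost is a made-up helper name, not something from this thread):

```php
<?php
// Hypothetical helper: host of a URL without a leading "www.".
function rootHost(string $url): ?string
{
    // With PHP_URL_HOST, parse_url() returns the host string,
    // null if there is no host, or false if the URL is seriously malformed.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    // Lower-case the host and drop a single leading "www." label.
    return preg_replace('/^www\./i', '', strtolower($host));
}

$urls = array(
    'http://www.example.com/',
    'https://example.com/a/a.com',
    'http://www.a.example.com/a.com',
    'https://a.example.com/',
);

// Map every URL to its host, drop failures, then de-duplicate.
$hosts = array_values(array_unique(array_filter(array_map('rootHost', $urls))));

echo implode(PHP_EOL, $hosts), PHP_EOL;
```

For these four sample URLs the list comes out as example.com and a.example.com. It makes no attempt to handle public suffixes such as co.uk; that would need a suffix list, which is outside this sketch.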


Posted: Fri Jan 21, 2011 5:46 pm
Forum Commoner

Joined: Fri May 29, 2009 2:51 pm
Posts: 37
Thanks John!

I know that I can use the parse_url function, but I am looking to do it with a regex :)

I am currently trying to learn regular expressions ^^


Posted: Fri Jan 21, 2011 6:24 pm
Briney Mod

Joined: Mon Jan 19, 2004 7:11 pm
Posts: 6445
Location: 53.01N x 112.48W
If you want to stick with Regex, this should be in the Regex forum. Moving.

_________________
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.


Posted: Fri Jan 21, 2011 6:35 pm
Briney Mod

Joined: Mon Jan 19, 2004 7:11 pm
Posts: 6445
Location: 53.01N x 112.48W
I believe this pattern works:
  :([\w-]*\.\w*)/:


The wordy description of the pattern would be:
Quote:
match the first string that starts with any number of alphanumeric characters or dashes, followed by a single period, followed by any number of alphanumeric characters, followed by a slash. I want everything matched before the slash.


One caveat: I'm assuming the only valid characters in a Top Level Domain are a-z, A-Z, 0-9, underscores and dashes.

Note the non-typical-but-still-completely-valid pattern delimiters.
Also note that the pattern doesn't try to make sure http(s):// is at the start. Since you know it's a URL, you can assume that, and skip over it.
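
For anyone comparing this with the attempt in the first post: there, [http|https] and [|www\.] are character classes (each matches any single one of the listed characters), not alternations, and the greedy (.*)\/ runs on to the last slash in the URL. The pattern above avoids that, but for a host like a.example.com it captures only example.com. Here is a hedged sketch of a variant that keeps the subdomain, assuming every input starts with http:// or https:// and carries no userinfo ("user:pass@") part:

```php
<?php
// Sketch only: assumes each input starts with http:// or https://.
// "~" is the delimiter, so the slashes need no escaping.
//   https?      "http" with an optional "s" (a group-free alternative to [http|https])
//   (?:www\.)?  an optional literal "www." that is not captured
//   ([^/:?#]+)  the host: everything up to the first "/", ":", "?" or "#"
$pattern = '~^https?://(?:www\.)?([^/:?#]+)~i';

$urls = array(
    'http://www.example.com/',
    'https://example.com/a/a.com',
    'http://www.a.example.com/a/a.com',
    'https://a.example.com/a.com',
);

$roots = array();
foreach ($urls as $url) {
    if (preg_match($pattern, $url, $m)) {
        $roots[$m[1]] = true;   // array keys de-duplicate the hosts
    }
}

echo implode('<br>', array_keys($roots));

// For comparison, the pattern posted above drops the subdomain here:
preg_match(':([\w-]*\.\w*)/:', 'http://a.example.com/', $m2);
```

On the four sample URLs the keys of $roots are example.com and a.example.com, while $m2[1] holds example.com. Because the capture stops at the first colon, a URL with userinfo would yield the user name instead of the host, which is one more reason parse_url() is the safer tool outside of a learning exercise.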



Posted: Wed Jan 26, 2011 1:42 pm
Forum Contributor

Joined: Sun Jul 05, 2009 10:39 pm
Posts: 214
Location: SLC, UT
This new url_valid function checks the validity of a URL and, if it is valid, returns an array of the URI components (scheme, authority, userinfo, host, port, path, query, fragment, etc.). The component you will probably be interested in is the "host" (it is "the domain": everything up to the first forward slash, if one exists).
//
// function url_valid($url) {
//
// Return associative array of valid URI components, or FALSE if $url is not
// RFC-3986 compliant. If the passed URL begins with: "www." or "ftp.", then
// "http://" or "ftp://" is prepended and the corrected full-url is stored in
// the return array with a key name "url". This value should be used by the caller.
//
// Return value: FALSE if $url is not valid, otherwise array of URI components:
// e.g.
// Given: "http://www.jmrware.com:80/articles?height=10&width=75#fragone"
// Array(
//    [scheme] => http
//    [authority] => www.jmrware.com:80
//    [userinfo] =>
//    [host] => www.jmrware.com
//    [IP_literal] =>
//    [IPV6address] =>
//    [ls32] =>
//    [IPvFuture] =>
//    [IPv4address] =>
//    [regname] => www.jmrware.com
//    [port] => 80
//    [path_abempty] => /articles
//    [query] => height=10&width=75
//    [fragment] => fragone
//    [url] => http://www.jmrware.com:80/articles?heig ... 75#fragone
// )
function url_valid($url) {
        if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
        if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
        if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
        ^
        (?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
        (?P<authority>
          (?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)@)?
          (?P<host>
            (?P<IP_literal>
              \[
              (?:
                (?P<IPV6address>
                  (?:                                                (?:[0-9A-Fa-f]{1,4}:){6}
                  |                                                ::(?:[0-9A-Fa-f]{1,4}:){5}
                  | (?:                          [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}:
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
                  )
                  (?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
                  | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
                  )
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::
                )
              | (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
              )
              \]
            )
          | (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                               (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
          | (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
          )
          (?::(?P<port>[0-9]*))?
        )
        (?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)
        (?:\?(?P<query>       (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        (?:\#(?P<fragment>    (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        $
                /mx'
, $url, $m)) return FALSE;
        switch ($m['scheme']) {
        case 'https':
        case 'http':
                if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
                break;
        case 'ftps':
        case 'ftp':
                break;
        default:
                return FALSE;   // Unrecognised URI scheme. Default to FALSE.
        }
        // Validate host name conforms to DNS "dot-separated-parts".
        if ($m['regname']) { // If host regname specified, check for DNS conformance.
                if (!preg_match('/^(?!.{256})(?:[0-9A-Za-z]\.|[0-9A-Za-z][-0-9A-Za-z]{0,61}[0-9A-Za-z]\.)+(?:com|edu|gov|int|mil|net|org|biz|info|name|pro|aero|coop|museum|arpa|asia|cat|jobs|mobi|tel|travel|[A-Za-z]{2})$/im', $m['host'])) return FALSE;
        }
        $m['url'] = $url;
        for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
        return $m; // return TRUE == array of useful named $matches plus the valid $url.
}
 


This function is the result of an article I'm writing on the subject: Regular Expression URI Validation.

