
Regular Expression Help

Posted: Fri Jan 21, 2011 4:32 pm
by J0kerz
Hi there,

I am trying to extract the ROOT DOMAIN from a list of URLs.

I am having trouble with the regular expression to achieve this.

Here is my code so far. Could somebody please correct my regular expression?

Code: Select all

$array = array('http://www.example.com/',
'http://www.example.com/a/a.com',
'http://www.example.com/a.com',
'http://example.com/',
'http://example.com/a/a.com',
'http://example.com/a.com',
'https://www.example.com/',
'https://www.example.com/a/a.com',
'https://www.example.com/a.com',
'https://example.com/',
'https://example.com/a/a.com',
'https://example.com/a.com',
'http://www.a.example.com/',
'http://www.a.example.com/a/a.com',
'http://www.a.example.com/a.com',
'http://a.example.com/',
'http://a.example.com/a/a.com',
'http://a.example.com/a.com',
'https://www.a.example.com/',
'https://www.a.example.com/a/a.com',
'https://www.a.example.com/a.com',
'https://a.example.com/',
'https://a.example.com/a/a.com',
'https://a.example.com/a.com');

	//Check links from Same Root Url
	for($x = 0; $x < count($array); $x++){
	

		//Extract Root Domain
		if (preg_match('/[http|https]+:\/\/[|www\.]*(.*)\//', $array[$x], $root_url)){
		
			echo $root_url[1].'<br>';
			


		}
		
	
	}
The results should contain only:

example.com
a.example.com

Thanks for your time!

Re: Regular Expression Help

Posted: Fri Jan 21, 2011 4:40 pm
by John Cartwright
I would prefer to create a callback to extract the domains with parse_url(). Using built-in functions is just more reliable than a regex, plus I'm just not a regex guru :D

Code: Select all

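function extractDomain($url) {
   // parse_url() returns false on badly malformed input, and the
   // 'host' key is absent when the URL has no host part.
   $parts = parse_url($url);
   if ($parts && isset($parts['host'])) {
      return $parts['host'];
   }
   return false;
}

// $domainlist is assumed to be a string holding one URL per line.
$domains = array_map('extractDomain', explode(PHP_EOL, $domainlist));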
// or, if you're crazy and like one-liners

Code: Select all

// create_function() was removed in PHP 8; an anonymous function does the same job.
$domains = array_map(function ($a) { $b = parse_url($a); return ($b && isset($b['host'])) ? $b['host'] : false; }, explode(PHP_EOL, $domainlist));
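
As a rough sketch of how the parse_url() approach could be applied to URLs like the ones in the first post: the helper name hostWithoutWww is hypothetical, and treating "root domain" as "the host minus a leading www." is an assumption based on the expected output given in the question.

```php
<?php
// Hypothetical helper (not from the thread): returns the host of $url
// without a leading "www.", or false if parse_url() finds no host.
function hostWithoutWww($url) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }
    // Assumption: "root domain" means the host minus a leading "www.".
    return preg_replace('/^www\./i', '', $host);
}

// A few URLs taken from the list in the first post.
$urls = array(
    'http://www.example.com/',
    'https://example.com/a/a.com',
    'http://www.a.example.com/a.com',
    'https://a.example.com/',
);

// Map every URL to its host, drop failures, and remove duplicates.
$domains = array_values(array_unique(array_filter(array_map('hostWithoutWww', $urls))));

echo implode("\n", $domains), "\n";
```

For these four URLs the list collapses to example.com and a.example.com. Note that parse_url() does not know about public suffixes, so this only strips "www." and does not reduce a.example.com any further.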

Re: Regular Expression Help

Posted: Fri Jan 21, 2011 4:46 pm
by J0kerz
Thanks John!

I know that I can use the parse_url function, but I am looking to do it with a regular expression :)

I am currently trying to learn Regular expressions ^^

Re: Regular Expression Help

Posted: Fri Jan 21, 2011 5:24 pm
by pickle
If you want to stick with Regex, this should be in the Regex forum. Moving.

Re: Regular Expression Help

Posted: Fri Jan 21, 2011 5:35 pm
by pickle
I believe this pattern works:

Code: Select all

:([\w-]*\.\w*)/:

The wordy description of the pattern would be:
Match the first string that starts with any number of alphanumeric characters or dashes, followed by a single period, followed by any number of alphanumeric characters, followed by a slash. I want everything matched before the slash.
One caveat: I'm assuming the only valid characters in the domain are a-z, A-Z, 0-9, underscores and dashes (and the pattern only allows dashes before the period).

Note the non-typical-but-still-completely-valid pattern delimiters.
Also note that the pattern doesn't try to make sure http(s):// is at the start. Since you know it's a URL, you can assume that, and skip over it.
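
To see what the pattern above actually captures, here is a small sketch that runs it against a few URLs from the first post. The second pattern, $variant, is not from the thread: it is a hypothetical alternative that anchors on the scheme, skips an optional "www.", and keeps everything up to the first slash, colon, question mark or hash.

```php
<?php
// The pattern from the post above, using ":" as the delimiter.
$posted = ':([\w-]*\.\w*)/:';

// Hypothetical variant (an assumption, not from the thread): require
// http:// or https://, skip an optional "www.", capture the rest of the host.
$variant = '~^https?://(?:www\.)?([^/:?#]+)~i';

$urls = array(
    'http://www.example.com/',
    'https://example.com/a/a.com',
    'http://www.a.example.com/a/a.com',
);

$results = array();
foreach ($urls as $url) {
    // preg_match() returns the leftmost match, so a ".com" that appears
    // later in the path is never reached by either pattern.
    $a = preg_match($posted, $url, $m) ? $m[1] : null;
    $b = preg_match($variant, $url, $m) ? $m[1] : null;
    $results[$url] = array($a, $b);
}

foreach ($results as $url => $pair) {
    echo $url, ' => posted: ', $pair[0], ', variant: ', $pair[1], "\n";
}
```

The posted pattern allows only one period before the slash, so it yields example.com for all three URLs, including the a.example.com one. The variant keeps the subdomain and returns a.example.com for the third URL, which is closer to the output asked for in the first post.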

Re: Regular Expression Help

Posted: Wed Jan 26, 2011 12:42 pm
by ridgerunner
This new url_valid function checks the validity of a URL and, if it is valid, returns an array of the URI components (scheme, authority, userinfo, host, port, path, query, fragment, etc.). The component you will probably be interested in is the "host" (it is "the domain": everything up to the first forward slash, if one exists).

Code: Select all

//
// function url_valid($url) {
//
// Return associative array of valid URI components, or FALSE if $url is not
// RFC-3986 compliant. If the passed URL begins with: "www." or "ftp.", then
// "http://" or "ftp://" is prepended and the corrected full-url is stored in
// the return array with a key name "url". This value should be used by the caller.
//
// Return value: FALSE if $url is not valid, otherwise array of URI components:
// e.g.
// Given: "http://www.jmrware.com:80/articles?height=10&width=75#fragone"
// Array(
//    [scheme] => http
//    [authority] => www.jmrware.com:80
//    [userinfo] => 
//    [host] => www.jmrware.com
//    [IP_literal] => 
//    [IPV6address] => 
//    [ls32] => 
//    [IPvFuture] => 
//    [IPv4address] => 
//    [regname] => www.jmrware.com
//    [port] => 80
//    [path_abempty] => /articles
//    [query] => height=10&width=75
//    [fragment] => fragone
//    [url] => http://www.jmrware.com:80/articles?height=10&width=75#fragone
// )
function url_valid($url) {
	if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
	if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
	if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
        ^
        (?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
        (?P<authority>
          (?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)@)?
          (?P<host>
            (?P<IP_literal>
              \[
              (?:
                (?P<IPV6address>
                  (?:                                                (?:[0-9A-Fa-f]{1,4}:){6}
                  |                                                ::(?:[0-9A-Fa-f]{1,4}:){5}
                  | (?:                          [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}:
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
                  )
                  (?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
                  | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
                  )
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?:: 
                )
              | (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
              )
              \]
            )
          | (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                               (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
          | (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
          )
          (?::(?P<port>[0-9]*))?
        )
        (?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)
        (?:\?(?P<query>       (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        (?:\#(?P<fragment>    (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        $
		/mx', $url, $m)) return FALSE;
	switch ($m['scheme']) {
	case 'https':
	case 'http':
		if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
		break;
	case 'ftps':
	case 'ftp':
		break;
	default:
		return FALSE;	// Unrecognised URI scheme. Default to FALSE.
	}
	// Validate host name conforms to DNS "dot-separated-parts".
	if ($m['regname']) { // If host regname specified, check for DNS conformance. (Curly-brace offsets like $m{...} no longer work in PHP 8.)
		if (!preg_match('/^(?!.{256})(?:[0-9A-Za-z]\.|[0-9A-Za-z][-0-9A-Za-z]{0,61}[0-9A-Za-z]\.)+(?:com|edu|gov|int|mil|net|org|biz|info|name|pro|aero|coop|museum|arpa|asia|cat|jobs|mobi|tel|travel|[A-Za-z]{2})$/im', $m['host'])) return FALSE;
	}
	$m['url'] = $url;
	for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
	return $m; // return TRUE == array of useful named $matches plus the valid $url.
}
This function is the result of an article I'm writing on the subject: Regular Expression URI Validation
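
The DNS "dot-separated-parts" check inside url_valid can also be tried on its own. The sketch below copies that one pattern unchanged and wraps it in a hypothetical helper, isDnsHost (the name is an assumption, not part of the function above), to show which host names it accepts.

```php
<?php
// Pattern copied from the DNS conformance check in url_valid() above.
$dnsPattern = '/^(?!.{256})(?:[0-9A-Za-z]\.|[0-9A-Za-z][-0-9A-Za-z]{0,61}[0-9A-Za-z]\.)+(?:com|edu|gov|int|mil|net|org|biz|info|name|pro|aero|coop|museum|arpa|asia|cat|jobs|mobi|tel|travel|[A-Za-z]{2})$/im';

// Hypothetical helper: true when $host passes the DNS check.
function isDnsHost($host, $pattern) {
    return preg_match($pattern, $host) === 1;
}

$hosts = array(
    'www.jmrware.com',    // ordinary host name
    'a.example.com',      // single-character label is allowed
    'example.co.uk',      // two-letter country code at the end
    'localhost',          // no dot-separated labels, no listed TLD
    '-bad.example.com',   // a label may not start with a dash
    'example.technology', // TLD that is not in the hard-coded list
);

$checked = array();
foreach ($hosts as $host) {
    $checked[$host] = isDnsHost($host, $dnsPattern);
    echo str_pad($host, 20), $checked[$host] ? 'valid' : 'rejected', "\n";
}
```

The first three hosts pass and the last three are rejected. The last case shows a limitation worth knowing about: the top-level domain list is hard-coded, so any longer TLD that is not in the list fails the check.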