Regular Expression Help

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

J0kerz
Forum Commoner
Posts: 37
Joined: Fri May 29, 2009 2:51 pm

Regular Expression Help

Post by J0kerz »

Hi there,

I am trying to extract the ROOT DOMAIN from a list of URLs.

I am having trouble with the regular expression to achieve this.

Here is my code so far. Could somebody please correct my regular expression?

Code: Select all

$array = array('http://www.example.com/',
'http://www.example.com/a/a.com',
'http://www.example.com/a.com',
'http://example.com/',
'http://example.com/a/a.com',
'http://example.com/a.com',
'https://www.example.com/',
'https://www.example.com/a/a.com',
'https://www.example.com/a.com',
'https://example.com/',
'https://example.com/a/a.com',
'https://example.com/a.com',
'http://www.a.example.com/',
'http://www.a.example.com/a/a.com',
'http://www.a.example.com/a.com',
'http://a.example.com/',
'http://a.example.com/a/a.com',
'http://a.example.com/a.com',
'https://www.a.example.com/',
'https://www.a.example.com/a/a.com',
'https://www.a.example.com/a.com',
'https://a.example.com/',
'https://a.example.com/a/a.com',
'https://a.example.com/a.com');

	//Check links from Same Root Url
	for($x = 0; $x < count($array); $x++){
	

		//Extract Root Domain
		if (preg_match('/[http|https]+:\/\/[|www\.]*(.*)\//', $array[$x], $root_url)){
		
			echo $root_url[1].'<br>';
			


		}
		
	
	}
The results should contain only:

example.com
a.example.com

Thanks for your time!
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Re: Regular Expression Help

Post by John Cartwright »

I would prefer to create a callback to extract the domains with parse_url(). Using built-in functions is just more reliable than a regex, plus I'm just not a regex guru :D

Code: Select all

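function extractDomain($url) {
   $parts = parse_url($url);
   // parse_url() returns false for seriously malformed URLs, and the
   // 'host' key is only present when the URL actually contains a host
   if ($parts !== false && isset($parts['host'])) {
      return $parts['host'];
   }
   return false;
}

// $domainlist is assumed to be a string with one URL per line
$domains = array_map('extractDomain', explode(PHP_EOL, $domainlist));
//or if you're crazy and like one-liners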

Code: Select all

// create_function() was removed in PHP 8.0; an anonymous function does the same job
$domains = array_map(function ($a) { $b = parse_url($a); return isset($b['host']) ? $b['host'] : false; }, explode(PHP_EOL, $domainlist));
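A minimal sketch of the parse_url() approach run against a few of the sample URLs from the first post. The helper name hostWithoutWww and the www-stripping step are assumptions added for illustration, since parse_url() on its own returns the host with the leading "www." still attached.

```php
<?php
// A shortened version of the sample list from the first post.
$urls = array(
    'http://www.example.com/a/a.com',
    'https://example.com/',
    'http://www.a.example.com/a.com',
    'https://a.example.com/a/a.com',
);

// Hypothetical helper: return the host without a leading "www.", or false.
function hostWithoutWww($url) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // no host component could be found
    }
    // Strip one leading "www." (case-insensitive), without using a regex.
    if (stripos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return $host;
}

// Map every URL, drop the failures, de-duplicate, and reindex.
$domains = array_values(array_unique(array_filter(array_map('hostWithoutWww', $urls))));

print_r($domains); // example.com, a.example.com
```

Passing PHP_URL_HOST as the second argument makes parse_url() return only the host string, which avoids indexing into the result array.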
J0kerz
Forum Commoner
Posts: 37
Joined: Fri May 29, 2009 2:51 pm

Re: Regular Expression Help

Post by J0kerz »

Thanks John!

I know that I can use the parse_url function, but I am looking to do it with a regular expression :)

I am currently trying to learn regular expressions ^^
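For anyone learning along: the pattern in the first post goes wrong mainly because square brackets build a character class, not an alternation. [http|https]+ matches any run of the characters h, t, p, s and |, and [|www\.]* strips any leading w, dot or pipe rather than the literal prefix "www.". The greedy (.*)\/ then runs to the last slash instead of the first. A minimal sketch of the difference follows. The rewritten pattern is only one possible fix, not the only one, and the variable names are made up for illustration.

```php
<?php
$url = 'http://www.example.com/a/a.com';

// Pattern from the first post: both bracketed parts are character classes,
// and the greedy (.*) backtracks only as far as the LAST slash.
preg_match('/[http|https]+:\/\/[|www\.]*(.*)\//', $url, $m);
$original = $m[1];

// One possible rewrite (an assumption, not the only fix):
//   https?       literal "http" with an optional "s"
//   (?:www\.)?   optional literal "www." prefix
//   ([^/:]+)     everything up to the first slash or port colon
preg_match('~^https?://(?:www\.)?([^/:]+)~i', $url, $m);
$rewritten = $m[1];

echo $original, "\n";  // example.com/a
echo $rewritten, "\n"; // example.com
```

The character class also misbehaves on hosts that merely start with "w": for http://web.com/ the original pattern captures eb.com.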
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Re: Regular Expression Help

Post by pickle »

If you want to stick with Regex, this should be in the Regex forum. Moving.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Re: Regular Expression Help

Post by pickle »

I believe this pattern works: :([\w-]*\.\w*)/:

The wordy description of the pattern would be:
Match the first string that starts with any number of word characters (letters, digits, underscores) or dashes, followed by a single period, followed by any number of word characters, followed by a slash. The capture group holds everything matched before the slash.
One caveat: I'm assuming the only valid characters in a Top Level Domain are a-z, A-Z, 0-9, underscores and dashes.

Note the non-typical-but-still-completely-valid pattern delimiters.
Also note that the pattern doesn't try to make sure http(s):// is at the start. Since you know it's a URL, you can assume that, and skip over it.
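A quick way to check the pattern is to run it over a few of the sample URLs from the first post. This is only a sketch, and the $found array is a name introduced for illustration. As far as I can tell, the pattern as written captures only the last two labels before the first slash, so a.example.com comes back as example.com.

```php
<?php
$urls = array(
    'http://example.com/a/a.com',
    'https://www.example.com/',
    'http://www.a.example.com/a.com',
);

$found = array();
foreach ($urls as $url) {
    // The colons are the pattern delimiters; group 1 is the part before the slash.
    if (preg_match(':([\w-]*\.\w*)/:', $url, $m)) {
        $found[$url] = $m[1];
    }
}

print_r($found); // every entry is "example.com", including the a.example.com URL
```

The pattern also needs a slash straight after the host, so a URL such as http://example.com with no trailing slash would not match at all.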
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Regular Expression Help

Post by ridgerunner »

This new url_valid function checks the validity of a URL and, if it is valid, returns an array of the URI components (scheme, authority, userinfo, host, port, path, query, fragment, etc.). The component you will probably be interested in is the "host" (it is "the domain": everything up to the first forward slash, if one exists).

Code: Select all

//
// function url_valid($url) {
//
// Return associative array of valid URI components, or FALSE if $url is not
// RFC-3986 compliant. If the passed URL begins with: "www." or "ftp.", then
// "http://" or "ftp://" is prepended and the corrected full-url is stored in
// the return array with a key name "url". This value should be used by the caller.
//
// Return value: FALSE if $url is not valid, otherwise array of URI components:
// e.g.
// Given: "http://www.jmrware.com:80/articles?height=10&width=75#fragone"
// Array(
//    [scheme] => http
//    [authority] => www.jmrware.com:80
//    [userinfo] => 
//    [host] => www.jmrware.com
//    [IP_literal] => 
//    [IPV6address] => 
//    [ls32] => 
//    [IPvFuture] => 
//    [IPv4address] => 
//    [regname] => www.jmrware.com
//    [port] => 80
//    [path_abempty] => /articles
//    [query] => height=10&width=75
//    [fragment] => fragone
//    [url] => http://www.jmrware.com:80/articles?height=10&width=75#fragone
// )
function url_valid($url) {
	if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
	if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
	if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
        ^
        (?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
        (?P<authority>
          (?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)@)?
          (?P<host>
            (?P<IP_literal>
              \[
              (?:
                (?P<IPV6address>
                  (?:                                                (?:[0-9A-Fa-f]{1,4}:){6}
                  |                                                ::(?:[0-9A-Fa-f]{1,4}:){5}
                  | (?:                          [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}:
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
                  )
                  (?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
                  | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
                  )
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?:: 
                )
              | (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
              )
              \]
            )
          | (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                               (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
          | (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
          )
          (?::(?P<port>[0-9]*))?
        )
        (?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)
        (?:\?(?P<query>       (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        (?:\#(?P<fragment>    (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        $
		/mx', $url, $m)) return FALSE;
	switch ($m['scheme']) {
	case 'https':
	case 'http':
		if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
		break;
	case 'ftps':
	case 'ftp':
		break;
	default:
		return FALSE;	// Unrecognised URI scheme. Default to FALSE.
	}
	// Validate host name conforms to DNS "dot-separated-parts".
	if ($m['regname']) { // If host regname specified, check for DNS conformance.
		if (!preg_match('/^(?!.{256})(?:[0-9A-Za-z]\.|[0-9A-Za-z][-0-9A-Za-z]{0,61}[0-9A-Za-z]\.)+(?:com|edu|gov|int|mil|net|org|biz|info|name|pro|aero|coop|museum|arpa|asia|cat|jobs|mobi|tel|travel|[A-Za-z]{2})$/im', $m['host'])) return FALSE;
	}
	$m['url'] = $url;
	for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
	return $m; // return TRUE == array of useful named $matches plus the valid $url.
}
This function is the result of an article I'm writing on the subject: Regular Expression URI Validation.
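The function leans on two PCRE features: named groups, written (?P<name>...), and the x flag, which allows whitespace and # comments inside the pattern. A much-reduced sketch of the same technique follows. It is not the RFC 3986 grammar and does no validation; the pattern and the $parts variable are assumptions made up for illustration.

```php
<?php
// Reduced sketch: captures only scheme, host, port and path.
$pattern = '~
    ^
    (?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*)://   # scheme, then the separator
    (?P<host>[^/:?\#]+)                       # host: stops at a colon, slash, ? or hash
    (?::(?P<port>[0-9]*))?                    # optional port
    (?P<path>/[^?\#]*)?                       # optional path
~x';

if (preg_match($pattern, 'http://www.jmrware.com:80/articles?height=10', $m)) {
    // Keep only the named (string) keys and drop the numeric duplicates.
    $parts = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
    print_r($parts); // scheme, host, port and path
}
```

Filtering on string keys does the same job as the unset() loop at the end of url_valid: preg_match() stores every named group under its name and under its number.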