Extract Domain Name & Top Level Name from URL(s)
Posted: Thu Mar 01, 2007 4:20 pm
I am trying to figure out the best way to extract just (domain.tld(.tld?)(.tld?)) from any URL string, without including the sub-domain. The problem is that there are valid two- and three-part (dotted) TLDs, such as co.uk or wattle.id.au. What would be the best way to approach this? I have written 4 or 5 different functions, and they seem to work, but I am hoping there is a better way.
// $domain is the ['host'] element returned by parse_url()
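To illustrate that note, here is a minimal sketch of pulling the host out of a URL string with parse_url(). The helper name url_host() and the sample URLs are made up for illustration, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not part of the original post): return the lowercased
// host part of a URL string, or '' when no host can be found.
function url_host(string $url): string
{
    // parse_url() does not report a host for a bare string such as
    // "www.example.co.uk/page", so prepend a scheme when none is present
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }

    $host = parse_url($url, PHP_URL_HOST);

    // parse_url() gives null when there is no host and false on a badly malformed URL
    return is_string($host) ? strtolower($host) : '';
}

echo url_host('http://www.Example.co.uk/path?x=1'), "\n"; // www.example.co.uk
echo url_host('sub.example.com.au/page'), "\n";           // sub.example.com.au
```

The returned string is what would be passed in as $domain.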
TIA
pif
list.txt (should all be on a single line)
Code:
|ac.cn|ac.jp|ac.uk|ad.jp|adm.br|adv.br|agr.br|ah.cn|am.br|arq.br|
art.br|asn.au|ato.br|av.tr|bel.tr|bio.br|biz.tr|bj.cn|bmd.br|cim.br|cng.br|
cnt.br|co.at|co.jp|co.uk|com.au|com.br|com.cn|com.eg|com.hk|com.mx|
com.ru|com.tr|com.tw|conf.au|cq.cn|csiro.au|dr.tr|ecn.br|edu.au|edu.br|
edu.tr|emu.id.au|eng.br|esp.br|etc.br|eti.br|eun.eg|far.br|fj.cn|fm.br|
fnd.br|fot.br|fst.br|g12.br|gb.com|gb.net|gd.cn|gen.tr|ggf.br|gob.mx|
gov.au|gov.br|gov.cn|gov.hk|gov.tr|gr.jp|gs.cn|gx.cn|gz.cn|ha.cn|
hb.cn|he.cn|hi.cn|hk.cn|hl.cn|hn.cn|id.au|idv.tw|imb.br|ind.br|inf.br|
info.au|info.tr|jl.cn|jor.br|js.cn|jx.cn|k12.tr|lel.br|ln.cn|ltd.uk|mat.br|
me.uk|med.br|mil.br|mil.tr|mo.cn|mus.br|name.tr|ne.jp|net.au|net.br|
net.cn|net.eg|net.hk|net.lu|net.mx|net.ru|net.tr|net.tw|net.uk|nm.cn|
no.com|nom.br|not.br|ntr.br|nx.cn|odo.br|oop.br|or.at|or.jp|org.au|
org.br|org.cn|org.hk|org.lu|org.ru|org.tr|org.tw|org.uk|plc.uk|pol.tr|
pp.ru|ppg.br|pro.br|psc.br|psi.br|qh.cn|qsl.br|rec.br|sc.cn|sd.cn|se.com|
se.net|sh.cn|slg.br|sn.cn|srv.br|sx.cn|tel.tr|tj.cn|tmp.br|trd.br|tur.br|
tv.br|tw.cn|uk.com|uk.net|vet.br|wattle.id.au|web.tr|xj.cn|xz.cn|
yn.cn|zj.cn|zlg.br|co.nr|co.nz|
Code:
function if_domain ( $domain )
{
    // if single dot or no dot return it
    if ( ( $next = substr_count ( $domain, '.' ) ) <= 1 )
    {
        // return localhost, system_name or domain_name.top_level_name
        return $domain;
    }

    // split the domain into parts
    $name = explode ( '.', $domain );

    // last part
    $test = $name[$next];

    // go to the next last part
    $next -= 1;

    // merge the last two parts
    $test = $name[$next] . '.' . $test;

    // get the list of silly top_level_names
    $list = file_get_contents ( 'list.txt' );

    // found a match (part_A_top_level_name.part_B_top_level_name)
    // compare with false: strpos() returns 0 for the first entry in the list, and 0 is falsy
    if ( strpos ( $list, '|' . $test . '|' ) !== false )
    {
        // get the next last part
        $next -= 1;

        // merge the last three parts
        $last = $name[$next] . '.' . $test;

        // found a match (part_A_top_level_name.part_B_top_level_name.part_C_top_level_name)
        // and there is still a part left in front of it to use as the domain name
        if ( $next > 0 && strpos ( $list, '|' . $last . '|' ) !== false )
        {
            $next -= 1;

            // return domain_name.part_A_top_level_name.part_B_top_level_name.part_C_top_level_name
            return $name[$next] . '.' . $last;
        }

        // return domain_name.part_A_top_level_name.part_B_top_level_name
        return $last;
    }

    // return domain_name.top_level_name
    return $test;
}
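For comparison, here is a minimal sketch of an array-based variant of the same idea: load the pipe-delimited list into a lookup table once, then test the longest candidate suffix first. The function names load_suffixes() and registrable_domain() and the short sample list are assumptions made for illustration, and the type declarations assume PHP 7 or later. It is one possible approach, not a definitive answer.

```php
<?php
// Hypothetical helper: turn the pipe-delimited list into a lookup table
// keyed by suffix, ignoring line breaks and empty pieces.
function load_suffixes(string $raw): array
{
    $parts = preg_split('/[|\s]+/', strtolower($raw), -1, PREG_SPLIT_NO_EMPTY);
    return array_flip($parts);
}

// Hypothetical variant of if_domain(): find the longest listed suffix that
// still leaves a label in front of it, and keep exactly one label before it.
function registrable_domain(string $host, array $suffixes): string
{
    $labels = explode('.', strtolower(rtrim($host, '.')));
    $count  = count($labels);

    // try three-label suffixes first, then two-label ones
    for ($len = min(3, $count - 1); $len >= 2; $len--) {
        $suffix = implode('.', array_slice($labels, -$len));
        if (isset($suffixes[$suffix])) {
            return implode('.', array_slice($labels, -($len + 1)));
        }
    }

    // no listed suffix: fall back to the last two labels (or the whole host)
    return implode('.', array_slice($labels, -2));
}

// sample list only; in practice this would be file_get_contents('list.txt')
$suffixes = load_suffixes("|co.uk|com.au|id.au|\nwattle.id.au|");

echo registrable_domain('www.example.co.uk', $suffixes), "\n";     // example.co.uk
echo registrable_domain('shop.foo.wattle.id.au', $suffixes), "\n"; // foo.wattle.id.au
echo registrable_domain('a.b.example.com', $suffixes), "\n";       // example.com
```

Using isset() on array keys avoids the strpos() position-0 pitfall and means the list is parsed only once, however many hosts are checked.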