phpBB auto links issue

JayBird
Admin
Posts: 4524
Joined: Wed Aug 13, 2003 7:02 am
Location: York, UK

Post by JayBird »

This is the regex that phpBB uses to automatically make links clickable (modified slightly by me to change the output):

Code:

$ret = preg_replace("#(^|[\n ])([\w]+?://[^ \"\n\r\t<]*)#is", "\\1<a href=\"/out.php?employerID=$employerID&redirectURL=http://\\2\" target=\"_blank\">\\2</a>", $ret);
The problem occurs if there is a comma after the URL, like this:

This is http://www.domain.co.uk, the comma gets included in the link.

Same for full stops, http://www.domain.co.uk.

How can I change the regex above to avoid this?
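To make the symptom concrete, here is a minimal sketch that runs the same pattern over the example sentence. The function name `autolink_original` is hypothetical, and the replacement is simplified to a plain anchor (no `out.php` redirect) so the captured text is easy to see. The negated class `[^ "\n\r\t<]*` only stops at whitespace, a quote or `<`, so trailing punctuation is swallowed.

```php
<?php
// Hypothetical helper wrapping the pattern quoted above. The replacement is
// simplified for illustration; it is not the output format used on the site.
function autolink_original(string $text): string
{
    return preg_replace(
        '#(^|[\n ])([\w]+?://[^ "\n\r\t<]*)#is',
        '\\1<a href="\\2">\\2</a>',
        $text
    );
}

// The comma ends up inside both the href and the link text.
echo autolink_original('This is http://www.domain.co.uk, the comma is linked.'), PHP_EOL;
```

Running this shows the comma inside the generated `<a>` element, which is the behaviour described above.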
aaronhall
DevNet Resident
Posts: 1040
Joined: Tue Aug 13, 2002 5:10 pm
Location: Back in Phoenix, missing the microbrews

Post by aaronhall »

This works. The catch is that a valid URL could legitimately end in a period, an unescaped comma, or an exclamation point (which I added), but the rule should always favour the most likely case anyway :)

Code:

$ret = preg_replace("#(^|[\n ])([\w]+?://.*)([\.\!,]?[ \"\n\r\t<]+)#Uis", "\\1<a href=\"/out.php?employerID=$employerID&redirectURL=\\2\" target=\"_blank\">\\2</a>\\3", $ret);
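One thing to keep in mind with the pattern above: its third group needs at least one space, quote, newline, tab or `<` after the URL, so a URL sitting at the very end of the text is not linked. A possible variation, offered only as a sketch, is to move that check into a lookahead that also accepts the end of the string. The function name `autolink_lookahead` is hypothetical and the replacement is again simplified to a plain anchor.

```php
<?php
// Sketch under assumptions: same delimiters as the thread's pattern, with a
// lazy URL body followed by a lookahead for optional trailing punctuation and
// then either a delimiter or the end of the text. No HTML escaping is done
// here, just as in the patterns quoted in the thread.
function autolink_lookahead(string $text): string
{
    $pattern = '#(^|[\n ])(\w+://[^ "\n\r\t<]*?)(?=[.!,]*(?:[ "\n\r\t<]|$))#i';
    return preg_replace($pattern, '\\1<a href="\\2">\\2</a>', $text);
}

// One URL followed by a comma, one followed by a full stop at end of input.
echo autolink_lookahead('See http://www.domain.co.uk, or http://www.domain.co.uk.'), PHP_EOL;
```

Because the lookahead consumes nothing, the punctuation stays in the surrounding text and no third backreference is needed. Commas inside a URL, as in `http://a.example/x,y`, are still kept, since only punctuation directly before a delimiter is left out.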
JayBird
Admin
Posts: 4524
Joined: Wed Aug 13, 2003 7:02 am
Location: York, UK

Post by JayBird »

Thanks, that looks like it has done the trick.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Here's one that's a bit more robust. I tried to write an approximation of the values it may need to process...

Code:

<?php
	
	$octet = '(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)';
	$ipv4 = '(?:' . $octet . '\.' . $octet . '\.' . $octet . '\.' . $octet . ')';
	$basicdomain = '(?:(?:[a-z0-9_\-]+\.)*[a-z0-9_\-]+\.[a-z]+)';
	$unc = '(?:(?:(?:%\d\d)+|[a-z0-9+_.-]+)+)';
	$domain = '(?:' . $basicdomain . '|' . $ipv4 . ')';
	$protocol = '(?:[a-z][a-z-]*(?<!script))';
	$front = '(?:' . $protocol . '://)';
	$server = '(?:' . $front . $unc . '|' . $front . '?' . $domain . ')';
	$port = '(?::\d+)';
	$name = '(?:(?:(?:%\d\d?)+|[a-z0-9+_.\(\)-]+)*)';
	$value = '(?:(?:=' . $name . ')*)';
	$qsep = '(?:&(?:amp;)?)';
	$namevalue = '(?:' . $qsep . '*' . $name . $value . ')';
	$query = '(?:\?' . $namevalue . '+)';
	$hash = '(?:#' . $namevalue . '+)';
	$path = '(?:(?:/' . $name . $value . ')*)';
	
	function replaceit($match)
	{
		$result = '';
		$match[0] = preg_replace('@&(?!amp;)@i', '&amp;', $match[0]);
		var_dump($match);
		return '~~~' . $match[0] . '~~~';
	}
	
	$test = 'should.match The problem occures if there is a comma after the URL, like this
	
	This is http://www.domain.co.uk, the comma gets included in the link.
	
	Same for full stops, 
	a://c
	a.b
	a..b (no match, nothing between a dot group)
	a..partial.match
	900.g
	a://b.c
	a://b.c:0
	b.c:6574
	0.0.0.0,
	255.225.215.205.
	215.245.256.128 (no match, out of range)
	128.012.12.12 (no match, leading zero)
	-partial://match
	partial.match.9
	a-b://c.d
	a-b://c
	_.b
	_._.c
	__.b
	-.b
	-.-.c
	a://_.c
	a://_._.d
	a://b.c:no-match
	a://-.c
	a://b.c/
	a://b.c/d
	a.b/c
	a://b/c
	no/match
	a://b/c/d
	a://b/c/d/
	a://b/c/=
	a://b/c/=/
	a://b/c/=/d
	a://b/c/=/d/
	a://b/c/%2
	a://b/c/%2/
	a://b/c/%2/d
	a://b/c/%2/d/
	a://b/c/%20
	a://b/c/%20/
	a://b/c/%20/d
	a://b/c/%20/d/	
	a://b/=
	a://b/=/
	a://b/=/c
	a://b/=/c/
	a://b/+
	a://b/+/
	a://b/+/c
	a://b/+/c/
	a://b/. (dot excluded)
	a://b/./
	a://b/./c
	a://b/./c/
	a://b/.hidden
	a://b/.hidden/
	a://b/(
	a://b/(/
	a://b/(.)
	a://b//
	a://b//c
	a://b//c/
	a://b//c//d
	a://b/c?d? (partial match, no secondary question-mark.)
	a://b/c?d?e? (partial match, no secondary question-marks.)
	a://b/c?d=e
	a://b/c?d=e&f
	a://b/c?d&e
	a://b/c?d=e&amp;f
	a://b/c?d&amp;e
	a://b?c
	a://b?c=d
	a://b?c&d
	a://b?c&amp;d
	a://b?c=d&e
	a://b?c=d&e=f
	a://b?c=d&e=f&g
	a://b?c#d
	http://www.domain.co.uk?foo.
	
	How can i change the regex above to avoid this? matches.too';
	
	$pattern = '@(?:^|(?!\s))' . $server . $port . '?' . $path . '?' . $query . '?' . $hash . '?(?<![,.])@is';
	
	var_export($pattern);
	
	echo PHP_EOL;
	
	$out = preg_replace_callback($pattern, 'replaceit', $test);
	
	echo PHP_EOL;
	
	var_export($out);
	
	echo PHP_EOL;

Code:

feyd:~ feyd$ php -f regex.php 
'@(?:^|(?!\\s))(?:(?:(?:[a-z][a-z-]*(?<!script))://)(?:(?:(?:%\\d\\d)+|[a-z0-9+_.-]+)+)|(?:(?:[a-z][a-z-]*(?<!script))://)?(?:(?:(?:[a-z0-9_\\-]+\\.)*[a-z0-9_\\-]+\\.[a-z]+)|(?:(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\.(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\.(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\.(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d))))(?::\\d+)?(?:(?:/(?:(?:(?:%\\d\\d?)+|[a-z0-9+_.\\(\\)-]+)*)(?:(?:=(?:(?:(?:%\\d\\d?)+|[a-z0-9+_.\\(\\)-]+)*))*))*)?(?:\\?(?:(?:&(?:amp;)?)*(?:(?:(?:%\\d\\d?)+|[a-z0-9+_.\\(\\)-]+)*)(?:(?:=(?:(?:(?:%\\d\\d?)+|[a-z0-9+_.\\(\\)-]+)*))*))+)?(?:#(?:(?:&(?:amp;)?)*(?:(?:(?:%\\d\\d?)+|[a-z0-9+_.\\(\\)-]+)*)(?:(?:=(?:(?:(?:%\\d\\d?)+|[a-z0-9+_.\\(\\)-]+)*))*))+)?(?<![,.])@is'
array(1) {
  [0]=>
  string(12) "should.match"
}
array(1) {
  [0]=>
  string(16) "www.domain.co.uk"
}
array(1) {
  [0]=>
  string(5) "a://c"
}
array(1) {
  [0]=>
  string(3) "a.b"
}
array(1) {
  [0]=>
  string(13) "partial.match"
}
array(1) {
  [0]=>
  string(5) "900.g"
}
array(1) {
  [0]=>
  string(7) "a://b.c"
}
array(1) {
  [0]=>
  string(9) "a://b.c:0"
}
array(1) {
  [0]=>
  string(8) "b.c:6574"
}
array(1) {
  [0]=>
  string(7) "0.0.0.0"
}
array(1) {
  [0]=>
  string(15) "255.225.215.205"
}
array(1) {
  [0]=>
  string(15) "partial://match"
}
array(1) {
  [0]=>
  string(13) "partial.match"
}
array(1) {
  [0]=>
  string(9) "a-b://c.d"
}
array(1) {
  [0]=>
  string(7) "a-b://c"
}
array(1) {
  [0]=>
  string(3) "_.b"
}
array(1) {
  [0]=>
  string(5) "_._.c"
}
array(1) {
  [0]=>
  string(4) "__.b"
}
array(1) {
  [0]=>
  string(3) "-.b"
}
array(1) {
  [0]=>
  string(5) "-.-.c"
}
array(1) {
  [0]=>
  string(7) "a://_.c"
}
array(1) {
  [0]=>
  string(9) "a://_._.d"
}
array(1) {
  [0]=>
  string(7) "a://b.c"
}
array(1) {
  [0]=>
  string(7) "a://-.c"
}
array(1) {
  [0]=>
  string(8) "a://b.c/"
}
array(1) {
  [0]=>
  string(9) "a://b.c/d"
}
array(1) {
  [0]=>
  string(5) "a.b/c"
}
array(1) {
  [0]=>
  string(7) "a://b/c"
}
array(1) {
  [0]=>
  string(9) "a://b/c/d"
}
array(1) {
  [0]=>
  string(10) "a://b/c/d/"
}
array(1) {
  [0]=>
  string(9) "a://b/c/="
}
array(1) {
  [0]=>
  string(10) "a://b/c/=/"
}
array(1) {
  [0]=>
  string(11) "a://b/c/=/d"
}
array(1) {
  [0]=>
  string(12) "a://b/c/=/d/"
}
array(1) {
  [0]=>
  string(10) "a://b/c/%2"
}
array(1) {
  [0]=>
  string(11) "a://b/c/%2/"
}
array(1) {
  [0]=>
  string(12) "a://b/c/%2/d"
}
array(1) {
  [0]=>
  string(13) "a://b/c/%2/d/"
}
array(1) {
  [0]=>
  string(11) "a://b/c/%20"
}
array(1) {
  [0]=>
  string(12) "a://b/c/%20/"
}
array(1) {
  [0]=>
  string(13) "a://b/c/%20/d"
}
array(1) {
  [0]=>
  string(14) "a://b/c/%20/d/"
}
array(1) {
  [0]=>
  string(7) "a://b/="
}
array(1) {
  [0]=>
  string(8) "a://b/=/"
}
array(1) {
  [0]=>
  string(9) "a://b/=/c"
}
array(1) {
  [0]=>
  string(10) "a://b/=/c/"
}
array(1) {
  [0]=>
  string(7) "a://b/+"
}
array(1) {
  [0]=>
  string(8) "a://b/+/"
}
array(1) {
  [0]=>
  string(9) "a://b/+/c"
}
array(1) {
  [0]=>
  string(10) "a://b/+/c/"
}
array(1) {
  [0]=>
  string(6) "a://b/"
}
array(1) {
  [0]=>
  string(8) "a://b/./"
}
array(1) {
  [0]=>
  string(9) "a://b/./c"
}
array(1) {
  [0]=>
  string(10) "a://b/./c/"
}
array(1) {
  [0]=>
  string(13) "a://b/.hidden"
}
array(1) {
  [0]=>
  string(14) "a://b/.hidden/"
}
array(1) {
  [0]=>
  string(7) "a://b/("
}
array(1) {
  [0]=>
  string(8) "a://b/(/"
}
array(1) {
  [0]=>
  string(9) "a://b/(.)"
}
array(1) {
  [0]=>
  string(7) "a://b//"
}
array(1) {
  [0]=>
  string(8) "a://b//c"
}
array(1) {
  [0]=>
  string(9) "a://b//c/"
}
array(1) {
  [0]=>
  string(11) "a://b//c//d"
}
array(1) {
  [0]=>
  string(9) "a://b/c?d"
}
array(1) {
  [0]=>
  string(9) "a://b/c?d"
}
array(1) {
  [0]=>
  string(11) "a://b/c?d=e"
}
array(1) {
  [0]=>
  string(17) "a://b/c?d=e&amp;f"
}
array(1) {
  [0]=>
  string(15) "a://b/c?d&amp;e"
}
array(1) {
  [0]=>
  string(17) "a://b/c?d=e&amp;f"
}
array(1) {
  [0]=>
  string(15) "a://b/c?d&amp;e"
}
array(1) {
  [0]=>
  string(7) "a://b?c"
}
array(1) {
  [0]=>
  string(9) "a://b?c=d"
}
array(1) {
  [0]=>
  string(13) "a://b?c&amp;d"
}
array(1) {
  [0]=>
  string(13) "a://b?c&amp;d"
}
array(1) {
  [0]=>
  string(15) "a://b?c=d&amp;e"
}
array(1) {
  [0]=>
  string(17) "a://b?c=d&amp;e=f"
}
array(1) {
  [0]=>
  string(23) "a://b?c=d&amp;e=f&amp;g"
}
array(1) {
  [0]=>
  string(9) "a://b?c#d"
}
array(1) {
  [0]=>
  string(20) "www.domain.co.uk?foo"
}
array(1) {
  [0]=>
  string(11) "matches.too"
}

'~~~should.match~~~ The problem occures if there is a comma after the URL, like this
	
	This is ~~~www.domain.co.uk~~~, the comma gets included in the link.
	
	Same for full stops, 
	~~~a://c~~~
	~~~a.b~~~
	a..b (no match, nothing between a dot group)
	a..~~~partial.match~~~
	~~~900.g~~~
	~~~a://b.c~~~
	~~~a://b.c:0~~~
	~~~b.c:6574~~~
	~~~0.0.0.0~~~,
	~~~255.225.215.205~~~.
	215.245.256.128 (no match, out of range)
	128.012.12.12 (no match, leading zero)
	-~~~partial://match~~~
	~~~partial.match~~~.9
	~~~a-b://c.d~~~
	~~~a-b://c~~~
	~~~_.b~~~
	~~~_._.c~~~
	~~~__.b~~~
	~~~-.b~~~
	~~~-.-.c~~~
	~~~a://_.c~~~
	~~~a://_._.d~~~
	~~~a://b.c~~~:no-match
	~~~a://-.c~~~
	~~~a://b.c/~~~
	~~~a://b.c/d~~~
	~~~a.b/c~~~
	~~~a://b/c~~~
	no/match
	~~~a://b/c/d~~~
	~~~a://b/c/d/~~~
	~~~a://b/c/=~~~
	~~~a://b/c/=/~~~
	~~~a://b/c/=/d~~~
	~~~a://b/c/=/d/~~~
	~~~a://b/c/%2~~~
	~~~a://b/c/%2/~~~
	~~~a://b/c/%2/d~~~
	~~~a://b/c/%2/d/~~~
	~~~a://b/c/%20~~~
	~~~a://b/c/%20/~~~
	~~~a://b/c/%20/d~~~
	~~~a://b/c/%20/d/~~~	
	~~~a://b/=~~~
	~~~a://b/=/~~~
	~~~a://b/=/c~~~
	~~~a://b/=/c/~~~
	~~~a://b/+~~~
	~~~a://b/+/~~~
	~~~a://b/+/c~~~
	~~~a://b/+/c/~~~
	~~~a://b/~~~. (dot excluded)
	~~~a://b/./~~~
	~~~a://b/./c~~~
	~~~a://b/./c/~~~
	~~~a://b/.hidden~~~
	~~~a://b/.hidden/~~~
	~~~a://b/(~~~
	~~~a://b/(/~~~
	~~~a://b/(.)~~~
	~~~a://b//~~~
	~~~a://b//c~~~
	~~~a://b//c/~~~
	~~~a://b//c//d~~~
	~~~a://b/c?d~~~? (partial match, no secondary question-mark.)
	~~~a://b/c?d~~~?e? (partial match, no secondary question-marks.)
	~~~a://b/c?d=e~~~
	~~~a://b/c?d=e&amp;f~~~
	~~~a://b/c?d&amp;e~~~
	~~~a://b/c?d=e&amp;f~~~
	~~~a://b/c?d&amp;e~~~
	~~~a://b?c~~~
	~~~a://b?c=d~~~
	~~~a://b?c&amp;d~~~
	~~~a://b?c&amp;d~~~
	~~~a://b?c=d&amp;e~~~
	~~~a://b?c=d&amp;e=f~~~
	~~~a://b?c=d&amp;e=f&amp;g~~~
	~~~a://b?c#d~~~
	~~~www.domain.co.uk?foo~~~.
	
	How can i change the regex above to avoid this? ~~~matches.too~~~'
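The `(?<![,.])` at the end of the pattern above appears to be what keeps the trailing full stop out of matches such as `www.domain.co.uk?foo`: the name class allows `.`, so the greedy match first takes `foo.` and the lookbehind then forces it to back off by one character. A stripped-down sketch of just that mechanism follows. The short pattern here is a simplified stand-in chosen for illustration, not the full pattern from the script.

```php
<?php
// Simplified stand-in: a scheme, "://", then a greedy run of characters that
// are not whitespace, "<" or a double quote. The negative lookbehind rejects
// any match ending in "," or ".", so the engine backtracks until the last
// character is something else.
$pattern = '@\w+://[^\s<"]+(?<![,.])@i';

preg_match_all($pattern, 'See a://b/c.d, then a://b/e. Done', $found);

// Dots inside the URL survive; only trailing punctuation is dropped.
print_r($found[0]);
```

The same lookbehind can be appended to other greedy URL patterns, with the trade-off already mentioned in the thread: a URL that really does end in a full stop or comma loses its last character.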