phpBB auto links issue
Posted: Tue Nov 27, 2007 6:14 am
by JayBird
This is the regex that phpBB uses to automatically make links clickable (modified slightly by me to change the output):
Code:
$ret = preg_replace("#(^|[\n ])([\w]+?://[^ \"\n\r\t<]*)#is", "\\1<a href=\"/out.php?employerID=$employerID&redirectURL=http://\\2\" target=\"_blank\">\\2</a>", $ret);
The problem occurs if there is a comma after the URL, like this:
This is http://www.domain.co.uk, the comma gets included in the link.
The same goes for full stops:
http://www.domain.co.uk.
How can I change the regex above to avoid this?
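To see the problem in isolation, here is a minimal sketch that runs only the matching half of the regex above against the sample sentence. The redirect wrapper is left out and the sample text is just an illustration; the point is that the URL character class stops only at whitespace, a double quote, or "<", so trailing punctuation ends up inside the capture.

```php
<?php
// Matching half of the pattern from the post, written in a single-quoted
// string so the backslashes reach PCRE unchanged.
$pattern = '#(^|[\n ])(\w+?://[^ "\n\r\t<]*)#is';

// Illustrative sample text (an assumption, taken from the sentence above).
$text = 'This is http://www.domain.co.uk, the comma gets included in the link.';

preg_match($pattern, $text, $m);

// $m[2] is the captured URL; it ends with the unwanted comma.
var_dump($m[2]);
```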
Posted: Tue Nov 27, 2007 7:07 am
by aaronhall
This works, but the catch is that a valid URL could legitimately end with a period or an unescaped comma (or an exclamation point, which I added)... still, the rule should cover the most likely case anyway.
Code:
$ret = preg_replace("#(^|[\n ])([\w]+?://.*)([\.\!,]?[ \"\n\r\t<]+)#Uis", "\\1<a href=\"/out.php?employerID=$employerID&redirectURL=\\2\" target=\"_blank\">\\2</a>\\3", $ret);
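One thing to watch with the pattern above: the last group needs at least one delimiter after the URL, so a URL that ends the input would not be linked. A possible alternative, sketched below, keeps the original character class and peels the trailing punctuation off in a callback. The `linkify` helper name, the `employerID` value used in the call, and the decision to URL-encode the redirect target and HTML-escape the link text are all assumptions for illustration, not part of phpBB.

```php
<?php
// Hypothetical helper (not a phpBB function): links URLs and keeps any
// trailing ".", "," or "!" outside the generated anchor.
function linkify(string $text, int $employerID): string
{
    $result = preg_replace_callback(
        '#(^|[\n ])(\w+?://[^ "\n\r\t<]*)#is',
        function (array $m) use ($employerID): string {
            // Separate trailing punctuation from the URL proper.
            $url  = rtrim($m[2], '.,!');
            $tail = substr($m[2], strlen($url));

            // Assumption: encode the target so it survives as one query value.
            $href = '/out.php?employerID=' . $employerID
                  . '&amp;redirectURL=' . rawurlencode($url);

            return $m[1] . '<a href="' . $href . '" target="_blank">'
                 . htmlspecialchars($url) . '</a>' . $tail;
        },
        $text
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo linkify('See http://www.domain.co.uk.', 7), PHP_EOL;
```

The trade-off is the same as with the regex-only version: a URL that really does end in one of those characters loses it.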
Posted: Tue Nov 27, 2007 10:10 am
by JayBird
Thanks, that looks like it has done the trick.
Posted: Tue Nov 27, 2007 1:34 pm
by feyd
Here's one that's a bit more robust. I tried to write an approximation of the values it may need to process...
Code:
<?php
$octet = '(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)';
$ipv4 = '(?:' . $octet . '\.' . $octet . '\.' . $octet . '\.' . $octet . ')';
$basicdomain = '(?:(?:[a-z0-9_\-]+\.)*[a-z0-9_\-]+\.[a-z]+)';
$unc = '(?:(?:(?:%\d\d)+|[a-z0-9+_.-]+)+)';
$domain = '(?:' . $basicdomain . '|' . $ipv4 . ')';
$protocol = '(?:[a-z][a-z-]*(?<!script))';
$front = '(?:' . $protocol . '://)';
$server = '(?:' . $front . $unc . '|' . $front . '?' . $domain . ')';
$port = '(?::\d+)';
$name = '(?:(?:(?:%\d\d?)+|[a-z0-9+_.\(\)-]+)*)';
$value = '(?:(?:=' . $name . ')*)';
$qsep = '(?:&(?:amp;)?)';
$namevalue = '(?:' . $qsep . '*' . $name . $value . ')';
$query = '(?:\?' . $namevalue . '+)';
$hash = '(?:#' . $namevalue . '+)';
$path = '(?:(?:/' . $name . $value . ')*)';
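// Callback for preg_replace_callback(): dumps each match and wraps it in ~~~ markers.
function replaceit($match)
{
    // Encode bare ampersands as &amp; (existing &amp; entities are left alone).
    $match[0] = preg_replace('@&(?!amp;)@i', '&amp;', $match[0]);
    var_dump($match);
    return '~~~' . $match[0] . '~~~';
}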
$test = 'should.match The problem occures if there is a comma after the URL, like this
This is www.domain.co.uk, the comma gets included in the link.
Same for full stops,
a://c
a.b
a..b (no match, nothing between a dot group)
a..partial.match
900.g
a://b.c
a://b.c:0
b.c:6574
0.0.0.0,
255.225.215.205.
215.245.256.128 (no match, out of range)
128.012.12.12 (no match, leading zero)
-partial://match
partial.match.9
a-b://c.d
a-b://c
_.b
_._.c
__.b
-.b
-.-.c
a://_.c
a://_._.d
a://b.c:no-match
a://-.c
a://b.c/
a://b.c/d
a.b/c
a://b/c
no/match
a://b/c/d
a://b/c/d/
a://b/c/=
a://b/c/=/
a://b/c/=/d
a://b/c/=/d/
a://b/c/%2
a://b/c/%2/
a://b/c/%2/d
a://b/c/%2/d/
a://b/c/%20
a://b/c/%20/
a://b/c/%20/d
a://b/c/%20/d/
a://b/=
a://b/=/
a://b/=/c
a://b/=/c/
a://b/+
a://b/+/
a://b/+/c
a://b/+/c/
a://b/. (dot excluded)
a://b/./
a://b/./c
a://b/./c/
a://b/.hidden
a://b/.hidden/
a://b/(
a://b/(/
a://b/(.)
a://b//
a://b//c
a://b//c/
a://b//c//d
a://b/c?d? (partial match, no secondary question-mark.)
a://b/c?d?e? (partial match, no secondary question-marks.)
a://b/c?d=e
a://b/c?d=e&f
a://b/c?d&e
a://b/c?d=e&f
a://b/c?d&e
a://b?c
a://b?c=d
a://b?c&d
a://b?c&d
a://b?c=d&e
a://b?c=d&e=f
a://b?c=d&e=f&g
a://b?c#d
www.domain.co.uk?foo.
How can i change the regex above to avoid this? matches.too';
$pattern = '@(?:^|(?!\s))' . $server . $port . '?' . $path . '?' . $query . '?' . $hash . '?(?<![,.])@is';
var_export($pattern);
echo PHP_EOL;
$out = preg_replace_callback($pattern, 'replaceit', $test);
echo PHP_EOL;
var_export($out);
echo PHP_EOL;
Code:
feyd:~ feyd$ php -f regex.php
'@(?:^|(?!\\s))(?:(?:(?:[a-z][a-z-]*(?<!script))://)(?:(?:(?:%\\d\\d)+|[a-z0-9+_.-]+)+)|(?:(?:[a-z][a-z-]*(?<!script))://)?(?:(?:(?:[a-z0-9_\\-]+\\.)*[a-z0-9_\\-]+\\.[a-z]+)|(?:(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\.(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\.(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\.(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d))))(?::\\d+)?(?:(?:/(?:(?:(?:%\\d\\d?)+|[a-z0-9+_.\\(\\)-]+)*)(?:(?:=(?:(?:(?:%\\d\\d?)+|[a-z0-9+_.\\(\\)-]+)*))*))*)?(?:\\?(?:(?:&(?:amp;)?)*(?:(?:(?:%\\d\\d?)+|[a-z0-9+_.\\(\\)-]+)*)(?:(?:=(?:(?:(?:%\\d\\d?)+|[a-z0-9+_.\\(\\)-]+)*))*))+)?(?:#(?:(?:&(?:amp;)?)*(?:(?:(?:%\\d\\d?)+|[a-z0-9+_.\\(\\)-]+)*)(?:(?:=(?:(?:(?:%\\d\\d?)+|[a-z0-9+_.\\(\\)-]+)*))*))+)?(?<![,.])@is'
array(1) {
[0]=>
string(12) "should.match"
}
array(1) {
[0]=>
string(16) "www.domain.co.uk"
}
array(1) {
[0]=>
string(5) "a://c"
}
array(1) {
[0]=>
string(3) "a.b"
}
array(1) {
[0]=>
string(13) "partial.match"
}
array(1) {
[0]=>
string(5) "900.g"
}
array(1) {
[0]=>
string(7) "a://b.c"
}
array(1) {
[0]=>
string(9) "a://b.c:0"
}
array(1) {
[0]=>
string(8) "b.c:6574"
}
array(1) {
[0]=>
string(7) "0.0.0.0"
}
array(1) {
[0]=>
string(15) "255.225.215.205"
}
array(1) {
[0]=>
string(15) "partial://match"
}
array(1) {
[0]=>
string(13) "partial.match"
}
array(1) {
[0]=>
string(9) "a-b://c.d"
}
array(1) {
[0]=>
string(7) "a-b://c"
}
array(1) {
[0]=>
string(3) "_.b"
}
array(1) {
[0]=>
string(5) "_._.c"
}
array(1) {
[0]=>
string(4) "__.b"
}
array(1) {
[0]=>
string(3) "-.b"
}
array(1) {
[0]=>
string(5) "-.-.c"
}
array(1) {
[0]=>
string(7) "a://_.c"
}
array(1) {
[0]=>
string(9) "a://_._.d"
}
array(1) {
[0]=>
string(7) "a://b.c"
}
array(1) {
[0]=>
string(7) "a://-.c"
}
array(1) {
[0]=>
string(8) "a://b.c/"
}
array(1) {
[0]=>
string(9) "a://b.c/d"
}
array(1) {
[0]=>
string(5) "a.b/c"
}
array(1) {
[0]=>
string(7) "a://b/c"
}
array(1) {
[0]=>
string(9) "a://b/c/d"
}
array(1) {
[0]=>
string(10) "a://b/c/d/"
}
array(1) {
[0]=>
string(9) "a://b/c/="
}
array(1) {
[0]=>
string(10) "a://b/c/=/"
}
array(1) {
[0]=>
string(11) "a://b/c/=/d"
}
array(1) {
[0]=>
string(12) "a://b/c/=/d/"
}
array(1) {
[0]=>
string(10) "a://b/c/%2"
}
array(1) {
[0]=>
string(11) "a://b/c/%2/"
}
array(1) {
[0]=>
string(12) "a://b/c/%2/d"
}
array(1) {
[0]=>
string(13) "a://b/c/%2/d/"
}
array(1) {
[0]=>
string(11) "a://b/c/%20"
}
array(1) {
[0]=>
string(12) "a://b/c/%20/"
}
array(1) {
[0]=>
string(13) "a://b/c/%20/d"
}
array(1) {
[0]=>
string(14) "a://b/c/%20/d/"
}
array(1) {
[0]=>
string(7) "a://b/="
}
array(1) {
[0]=>
string(8) "a://b/=/"
}
array(1) {
[0]=>
string(9) "a://b/=/c"
}
array(1) {
[0]=>
string(10) "a://b/=/c/"
}
array(1) {
[0]=>
string(7) "a://b/+"
}
array(1) {
[0]=>
string(8) "a://b/+/"
}
array(1) {
[0]=>
string(9) "a://b/+/c"
}
array(1) {
[0]=>
string(10) "a://b/+/c/"
}
array(1) {
[0]=>
string(6) "a://b/"
}
array(1) {
[0]=>
string(8) "a://b/./"
}
array(1) {
[0]=>
string(9) "a://b/./c"
}
array(1) {
[0]=>
string(10) "a://b/./c/"
}
array(1) {
[0]=>
string(13) "a://b/.hidden"
}
array(1) {
[0]=>
string(14) "a://b/.hidden/"
}
array(1) {
[0]=>
string(7) "a://b/("
}
array(1) {
[0]=>
string(8) "a://b/(/"
}
array(1) {
[0]=>
string(9) "a://b/(.)"
}
array(1) {
[0]=>
string(7) "a://b//"
}
array(1) {
[0]=>
string(8) "a://b//c"
}
array(1) {
[0]=>
string(9) "a://b//c/"
}
array(1) {
[0]=>
string(11) "a://b//c//d"
}
array(1) {
[0]=>
string(9) "a://b/c?d"
}
array(1) {
[0]=>
string(9) "a://b/c?d"
}
array(1) {
[0]=>
string(11) "a://b/c?d=e"
}
array(1) {
[0]=>
string(17) "a://b/c?d=e&amp;f"
}
array(1) {
[0]=>
string(15) "a://b/c?d&amp;e"
}
array(1) {
[0]=>
string(17) "a://b/c?d=e&amp;f"
}
array(1) {
[0]=>
string(15) "a://b/c?d&amp;e"
}
array(1) {
[0]=>
string(7) "a://b?c"
}
array(1) {
[0]=>
string(9) "a://b?c=d"
}
array(1) {
[0]=>
string(13) "a://b?c&amp;d"
}
array(1) {
[0]=>
string(13) "a://b?c&amp;d"
}
array(1) {
[0]=>
string(15) "a://b?c=d&amp;e"
}
array(1) {
[0]=>
string(17) "a://b?c=d&amp;e=f"
}
array(1) {
[0]=>
string(23) "a://b?c=d&amp;e=f&amp;g"
}
array(1) {
[0]=>
string(9) "a://b?c#d"
}
array(1) {
[0]=>
string(20) "www.domain.co.uk?foo"
}
array(1) {
[0]=>
string(11) "matches.too"
}
'~~~should.match~~~ The problem occures if there is a comma after the URL, like this
This is ~~~www.domain.co.uk~~~, the comma gets included in the link.
Same for full stops,
~~~a://c~~~
~~~a.b~~~
a..b (no match, nothing between a dot group)
a..~~~partial.match~~~
~~~900.g~~~
~~~a://b.c~~~
~~~a://b.c:0~~~
~~~b.c:6574~~~
~~~0.0.0.0~~~,
~~~255.225.215.205~~~.
215.245.256.128 (no match, out of range)
128.012.12.12 (no match, leading zero)
-~~~partial://match~~~
~~~partial.match~~~.9
~~~a-b://c.d~~~
~~~a-b://c~~~
~~~_.b~~~
~~~_._.c~~~
~~~__.b~~~
~~~-.b~~~
~~~-.-.c~~~
~~~a://_.c~~~
~~~a://_._.d~~~
~~~a://b.c~~~:no-match
~~~a://-.c~~~
~~~a://b.c/~~~
~~~a://b.c/d~~~
~~~a.b/c~~~
~~~a://b/c~~~
no/match
~~~a://b/c/d~~~
~~~a://b/c/d/~~~
~~~a://b/c/=~~~
~~~a://b/c/=/~~~
~~~a://b/c/=/d~~~
~~~a://b/c/=/d/~~~
~~~a://b/c/%2~~~
~~~a://b/c/%2/~~~
~~~a://b/c/%2/d~~~
~~~a://b/c/%2/d/~~~
~~~a://b/c/%20~~~
~~~a://b/c/%20/~~~
~~~a://b/c/%20/d~~~
~~~a://b/c/%20/d/~~~
~~~a://b/=~~~
~~~a://b/=/~~~
~~~a://b/=/c~~~
~~~a://b/=/c/~~~
~~~a://b/+~~~
~~~a://b/+/~~~
~~~a://b/+/c~~~
~~~a://b/+/c/~~~
~~~a://b/~~~. (dot excluded)
~~~a://b/./~~~
~~~a://b/./c~~~
~~~a://b/./c/~~~
~~~a://b/.hidden~~~
~~~a://b/.hidden/~~~
~~~a://b/(~~~
~~~a://b/(/~~~
~~~a://b/(.)~~~
~~~a://b//~~~
~~~a://b//c~~~
~~~a://b//c/~~~
~~~a://b//c//d~~~
~~~a://b/c?d~~~? (partial match, no secondary question-mark.)
~~~a://b/c?d~~~?e? (partial match, no secondary question-marks.)
~~~a://b/c?d=e~~~
~~~a://b/c?d=e&amp;f~~~
~~~a://b/c?d&amp;e~~~
~~~a://b/c?d=e&amp;f~~~
~~~a://b/c?d&amp;e~~~
~~~a://b?c~~~
~~~a://b?c=d~~~
~~~a://b?c&amp;d~~~
~~~a://b?c&amp;d~~~
~~~a://b?c=d&amp;e~~~
~~~a://b?c=d&amp;e=f~~~
~~~a://b?c=d&amp;e=f&amp;g~~~
~~~a://b?c#d~~~
~~~www.domain.co.uk?foo~~~.
How can i change the regex above to avoid this? ~~~matches.too~~~'
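For the original question, the piece of the big pattern that does the work is the trailing `(?<![,.])` lookbehind: the match may not end in a comma or period, so the engine backs off such characters. As a rough sketch, the same idea can be bolted onto the simpler phpBB-style pattern from the first post. The sample strings are assumptions for illustration, and only the matching half is shown.

```php
<?php
// phpBB-style pattern plus a negative lookbehind at the end of the URL group:
// the greedy character class first swallows the trailing "," or ".", the
// lookbehind rejects that ending, and backtracking gives the character back.
$pattern = '#(^|[\n ])(\w+?://[^ "\n\r\t<]*(?<![,.]))#is';

// Illustrative inputs (assumptions).
$samples = [
    'This is http://www.domain.co.uk, with a comma.',
    'Same for full stops, http://www.domain.co.uk.',
    'Inner dots survive: http://www.domain.co.uk/a.b?c=d',
];

$urls = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        $urls[] = $m[2]; // captured URL without trailing "," or "."
    }
}
print_r($urls);
```

Only the final character is constrained, so dots and commas inside the URL are left alone.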