
Uncoded [url] tags :)

Posted: Fri Aug 07, 2009 10:25 am
by jackpf
Good afternoon everyone :)

Ok, so for my forum, I'm trying to use some regex that will wrap uncoded urls (ie, someone has posted a url, but forgotten to put tags around it) with [url] tags. So b ... nd without after it.

Here's what I have so far:

Code: Select all

$code = preg_replace('/^(?!\[url(\=.*?)?\](\s)?)(http|https|ftp)\:\/\/+([a-z0-9\-\_\.]+)(\.[a-z]+){1,2}(.*?)($|\s)/', '[url]$3://$4$5[/url]$6', $code);
 
Basically, this works, but only if the url is at the beginning of the string. Take this example:

Code: Select all

<?php
function foo($code)
{
    return preg_replace('/^(?!\[url(\=.*?)?\](\s)?)(http|https|ftp)\:\/\/+([a-z0-9\-\_\.]+)(\.[a-z]+){1,2}(.*?)($|\s)/', '[url]$3://$4$5[/url]$6', $code);
}
 
echo foo('http://google.co.uk').'<br />';
echo foo('[ url]http://google.co.uk[/url] (without the space)').'<br />';
echo foo('something before...so this should not work :( http://google.co.uk').'<br />';

You (should) see that it works as long as there isn't anything before the url, i.e. it only works if the url is at the beginning of the string. So, the third example won't work. I have a feeling it's something to do with the "^" at the beginning, but without that, it will perform the replacement even if the url is already wrapped with [url] tags.

So yeah...that's my dilemma. Any help would be greatly appreciated.

Thanks, all the best,
Jack.

Re: Uncoded [url] tags :)

Posted: Fri Aug 07, 2009 10:52 am
by Eric!
Is this what you want?

Code: Select all

return preg_replace('/(?!\[url(\=.*?)?\](\s)?)(HTTP|HTTPS|FTP|http|https|ftp)\:\/\/+([a-zA-Z0-9\-\_\.]+)(\.[a-zA-Z]+){1,2}(.*?)($|\s)/', '[url]$3://$4$5[/url]$6', $code);
Returns (I added the space so the tags would show up):

Code: Select all

[ url]http://google.co.uk[/url]
[ url][url]http://google.co.uk[/url][/url](without the space)
something before...so this should not work :( [ url]http://google.co.uk[/url]
I don't quite get your case with the space, did you want the regex to remove that too? I also added the case for capital letters.

By the way, I saw something cool the other day. On a board where you post a link, the board would go fetch the page's title and put that in the text for the anchor tag. So instead of URL gibberish you had the title of the page displayed. Maybe that isn't anything new, but I thought it was kinda nice.

Re: Uncoded [url] tags :)

Posted: Fri Aug 07, 2009 1:24 pm
by prometheuzz
Jack,

I too don't quite get what the problem is. In your code you make a comment "so this should not work", which it doesn't, so what's wrong?
Also, your regex pattern can be compressed quite a bit:

- many of the characters you escape, need no escaping;
- if you're matching slashes, using a different delimiter than '/' would make your regex more readable;
- something like '(http|https|ftp)' is the same as '(https?|ftp)'
- '(\s)?' is the same as '\s?'
- at the end, you match '(.*?)($|\s)' which you then replace with '$6' in the replacement string. Why? You might as well leave that part out.

Having said all that, this regex does exactly the same as your original, but is, IMO, far more readable (and therefore maintainable!):

Code: Select all

$code = preg_replace('#^(?!\[url(=.*?)?]\s?)(https?|ftp)://+([a-z0-9_.-]+)(\.[a-z]+){1,2}#', '[url]$2://$3$4[/url]', $code);
There are still quite a few things I'd do differently, but before commenting more on it, I'd like to know what exactly you're trying to do.
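One way to check how closely the two patterns agree is to run both over the same inputs. This is only a sketch and not code from the thread: the function names `wrap_original` and `wrap_shortened` and the last input string are made up for illustration.

```php
<?php
// Jack's original pattern, copied from the first post.
function wrap_original(string $code): string
{
    return preg_replace('/^(?!\[url(\=.*?)?\](\s)?)(http|https|ftp)\:\/\/+([a-z0-9\-\_\.]+)(\.[a-z]+){1,2}(.*?)($|\s)/', '[url]$3://$4$5[/url]$6', $code);
}

// The shortened pattern from this post.
function wrap_shortened(string $code): string
{
    return preg_replace('#^(?!\[url(=.*?)?]\s?)(https?|ftp)://+([a-z0-9_.-]+)(\.[a-z]+){1,2}#', '[url]$2://$3$4[/url]', $code);
}

$inputs = [
    'http://google.co.uk',
    '[url]http://google.co.uk[/url]',
    'something before http://google.co.uk',
    'http://google.co.uk/a b', // hypothetical input with a path and a space
];

// Print both results for each input so they can be compared by eye.
foreach ($inputs as $input) {
    echo wrap_original($input), "\n", wrap_shortened($input), "\n\n";
}
```

On the first three inputs both functions return the same string. On the last one they differ: the original consumes the path and the following space, then puts back only the path (group 6), giving `[url]http://google.co.uk[/url]/ab`, while the shortened pattern leaves everything after the domain untouched and gives `[url]http://google.co.uk[/url]/a b`.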

HTH.

Re: Uncoded [url] tags :)

Posted: Fri Aug 07, 2009 1:27 pm
by prometheuzz
Eric! wrote:Is this what you want?

Code: Select all

return preg_replace('/(?!\[url(\=.*?)?\](\s)?)(HTTP|HTTPS|FTP|http|https|ftp)\:\/\/+([a-zA-Z0-9\-\_\.]+)(\.[a-zA-Z]+){1,2}(.*?)($|\s)/', '[url]$3://$4$5[/url]$6', $code);
...

I also added the case for capital letters.
What about 'httP'?
Adding the ignore-case-flag at the end would be a better approach, IMO:

Code: Select all

return preg_replace('/...some regex.../i', '[url]$3://$4$5[/url]$6', $code); // the 'i' at the end makes it ignore cases
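To make the difference concrete, here is a small sketch comparing the spelled-out alternation with the `i` modifier. The function names and the list of scheme spellings are invented for the example.

```php
<?php
// Every upper- and lower-case spelling listed by hand, as in Eric!'s pattern.
function matches_explicit(string $scheme): bool
{
    return preg_match('/^(HTTP|HTTPS|FTP|http|https|ftp)$/', $scheme) === 1;
}

// One lower-case alternation plus the case-insensitive modifier.
function matches_with_flag(string $scheme): bool
{
    return preg_match('/^(https?|ftp)$/i', $scheme) === 1;
}

foreach (['http', 'HTTPS', 'httP', 'Ftp'] as $scheme) {
    printf("%-6s explicit: %s, i flag: %s\n",
        $scheme,
        matches_explicit($scheme) ? 'yes' : 'no',
        matches_with_flag($scheme) ? 'yes' : 'no');
}
```

The hand-written list accepts only the all-lower and all-upper spellings, so 'httP' and 'Ftp' slip through, while the `i` version accepts all four.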

Re: Uncoded [url] tags :)

Posted: Fri Aug 07, 2009 8:07 pm
by Eric!
prometheuzz wrote:Adding the ignore-case-flag at the end would be a better approach,
Only if you remember the "i" thing...which I forgot. I don't use regex much.

Re: Uncoded [url] tags :)

Posted: Sat Aug 08, 2009 5:31 pm
by ridgerunner
First off, there is a fundamental problem with your regex. After the initial "match the beginning of the string" (i.e. '^'), the regex starts off with the following negative lookahead:

Code: Select all

(?!\[url(\=.*?)?\](\s)?)
What this says is: "ensure that at this position (the beginning of the string), the following text is not a URL BBCode opening tag". The portion of the regex right after this negative assertion matches the url scheme - i.e. one of the following: http, https or ftp. Well, if the text does match one of the url schemes, then it will certainly never have matched a BBCode URL opening tag! Thus this negative lookahead assertion in the regex does absolutely nothing and serves no useful purpose.
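A rough way to see this for yourself: run the same URL pattern with and without the leading negative lookahead and compare what each one finds. The URL part below is a simplified stand-in (an assumption, not the exact pattern from the first post), and `find_urls` is a helper made up for this sketch.

```php
<?php
// Simplified URL matcher used only for this comparison (an assumption).
const URL_PART = '(?:https?|ftp)://[a-z0-9_.-]+(?:\.[a-z]+){1,2}';

// Return every full match of $pattern in $text.
function find_urls(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$withLookahead    = '#(?!\[url(?:=.*?)?\]\s?)' . URL_PART . '#i';
$withoutLookahead = '#' . URL_PART . '#i';

$inputs = [
    'see http://google.co.uk',
    '[url]http://google.co.uk[/url]',
];

foreach ($inputs as $input) {
    // Both patterns find the same URL, wrapped in tags or not.
    var_dump(find_urls($withLookahead, $input) === find_urls($withoutLookahead, $input));
}
```

Both patterns find `http://google.co.uk` in the already-tagged string, because the lookahead is tested at the position where "http" starts, and no "[url]" can start there.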

What you really need to do is instead use a negative lookbehind, which essentially says: "make sure that the beginning of a url is not preceded by either '[ url ]' or '[ url=', and also ensure that the end of the url is not followed by '[ /url ]'." I think what you are looking for is something more like this:

Code: Select all

// here is our fully commented regex (note escaped # char in char classes)
$pattern_long = '{
(?<!\[url\]|\[url=)  # ensure url is not preceded by BBCode opening URL tag
\b                   # ensure url begins on a word boundary
(?>                  # start an atomic group (so that negative lookahead will work)
  (                  # capture url into group 1
    (?:https?|ftp)://[-A-Z0-9+&@\#/%?=~_|$!:,.;]*[A-Z0-9+&@\#/%=~_|$]  # match a URL
  )                  # end group 1 capture of url
)                    # end atomic group (so that negative lookahead will work)
(?!\[/url\])         # AND ensure url is not followed by BBCode closing URL tag
}ix';
 
// short version of same regex
$pattern_short = '{(?<!\[url\]|\[url=)\b(?>((?:https?|ftp)://[-A-Z0-9+&@\#/%?=~_|$!:,.;]*[A-Z0-9+&@\#/%=~_|$]))(?!\[/url\])}i';
 
// ok, lets wrap all naked URLs in BBCode URL tags
$text = preg_replace($pattern_short, '[url]$1[/url]', $text);
 
I've added a negative lookbehind before the url and a negative lookahead after the url and have removed the ^ from the start of the regex (I assume you wish to find unwrapped urls at positions other than just at the beginning of the string).

Unfortunately, a lookbehind cannot have an unlimited quantifier (such as a '*' or '+' or '{n,}'), so you cannot write a regex that can ensure that the url is not preceded by a '[ url=...]'. However, in this case, the negative lookahead takes care of not matching this style of BBCode URL syntax. (Note however, that this requires the subtle use of atomic grouping to make it work.)
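To illustrate why the atomic group matters, here is a sketch that runs the short pattern twice, once as written and once with the `(?>...)` replaced by a plain group. The function names and the example.com inputs are invented for the illustration; the character classes are copied from the pattern above.

```php
<?php
// URL body copied from the pattern above.
const URL_BODY = '(?:https?|ftp)://[-A-Z0-9+&@\#/%?=~_|$!:,.;]*[A-Z0-9+&@\#/%=~_|$]';

// The pattern as posted: the URL sits inside an atomic group.
function wrap_atomic(string $text): string
{
    $pattern = '{(?<!\[url\]|\[url=)\b(?>(' . URL_BODY . '))(?!\[/url\])}i';
    return preg_replace($pattern, '[url]$1[/url]', $text);
}

// Same pattern without the atomic group, so the engine may backtrack into the URL.
function wrap_backtracking(string $text): string
{
    $pattern = '{(?<!\[url\]|\[url=)\b(' . URL_BODY . ')(?!\[/url\])}i';
    return preg_replace($pattern, '[url]$1[/url]', $text);
}

$tagged = '[url=http://example.com]http://example.com[/url]';
$naked  = 'see http://example.com now';

echo wrap_atomic($tagged), "\n";        // left unchanged
echo wrap_backtracking($tagged), "\n";  // wraps a shortened url
echo wrap_atomic($naked), "\n";
echo wrap_backtracking($naked), "\n";
```

Without the atomic group, the engine gives back the final "m" so that the url is no longer directly followed by "[/url]", and the second line comes out as `[url=http://example.com][url]http://example.co[/url]m[/url]`. With the atomic group that retreat is not allowed, so the tagged string is left alone. Both versions turn the naked example into `see [url]http://example.com[/url] now`.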

Hope this helps...

Edit 2010-07-20: Removed link to downloadable script

Re: Uncoded [url] tags :)

Posted: Sat Aug 08, 2009 5:38 pm
by Eran
all this discussion has reminded me of a semi-famous quote:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Re: Uncoded [url] tags :)

Posted: Sun Aug 09, 2009 11:03 am
by jackpf
Wow - so many replies :D I couldn't get online yesterday...sorry all.

Ok, so,
@Eric: nice try - but that's the same as my original code, except without the "^" at the beginning, which, as I explained, will also match links with url tags already around them. E.g., try this: '[url]http://google.com[/url]' (it should double up the url tags). Thanks anyway :)

Oh, and the case with the space - I only put the space there so that it wouldn't be converted into bbcode. The idea was to demonstrate that it ignored urls with tags already there.

About the "nice urls" - I actually have a function to do that!! I just use cURL to visit the page the link is pointing to, and extract the title from the page. I'd be happy to share it with you if you want... In fact, you can check out my forum - http://jackpf.co.uk/forum. If you post a link, it should attempt to grab the title for the link.

@prometheuzz: I generally like to escape all non alphanumeric characters...It makes me feel safe ^^
And I need to match that at the end, so that it either matches a space, or the end of the string. If it's a space, I need to put that back in, so it....has the space there :)

And yes, as I was reading through Eric!'s post, I did think why not just use the i modifier. :D

@ridgerunner: that post was immense. Your code works perfectly :) I don't fully understand it I must admit. Regex is not my strong point...but I will research it, and try to understand it. I am in your debt :bow:

And finally, @pytrin: xD Nice quote.

Thanks for your help everyone. I appreciate it :)