
Find all URLs to shorten for Twitter

Posted: Sun Sep 05, 2010 8:04 am
by richardxthripp
I am using the "Shorten URL in a text using Bit.ly" PHP class by Muhammad Arfeen, and I would like to replace his URL-matching code with Philippe Leybaert's. The original PHP code is below:

Code:

preg_match_all('(((f|ht){1}(tp://|tps://))[-a-zA-Z0-9@:%_+.~#?&//=]+)', $text, $hyperlinksArray);
Here is the same code with Philippe Leybaert's regular expression:

Code:

preg_match_all('\b(https|http)://([a-zA-Z-:@.0-9]+)(/((\([-;:@&=a-zA-Z0-9$_.+!*\',]*?\))|[-;:@&=?a-zA-Z0-9$_.+!*\',]|%\d\d)+)?(?<![,.;])', $text, $hyperlinksArray);
Since he did not enclose his code in quotes, I used single quotes like in the original code and escaped the two single quotes in the regex with backslashes.

My problem is that every time this code runs, it returns “Warning: preg_match_all() [function.preg-match-all]: Delimiter must not be alphanumeric or backslash in [file] on [line]” and no replacements are made. Is this an issue with the regex itself, with escaping, or with PHP? I am using PHP 5. An example of $text would be "Check out this cool Wikipedia page: http://en.wikipedia.org/wiki/Textile_(disambiguation)", and $hyperlinksArray is initialized as `$hyperlinksArray = array();`.

The problem with the original regex is that it does not work with parentheses, and it includes trailing periods, commas, semicolons, and closing parentheses as part of the URL, which, in all cases except semicolons, is NOT what Twitter does. Since I am shortening URLs with Bit.ly, using the original code on my example will produce a Bit.ly short URL for the long URL http://en.wikipedia.org/wiki/Textile_ instead of http://en.wikipedia.org/wiki/Textile_(disambiguation) as it should. Philippe's code would solve these problems and several others, if I could get it to work.
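To illustrate, here is a quick reproduction of the truncation (my own test snippet, using the original pattern, where the outer parentheses act as PCRE's bracket-style delimiters):

```php
<?php
// Reproduce the truncation with the original pattern. The outermost '(' and
// ')' serve as the regex delimiters; PHP balances the nested parentheses.
$text = 'Check out this cool Wikipedia page: '
      . 'http://en.wikipedia.org/wiki/Textile_(disambiguation)';
preg_match_all(
    '(((f|ht){1}(tp://|tps://))[-a-zA-Z0-9@:%_+.~#?&//=]+)',
    $text,
    $hyperlinksArray
);
// The character class contains no parentheses, so the match stops at '(':
echo $hyperlinksArray[0][0]; // http://en.wikipedia.org/wiki/Textile_
```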

Re: Find all URLs to shorten for Twitter

Posted: Sun Sep 05, 2010 9:09 am
by Weirdan
That is the issue with your regex (in fact, with both of them). In PHP, a PCRE pattern must be enclosed in a pair of delimiters, like this:

Code:

preg_match('/\b+/', $something); // here, '/' is the delimiter
preg_match('%\b+%', $something); // here, '%' is used as a delimiter
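A further note (my own example, not from the code in question): if the delimiter character also occurs inside the pattern, it has to be backslash-escaped there, so it is usually easiest to pick a delimiter that never appears in the pattern at all:

```php
<?php
// With '/' as the delimiter, every literal '/' in the pattern must be escaped:
preg_match('/^https:\/\/example\.com\//', 'https://example.com/page', $m1);
// A delimiter such as '~' that never occurs in the pattern avoids the escaping:
preg_match('~^https://example\.com/~', 'https://example.com/page', $m2);
echo $m1[0], "\n", $m2[0]; // prints https://example.com/ twice
```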

Re: Find all URLs to shorten for Twitter

Posted: Sun Sep 05, 2010 9:49 am
by richardxthripp
Thanks... That makes sense. I tried using % but got the error "Unknown modifier '\' in [file] on [line]" (presumably because the pattern itself contains an unescaped % in the %\d\d part, which ends the pattern early), so I used ~ as my delimiter. The code no longer generates errors, but apparently the actual regex is also flawed:

Code:

preg_match_all('~\b(https|http)://([a-zA-Z-:@.0-9]+)(/((\([-;:@&=a-zA-Z0-9$_.+!*\',]*?\))|[-;:@&=?a-zA-Z0-9$_.+!*\',]|%\d\d)+)?(?<![,.;])~', $text, $hyperlinksArray);

The code integrated with the Bit.ly class turns http://en.wikipedia.org/wiki/Textile_(disambiguation) into http://bit.ly/9bvDhz/Textile_(disambiguation), and http://www.google.com/ into http://bit.ly/cEf7lB/. Apparently, if the URL ends in a trailing slash or contains more than one subdirectory, that part is cut off: http://en.wikipedia.org/wiki/Textile_(disambiguation) becomes http://en.wikipedia.org/wiki before shortening. This works fine with URLs like the ones on this forum, but not with most URLs. Can you help me fix this regex or recommend a better expression?
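Here is a minimal reproduction of what I am seeing, outside the Bit.ly class (my guess at the cause: as I have transcribed it, neither path character class contains '/', so only the first path segment can ever match):

```php
<?php
// Reproduce the truncation with the tilde-delimited pattern from above.
$re = '~\b(https|http)://([a-zA-Z-:@.0-9]+)(/((\([-;:@&=a-zA-Z0-9$_.+!*\',]*?\))'
    . '|[-;:@&=?a-zA-Z0-9$_.+!*\',]|%\d\d)+)?(?<![,.;])~';
preg_match($re, 'http://en.wikipedia.org/wiki/Textile_(disambiguation)', $m);
// The match stops at the second '/', keeping only the first path segment:
echo $m[0]; // http://en.wikipedia.org/wiki
```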

Re: Find all URLs to shorten for Twitter

Posted: Mon Sep 06, 2010 5:20 am
by ridgerunner
Both of the regexes you cite have problems (too many for me to go into detail here). But regarding the problem of extracting a URL from text to be "linkified", I've spent considerable time recently working on a solution. First, take a look at this thread which discusses this topic in detail: "The Problem With URLs". To summarize, here are the primary difficulties:
  1. A link may be immediately followed by punctuation that should not be included as part of the link (even though the punctuation character is itself a valid URL character). And the punctuation may be combined with a closing quote character. e.g.

    Code:

    "This link: http://example.com/ends/with/a/comma, should not be
    confused with this one: http://example.com/ends/with/a/period/then/a/quote."
  2. A link is frequently enclosed within (parentheses), <angle brackets>, [square brackets], 'single quotes' or "double quotes". Most of these delimiters are themselves valid URL characters that must not be included as part of the link. e.g. Here's a particularly difficult one, wrapped in parentheses, which contains a single quote and parentheses:

    Code:

    (http://example.com/path/(file's_name.txt))
  3. Any link that has already been linkified (i.e. <a href="http://example.com">LINK</a>) should be ignored by the matching regex.
The first problem is not terribly difficult to solve (one of your regexes addresses it using a negative lookbehind). The second and third problems are quite a bit trickier, especially if you are using a regex flavor, such as JavaScript's, that does not have lookbehind. To make a long story short, I've come up with a (non-trivial) regex that solves these problems and reliably picks out URLs, and have posted a short page which demonstrates its usage: linkify.html. I've released the code as open source, and you can download the JavaScript and PHP scripts from the GitHub repository.
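To illustrate just the first problem, here is a minimal sketch (this is NOT my actual linkify pattern, just a toy example of the technique): a trailing negative lookbehind forces the match to back off any final sentence punctuation:

```php
<?php
// Toy demonstration of problem 1 only. The greedy character class swallows
// the final '.', then the negative lookbehind makes the engine backtrack
// one character so the period is excluded from the match.
preg_match('~https?://[-\w./]+(?<![.,;])~',
           'See http://example.com/ends/with/a/period.',
           $m);
echo $m[0]; // http://example.com/ends/with/a/period
```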

Hope this helps!
:)

Re: Find all URLs to shorten for Twitter

Posted: Wed Sep 08, 2010 4:33 pm
by richardxthripp
Thanks, that looks like just what I was looking for. I will likely be using it in a project released this week (Tweet This 1.7.4 plugin for WordPress) and I will credit you. :)

Re: Find all URLs to shorten for Twitter

Posted: Thu Sep 09, 2010 3:17 pm
by ridgerunner
You're welcome. Be sure to grab the latest version - I made significant changes to the script yesterday to fix a bug (it wasn't allowing certain HTML entities inside the URL). The latest corrected version is Version 1.0 20100908_1700.

I hope someone gets some good use out of it!
:)

Re: Find all URLs to shorten for Twitter

Posted: Thu Sep 09, 2010 11:05 pm
by richardxthripp
I used the latest version and released Tweet This 1.7.4 with it. I changed your function to just add a space around the URLs it finds and then I use a different function to convert those to Bit.ly URLs.
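In outline, the change looks something like this (a simplified sketch with a toy URL pattern; the hypothetical tt_delimit_urls_sketch function stands in for the real code, which uses the LinkifyURL regex):

```php
<?php
// Sketch of the approach: wrap each detected URL in spaces so a later pass
// can split on whitespace and hand each URL to the Bit.ly shortener.
// (toy URL pattern for illustration only, not the LinkifyURL regex)
function tt_delimit_urls_sketch($text) {
    return preg_replace('~\bhttps?://[-\w./?&=%#]+~', ' $0 ', $text);
}
echo tt_delimit_urls_sketch('Go to http://example.com/page now');
```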

Re: Find all URLs to shorten for Twitter

Posted: Mon Sep 13, 2010 11:43 am
by ridgerunner
I just looked at the code of your project. Although you mention my name (thanks for that), you neglected to add the required copyright notice. I released this code under the (very liberal) MIT license, and you *must* include a copy of that (two-line) notice if you copy and paste the code.

My main concern is that anyone who reads your code should be able to find the original work (in case I find and fix some bugs that your code has not kept up with). So in your next release, please add the required copyright notice and include a link to the GitHub project page, i.e. something like this:

Code:

/**
 * Delimits URLs by adding a space on each side.
 * This function based on: http://github.com/jmrware/LinkifyURL
 * Copyright:   (c) 2010 Jeff Roberson - http://jmrware.com
 * MIT License: http://www.opensource.org/licenses/mit-license.php
 */
function tt_delimit_urls($text, $delimiter = ' ') { ...
I just changed my source code to include a link to the GitHub page in the comment header of each regex, so this shouldn't happen again...
But no biggie really...
:)

Re: Find all URLs to shorten for Twitter

Posted: Tue Sep 14, 2010 11:41 pm
by richardxthripp
Okay, I will do this on Friday. You should consider using the GPL, as it is more widely known, I think.

Re: Find all URLs to shorten for Twitter

Posted: Wed Sep 15, 2010 7:47 pm
by ridgerunner
Actually, the MIT/X11 license is freer and less restrictive than the GPL (and yet is also compatible with it).
Check out this article... Why I Don't Use the GPL.
:)

Re: Find all URLs to shorten for Twitter

Posted: Mon Sep 20, 2010 12:37 am
by richardxthripp
Thanks for the reply; I forgot to check this thread again. I noticed the jQuery library says "Dual licensed under the MIT or GPL Version 2 licenses", so you could consider that as well. Anyway, I am keeping my project GPL because most WordPress plugins do that, and I don't think anyone would want to use any of my code in a closed-source project anyway.