Find all URLs to shorten for Twitter

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

richardxthripp
Forum Newbie
Posts: 6
Joined: Sun Sep 05, 2010 7:57 am

Find all URLs to shorten for Twitter

Post by richardxthripp »

I am using the Shorten URL in a text using Bit.ly PHP class by Muhammad Arfeen and I would like to replace his code with Philippe Leybaert's code. The original PHP code is below:

Code: Select all

preg_match_all('(((f|ht){1}(tp://|tps://))[-a-zA-Z0-9@:%_+.~#?&//=]+)', $text, $hyperlinksArray);
Here is the same code with Philippe Leybaert's regular expression:

Code: Select all

preg_match_all('\b(https|http)://([a-zA-Z-:@.0-9]+)(/((\([-;:@&=a-zA-Z0-9$_.+!*\',]*?\))|[-;:@&=?a-zA-Z0-9$_.+!*\',]|%\d\d)+)?(?<![,.;])', $text, $hyperlinksArray);
Since he did not enclose his code in quotes, I used single quotes like in the original code and escaped the two single quotes in the regex with backslashes.

My problem is that every time this code runs, it returns “Warning: preg_match_all() [function.preg-match-all]: Delimiter must not be alphanumeric or backslash in [file] on [line]” and no replacements are made. Is this an issue with the regex itself, with escaping, or with PHP? I am using PHP 5. An example of $text would be "Check out this cool Wikipedia page: http://en.wikipedia.org/wiki/Textile_(disambiguation)", and the last argument is initialized beforehand with `$hyperlinksArray = array();`.

The problem with the original regex is that it does not work with parentheses, and it treats periods, commas, semicolons, and closing parentheses at the end of a URL as part of the URL, which, in all cases except semicolons, is NOT what Twitter does. Since I am shortening URLs with Bit.ly, using the original code on my example will produce a Bit.ly short URL for the long URL http://en.wikipedia.org/wiki/Textile_ instead of http://en.wikipedia.org/wiki/Textile_(disambiguation) as it should. Philippe's code would solve these problems and several others if I could get it to work.
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: Find all URLs to shorten for Twitter

Post by Weirdan »

This is an issue with the regex (in fact, with both of them). A proper regex must be wrapped in delimiters, like this:

Code: Select all

preg_match('/\b+/', $something); // here, '/' is the delimiter
preg_match('%\b+%', $something); // here, '%' is used as a delimiter
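To make the delimiter rule concrete, here is a minimal sketch with three cases: no delimiter, a delimiter that also occurs inside the pattern, and a delimiter that does not. The sample text and the simplified URL patterns are assumptions made up for illustration; they are not the regexes quoted in this thread.

```php
<?php
// Hypothetical sample text, made up for this illustration.
$text = 'Check out http://example.com/a%20b today';

// 1. No delimiter: the leading "\" is taken as the delimiter and rejected.
//    The @ operator hides the warning so the return value can be inspected.
$noDelimiter = @preg_match_all('\bhttps?://\S+', $text, $m1);

// 2. "%" as delimiter, but "%" also occurs inside the pattern. The pattern
//    ends at the second "%" and the rest is read as modifiers, which gives
//    an "Unknown modifier" warning.
$collision = @preg_match_all('%\bhttps?://(?:[-\w./]|%\d\d)+%', $text, $m2);

// 3. "~" does not occur in the pattern, so it works as the delimiter.
$ok = preg_match_all('~\bhttps?://(?:[-\w./]|%\d\d)+~', $text, $m3);

var_dump($noDelimiter, $collision, $ok); // bool(false), bool(false), int(1)
print_r($m3[0]);                         // http://example.com/a%20b
```

Any character that is not alphanumeric, a backslash, or whitespace can serve as the delimiter, so picking one that never appears in the pattern avoids having to escape it.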
richardxthripp
Forum Newbie
Posts: 6
Joined: Sun Sep 05, 2010 7:57 am

Re: Find all URLs to shorten for Twitter

Post by richardxthripp »

Thanks... That makes sense. I tried using % but got the error "Unknown modifier '\' in [file] on [line]" so I used ~ as my delimiter. The code does not generate errors now but apparently the actual regex is also flawed:

Code: Select all

preg_match_all('~\b(https|http)://([a-zA-Z-:@.0-9]+)(/((\([-;:@&=a-zA-Z0-9$_.+!*\',]*?\))|[-;:@&=?a-zA-Z0-9$_.+!*\',]|%\d\d)+)?(?<![,.;])~', $text, $hyperlinksArray);

The code integrated with the Bit.ly class turns http://en.wikipedia.org/wiki/Textile_(disambiguation) into http://bit.ly/9bvDhz/Textile_(disambiguation) and http://www.google.com/ into http://bit.ly/cEf7lB/. Apparently, if the URL ends in a trailing slash or contains more than one subdirectory, that part is cut off: the match for http://en.wikipedia.org/wiki/Textile_(disambiguation) is only http://en.wikipedia.org/wiki. This works fine with URLs like the ones on this forum, but not with most URLs. Can you help me fix this regex or recommend a better expression?
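A likely cause of the truncation is that the path character class contains no "/", and the "+" after the first "/" requires at least one more character. The sketch below uses a reduced stand-in for the pattern (an assumption: non-capturing groups, and the hyphen moved to the front of each class) and compares it with one possible adjustment that adds "/" to the class and relaxes "+" to "*". It is meant to show the effect, not to be a complete URL matcher.

```php
<?php
$text = 'http://en.wikipedia.org/wiki/Textile_(disambiguation) and http://www.google.com/';

// Reduced stand-in: path class without "/", and "+" after the first slash.
$original = '~\bhttps?://[-a-zA-Z0-9.:@]+(?:/(?:\([-;:@&=a-zA-Z0-9$_.+!*\',]*?\)|[-;:@&=?a-zA-Z0-9$_.+!*\',]|%\d\d)+)?(?<![,.;])~';

// Possible adjustment: "/" added to the path class, "+" relaxed to "*".
$adjusted = '~\bhttps?://[-a-zA-Z0-9.:@]+(?:/(?:\([-;:@&=a-zA-Z0-9$_.+!*\',]*?\)|[-;:@&=?/a-zA-Z0-9$_.+!*\',]|%\d\d)*)?(?<![,.;])~';

preg_match_all($original, $text, $before);
preg_match_all($adjusted, $text, $after);

print_r($before[0]); // http://en.wikipedia.org/wiki, http://www.google.com
print_r($after[0]);  // both URLs in full, including the trailing slash
```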
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Find all URLs to shorten for Twitter

Post by ridgerunner »

Both of the regexes you cite have problems (too many for me to go into detail here). But regarding the problem of extracting a URL from text to be "linkified", I've spent considerable time recently working on a solution. First, take a look at this thread which discusses this topic in detail: "The Problem With URLs". To summarize, here are the primary difficulties:
  1. A link may be immediately followed by punctuation that should not be included as part of the link (even though the punctuation character is itself a valid URL character). And the punctuation may be combined with a closing quote character. e.g.

    Code: Select all

    "This link: http://example.com/ends/with/a/comma, should not be
    confused with this one: http://example.com/ends/with/a/period/then/a/quote."
  2. A link is frequently enclosed within (parentheses), <angle brackets>, [square brackets], 'single quotes' or "double quotes". Most of these delimiters are themselves valid URL characters that must not be included as part of the link. e.g. Here's a particularly difficult one, wrapped in parentheses, which contains a single quote and parentheses:

    Code: Select all

    (http://example.com/path/(file's_name.txt))
  3. Any link that has already been linkified (i.e. <a href="http://example.com">LINK</a>) should be ignored by the matching regex.
The first problem is not terribly difficult to solve (one of your regexes solves it using negative lookbehind). The second and third problems are quite a bit trickier, especially in a regex flavor such as JavaScript that does not have lookbehind. To make a long story short, I've come up with a (non-trivial) regex that solves these problems and reliably picks out URLs, and have posted a short page which demonstrates its usage: linkify.html. I've released the code as open source and you can download the JavaScript and PHP scripts from the GitHub repository.
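As a rough illustration of the first two difficulties, the sketch below applies a deliberately simple pattern with a negative lookbehind to the two examples above. The pattern is an assumption made up for this illustration and is not the LinkifyURL regex. It trims trailing punctuation (problem 1) but still swallows the closing parenthesis that wraps the link (problem 2).

```php
<?php
// Simplified illustration only: a run of non-space characters, excluding
// double quotes and angle brackets, that does not end in sentence punctuation.
$pattern = '~\bhttps?://[^\s"<>]+(?<![.,;:!?])~';

// Problem 1: trailing comma, and a period followed by a closing quote.
$sentence = '"This link: http://example.com/ends/with/a/comma, should not be '
          . 'confused with this one: http://example.com/ends/with/a/period/then/a/quote."';
preg_match_all($pattern, $sentence, $trimmed);
print_r($trimmed[0]); // both links, without the "," and the "."

// Problem 2: the link is wrapped in parentheses and also contains them.
$wrapped = "(http://example.com/path/(file's_name.txt))";
preg_match_all($pattern, $wrapped, $overrun);
print_r($overrun[0]); // the match wrongly includes the final ")"
```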

Hope this helps!
:)
richardxthripp
Forum Newbie
Posts: 6
Joined: Sun Sep 05, 2010 7:57 am

Re: Find all URLs to shorten for Twitter

Post by richardxthripp »

Thanks, that looks like just what I was looking for. I will likely be using it in a project released this week (Tweet This 1.7.4 plugin for WordPress) and I will credit you. :)
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Find all URLs to shorten for Twitter

Post by ridgerunner »

You're welcome. Be sure to grab the latest version - I made significant changes to the script yesterday to fix a bug (it wasn't allowing certain HTML entities inside the URL). The latest corrected version is: Version 1.0 20100908_1700.

I hope someone gets some good use out of it!
:)
richardxthripp
Forum Newbie
Posts: 6
Joined: Sun Sep 05, 2010 7:57 am

Re: Find all URLs to shorten for Twitter

Post by richardxthripp »

I used the latest version and released Tweet This 1.7.4 with it. I changed your function to just add a space around the URLs it finds and then I use a different function to convert those to Bit.ly URLs.
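For readers who want to see the find-then-shorten idea in one place, here is a minimal single-pass sketch using preg_replace_callback. This is not how Tweet This does it (that plugin delimits the URLs first and converts them in a separate function). The function names are hypothetical, example_shorten() is a stand-in that makes no Bit.ly request, and the URL pattern is a simplified assumption rather than the LinkifyURL regex.

```php
<?php
// Hypothetical stand-in for a Bit.ly API call; it makes no network request
// and only fakes a short URL from a hash of the long one.
function example_shorten($url) {
    return 'http://bit.ly/' . substr(md5($url), 0, 6);
}

// Replaces every URL matched by a simplified pattern with its short form.
function example_shorten_urls($text) {
    return preg_replace_callback(
        '~\bhttps?://[^\s"<>]+(?<![.,;:!?])~',
        function ($match) {
            return example_shorten($match[0]);
        },
        $text
    );
}

$tweet = example_shorten_urls(
    'Check out http://en.wikipedia.org/wiki/Textile_(disambiguation).'
);
echo $tweet, "\n"; // the long URL is replaced; the final period is kept
```

The anonymous function needs PHP 5.3 or later; on older versions a named callback function would be required instead.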
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Find all URLs to shorten for Twitter

Post by ridgerunner »

I just looked at the code of your project. Although you mention my name (thanks for that), you neglected to add the required copyright notice. I released this code under the (very liberal) MIT license and you *must* include a copy of that (two line) notice if you copy and paste the code.

My main concern is that anyone who reads your code should be able to find the original work (in case I find and fix some bugs that your code has not kept up with). So in your next release, please add the required copyright notice and include a link to the GitHub project page, i.e. something like this:

Code: Select all

/**
 * Delimits URLs by adding a space on each side.
 * This function based on: http://github.com/jmrware/LinkifyURL
 * Copyright:   (c) 2010 Jeff Roberson - http://jmrware.com
 * MIT License: http://www.opensource.org/licenses/mit-license.php
 */
function tt_delimit_urls($text, $delimiter = ' ') { ...
I just changed my source code to include a link to the GitHub page in the comment header of each regex, so this shouldn't happen again...
But no biggie really...
:)
richardxthripp
Forum Newbie
Posts: 6
Joined: Sun Sep 05, 2010 7:57 am

Re: Find all URLs to shorten for Twitter

Post by richardxthripp »

Okay, I will do this on Friday. You should consider using the GPL, as I think it is more widely known.
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Find all URLs to shorten for Twitter

Post by ridgerunner »

Actually, the MIT/X11 license is freer and less restrictive than the GPL (and yet is also compatible with it).
Check out this article... Why I Don't Use the GPL.
:)
richardxthripp
Forum Newbie
Posts: 6
Joined: Sun Sep 05, 2010 7:57 am

Re: Find all URLs to shorten for Twitter

Post by richardxthripp »

Thanks for the reply; I forgot to check this thread again. I noticed the jQuery library says "Dual licensed under the MIT or GPL Version 2 licenses" so you could also consider that. Anyway, I am keeping my project GPL because most WordPress plugins do that and I don't think anyone would want to use any of my code in a closed-source project anyway.