Twitter-style Hashtags

IceCreamYou
Forum Newbie
Posts: 2
Joined: Thu Oct 22, 2009 10:06 pm

Post by IceCreamYou »

Hello,

I'm trying to parse some text for something similar to Twitter's hashtags. I think the best way to illustrate this is with an example. I'm using this code to test, but it's the pattern that really matters.

Code: Select all

<pre><?php
//An array of example text.
$a = array(
  '#XXX', //A hashtag by itself.
  'llll #XXX', //A hashtag at the end of a string.
  '#XXX llll', //A hashtag at the beginning of a string.
  '#XXX#OOOO', //A hashtag with another directly appended.
  '#XX-XX', //A hashtag with hyphen(s) in it.
  '#gäb', //A hashtag with unicode character(s).
  'llll [#XXX OOOO] llll', //A hashtag with spaces (but not linebreaks) in it, surrounded by square brackets.
);
//For each of the example texts, test to see if the desired output is produced.
foreach ($a as $b) {
  $pattern = '%(\A#(\w|(\p{L}\p{M})|-)+\b)|((?<=\s)#(\w|(\p{L}\p{M})|-)+\b)|((?<=\[)#.+(?=\]))%Uu';
  preg_match_all($pattern, ($b), $matches);
  echo "$b: ". implode(', ', $matches[0]) ."\n";
}
?></pre>
This code currently produces this output:

Code: Select all

#XXX: #XXX
llll #XXX: #XXX
#XXX llll: #XXX
#XXX#OOOO: #XXX
#XX-XX: #XX
#gäb: #g
llll [#XXX OOOO] llll: #XXX OOOO
I want it to produce this output:

Code: Select all

#XXX: #XXX
llll #XXX: #XXX
#XXX llll: #XXX
#XXX#OOOO: #XXX
#XX-XX: #XX-XX
#gäb: #gäb
llll [#XXX OOOO] llll: #XXX OOOO
The part that's not working is the attempt to match unicode characters and hyphens after a hash:

Code: Select all

(\w|(\p{L}\p{M})|-)
I've also tried doing it as a character class with the same result:

Code: Select all

[\w(\p{L}\p{M})-]
Any assistance is greatly appreciated.
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Twitter-style Hashtags

Post by ridgerunner »

OK, the first problem is DON'T USE THE U UNGREEDY MODIFIER! According to Jeffrey Friedl (and I certainly agree), it is generally bad practice and will just confuse those who try to read the pattern. If you need to make a quantifier ungreedy, simply append a ? where it is needed.

In your regex, the (ungreedy) + quantifier in the first alternative fails to match the whole #XX-XX test case because it stops as soon as the \b can match, and there is a word boundary right before the dash char. Removing the U modifier fixes this first problem, because the + is then greedy and able to match the whole thing. But once you remove the global U modifier, you need to add a ? to the .+ quantifier in the last alternative (the one that matches a hashtag within square brackets), because that quantifier should stay ungreedy.
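To see the greedy/ungreedy difference in isolation, here is a minimal sketch. The stripped-down sub-patterns and the 'a [#one] b [#two] c' subject are illustrative assumptions, not part of the original script:

```php
<?php
// Illustrative sub-patterns only; not the full hashtag regex.
$subject = '#XX-XX';

// With the global U modifier, + is lazy and stops at the first \b it can reach.
preg_match('%\A#(?:\w|-)+\b%U', $subject, $lazy);
// Without U, + is greedy and only backs off as far as the last \b.
preg_match('%\A#(?:\w|-)+\b%', $subject, $greedy);

echo $lazy[0], "\n";    // #XX
echo $greedy[0], "\n";  // #XX-XX

// The bracketed alternative is the one place where laziness is wanted.
$bracketed = 'a [#one] b [#two] c';
preg_match_all('%(?<=\[)#.+(?=\])%', $bracketed, $greedyDot);
preg_match_all('%(?<=\[)#.+?(?=\])%', $bracketed, $lazyDot);

echo implode(', ', $greedyDot[0]), "\n";  // #one] b [#two
echo implode(', ', $lazyDot[0]), "\n";    // #one, #two
```

The greedy .+ runs on to the last closing bracket in the string, which is why that one quantifier gets its own ? rather than a pattern-wide modifier.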

Regarding the Unicode character mismatch: experimenting with this yielded mixed results for me (and note that I am not a Unicode expert). When I run your regex in a PHP script through a web page on Apache, it actually matches correctly as-is. When I run it as a script from a WinXP command prompt, it matches but displays a different character altogether. When I run it through RegexBuddy, I get the same mismatch you describe in your post. Using RegexBuddy, I discovered that if you make the \p{M} Unicode mark property optional (by appending a ?), the regex matches all your test cases. I think your 'ä' may be a single letter character with no separate mark character following it, in which case \p{L}\p{M} has no mark to match.
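One way to check that guess is to compare the two usual spellings of 'ä' directly. This is only a sketch: the variable names are made up for illustration, the "\u{...}" string escapes assume PHP 7 or later, and the character in your actual file may be encoded differently again.

```php
<?php
// "\u{...}" escapes need PHP 7+; they keep the exact code points visible.
$precomposed = "\u{00E4}";   // U+00E4: one letter code point, no separate mark
$decomposed  = "a\u{0308}";  // 'a' followed by U+0308 COMBINING DIAERESIS

$strict  = '%\A\p{L}\p{M}\z%u';   // a letter that must be followed by a mark
$relaxed = '%\A\p{L}\p{M}?\z%u';  // a letter with an optional mark

// preg_match() returns 1 on a match and 0 on no match.
$results = array(
  'strict / precomposed'  => preg_match($strict, $precomposed),   // 0
  'strict / decomposed'   => preg_match($strict, $decomposed),    // 1
  'relaxed / precomposed' => preg_match($relaxed, $precomposed),  // 1
  'relaxed / decomposed'  => preg_match($relaxed, $decomposed),   // 1
);

foreach ($results as $label => $matched) {
  echo "$label: $matched\n";
}
```

Both strings normally display as the same 'ä', but only the decomposed one contains a character with the mark property, so only the optional form accepts both.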

Here is the fixed version:

Code: Select all

$pattern = '%(\A#(\w|(\p{L}\p{M}?)|-)+\b)|((?<=\s)#(\w|(\p{L}\p{M}?)|-)+\b)|((?<=\[)#.+?(?=\]))%u';
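If you want to re-check it yourself, something along these lines should do. It is a rough sketch: the test strings and desired outputs come from the original post, while the $expected array and the $failures counter are illustrative additions. The file is assumed to be saved as UTF-8.

```php
<?php
// The fixed pattern, unchanged.
$pattern = '%(\A#(\w|(\p{L}\p{M}?)|-)+\b)|((?<=\s)#(\w|(\p{L}\p{M}?)|-)+\b)|((?<=\[)#.+?(?=\]))%u';

// Test string => desired output, both taken from the original post.
$expected = array(
  '#XXX'                  => '#XXX',
  'llll #XXX'             => '#XXX',
  '#XXX llll'             => '#XXX',
  '#XXX#OOOO'             => '#XXX',
  '#XX-XX'                => '#XX-XX',
  '#gäb'                  => '#gäb',
  'llll [#XXX OOOO] llll' => '#XXX OOOO',
);

// Count the cases where the actual matches differ from the desired output.
$failures = 0;
foreach ($expected as $text => $want) {
  preg_match_all($pattern, $text, $matches);
  $got = implode(', ', $matches[0]);
  if ($got !== $want) {
    $failures++;
  }
  echo "$text: $got", ($got === $want ? '' : "  (wanted: $want)"), "\n";
}
echo "$failures mismatch(es)\n";
```

Keeping the desired output next to each test string makes it easy to add more cases later without eyeballing two separate listings.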
Hope this helps!
IceCreamYou
Forum Newbie
Posts: 2
Joined: Thu Oct 22, 2009 10:06 pm

Re: Twitter-style Hashtags

Post by IceCreamYou »

Great! That works beautifully. Thanks!