REGEX of IRC raw messages


dab
Forum Newbie
Posts: 7
Joined: Tue May 20, 2008 1:09 am

Post by dab »

Hey there. I tried a quick search of this site (as well as two days of searching Google and testing things myself) to figure out how to regex an IRC message.

Before I ask this lovely community what regex I might be able to use, I would like to know one thing.
Is there a way to return everything between certain symbols?

:Nick!Host@name.here PRIVMSG #channel :Good morning world
I would LIKE to easily break the string above into its pieces (nick, host, command, channel and message).
Of course, the message portion would be an unnumbered list of items. I'm hoping something like this IS possible. I've tried this regex, but it only returns the first portion, before the message:

Code: Select all

:[\d\w\s_-^`]+![\d\w\s_-]+@[\d\w\s_\.-]+ [\d\w\s#]+[:\d\w]+? [:+-]
Using this code (Sorry it's compressed, I typed it for my IRC bot)

Code: Select all

 
preg_match( "/:[\d\w\s_\-\^`]+![\d\w\s_\-\^`]+@[\d\w\s_\.-]+ [\d\w\s#]+[:\d\w]+? [:+-]/", implode(' ',$ex), $aTemp);
print_r($aTemp);
 
So, if it's possible to return the above list with a regex, what could I use? I'd like it to match:
:Nick!Host@name.here JOIN :#channel
:NickServ!Host@name.here NOTICE Nick :Password accepted - you are now recognized.
:Nick!Host@name.here MODE #channel +-modes To Apply
:Nick!Host@name.here TOPIC #nick :Message for new Topic

I appreciate your time. This has taken up much of the little time I have. ;) If you have any questions, concerns, or sanity concerns in general, just ask :P
Thank you.

Edit: Oh snap, got ahead of myself. Added a few more things xD
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: REGEX of IRC raw messages

Post by prometheuzz »

Something like this perhaps?

Code: Select all

#!/usr/bin/php
<?php
$tests = array(
  ":Nick!Host@name.here PRIVMSG #channel :Good morning world",
  ":Nick!Host@name.here JOIN :#channel",
  ":NickServ!Host@name.here NOTICE Nick :Password accepted - you are now recognized.",
  ":Nick!Host@name.here MODE #channel +-modes To Apply",
  ":Nick!Host@name.here TOPIC #nick :Message for new Topic",
);
 
$regex = '{
  :([^!]++)!
  ([^\s]++)\s++
  ([^\s]++)\s++
  :?+([^\s]++)\s*+
  (?:[:+-]++(.*+))? 
}x';
 
foreach($tests as $t) {
  if(preg_match($regex, $t, $matches)) {
    print_r($matches);
  } else {
    print "No match for: $t\n";
  }
}
?>
The last optional match (the message) will have to be split if you want all its words separated.
Note that the entry at index 0 of the $matches array is just the complete matched string, of course. The "real" matches start at index 1.
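
For the splitting step, here is a minimal sketch. It assumes the message text has already been pulled out of $matches[5]; the $message value below is only a made-up sample, and splitting on runs of whitespace is one reasonable choice rather than the only one.

```php
<?php
// Hypothetical message text, standing in for $matches[5].
$message = 'Good morning   world';

// Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces
// that leading or trailing spaces would otherwise produce.
$words = preg_split('/\s+/', $message, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

Every word ends up at its own index in $words, so it can be appended to or merged with the other captured pieces as needed.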
dab
Forum Newbie
Posts: 7
Joined: Tue May 20, 2008 1:09 am

Re: REGEX of IRC raw messages

Post by dab »

Cheers! I haven't had a chance to test it yet, but I'm going to assume it works. :P

While I'm away, could you, or anyone for that matter explain what each little bit does? I'm still learning regex (If you can't already tell)

Thanks so much for this though :D
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: REGEX of IRC raw messages

Post by prometheuzz »

dab wrote:Cheers! I haven't had a chance to test it yet, but I'm going to assume it works. :P

While I'm away, could you, or anyone for that matter explain what each little bit does? I'm still learning regex (If you can't already tell)

Thanks so much for this though :D
You know what, here's the output after running my code:

Code: Select all

Array
(
    [0] => :Nick!Host@name.here PRIVMSG #channel :Good morning world
    [1] => Nick
    [2] => Host@name.here
    [3] => PRIVMSG
    [4] => #channel
    [5] => Good morning world
)
Array
(
    [0] => :Nick!Host@name.here JOIN :#channel
    [1] => Nick
    [2] => Host@name.here
    [3] => JOIN
    [4] => #channel
)
Array
(
    [0] => :NickServ!Host@name.here NOTICE Nick :Password accepted - you are now recognized.
    [1] => NickServ
    [2] => Host@name.here
    [3] => NOTICE
    [4] => Nick
    [5] => Password accepted - you are now recognized.
)
Array
(
    [0] => :Nick!Host@name.here MODE #channel +-modes To Apply
    [1] => Nick
    [2] => Host@name.here
    [3] => MODE
    [4] => #channel
    [5] => modes To Apply
)
Array
(
    [0] => :Nick!Host@name.here TOPIC #nick :Message for new Topic
    [1] => Nick
    [2] => Host@name.here
    [3] => TOPIC
    [4] => #nick
    [5] => Message for new Topic
)
Once you confirm it's correct, I'll explain a bit about it; otherwise I'd be explaining stuff you're not going to use.
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: REGEX of IRC raw messages

Post by GeertDD »

@ prometheuzz: small tip. [^\s] can be written as \S. It makes things just a tad more readable in my opinion.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: REGEX of IRC raw messages

Post by prometheuzz »

GeertDD wrote:@ prometheuzz: small tip. [^\s] can be written as \S. It makes things just a tad more readable in my opinion.
Yeah, thanks, I keep forgetting those \D, \S and \W negation-classes.
dab
Forum Newbie
Posts: 7
Joined: Tue May 20, 2008 1:09 am

Re: REGEX of IRC raw messages

Post by dab »

aha, even after testing it myself, it works wonderfully. The words per array index was just something I assumed you had to do. This works great! :D

The main reason I'd hoped you'd explain it was so I could then fix it if it didn't quite work the way I needed ;)

Anywho, I'd love to see how this regex is broken down.

Edit: Also, I read it's possible to make the arrays use names for the index. Indices such as ['nick'] ['host'] etc. Would you mind showing me how to do that? I haven't looked into it yet, as I wanted to get a working regex working before I worried about it :P
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: REGEX of IRC raw messages

Post by GeertDD »

Use the (?P<nick>\S++) pattern for named captures.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: REGEX of IRC raw messages

Post by prometheuzz »

dab wrote:aha, even after testing it myself, it works wonderfully. The words per array index was just something I assumed you had to do. This works great! :D

The main reason I'd hoped you'd explain it was so I could then fix it if it didn't quite work the way I needed ;)

Anywho, I'd love to see how this regex is broken down.
Good to hear it. Here are some details:
First, I used a { ... }x notation to construct the regex. The braces are just
the delimiters; it is the x modifier after the closing brace that makes the
engine ignore whitespace and newline characters in your regex (outside
character classes), which lets you spread a regex over multiple lines. This is
especially handy when creating larger regexes; otherwise you would get one
large and ugly monster!

Also, I used quite a few possessive quantifiers in my regex for performance
reasons (and because otherwise Geert would become angry with me! ;)). For
simplicity I will not go into them, but I encourage you to do some reading on
them yourself [1].
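
As a quick taste, here is a tiny sketch of what "possessive" means. The pattern and input are made up for illustration and are not taken from the IRC regex: a possessive quantifier keeps everything it has matched and never backtracks.

```php
<?php
// Greedy: \d+ first grabs "12345", then gives one digit back so the
// literal 5 can still match. Result: 1 (match).
$greedy = preg_match('/^\d+5$/', '12345');

// Possessive: \d++ grabs "12345" and never gives anything back, so the
// literal 5 has nothing left to match. Result: 0 (no match).
$possessive = preg_match('/^\d++5$/', '12345');

var_dump($greedy, $possessive);
```

In the IRC regex nothing ever needs to be given back, which is why the possessive and the plain versions behave the same there.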

As Geert pointed out: [^\s] (which means any character except a white space
character) can be replaced by the shorter \S

So, the (slightly) simpler regex (without the possessive quantifiers and
\S instead of [^\s]) now looks like this:

Code: Select all

$regex = '{
  :([^!]+)!
  (\S+)\s+
  (\S+)\s+
  :?(\S+)\s*
  (?:[:+-]+(.*))?
}x';
(test this new regex, you will see it produces the same output)
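
If you would rather let PHP do the comparing, something like the following sketch should work. It reuses the five test lines from earlier in the thread; the variable names ($possessive, $simple, $identical) are just picked for this example.

```php
<?php
$tests = array(
  ":Nick!Host@name.here PRIVMSG #channel :Good morning world",
  ":Nick!Host@name.here JOIN :#channel",
  ":NickServ!Host@name.here NOTICE Nick :Password accepted - you are now recognized.",
  ":Nick!Host@name.here MODE #channel +-modes To Apply",
  ":Nick!Host@name.here TOPIC #nick :Message for new Topic",
);

// The original regex, with possessive quantifiers and [^\s].
$possessive = '{
  :([^!]++)!
  ([^\s]++)\s++
  ([^\s]++)\s++
  :?+([^\s]++)\s*+
  (?:[:+-]++(.*+))?
}x';

// The simplified regex, with plain quantifiers and \S.
$simple = '{
  :([^!]+)!
  (\S+)\s+
  (\S+)\s+
  :?(\S+)\s*
  (?:[:+-]+(.*))?
}x';

// Run both regexes on every test line and compare the full $matches arrays.
$identical = true;
foreach ($tests as $t) {
  preg_match($possessive, $t, $a);
  preg_match($simple, $t, $b);
  if ($a !== $b) {
    $identical = false;
  }
}

var_dump($identical);
```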

You see me use quite a few parentheses. These are used to "group" characters
together and store them in the $matches array. After running the following
example:

Code: Select all

if(preg_match('/(.)(.)(.)/', 'abc', $matches)) {
  print_r($matches);
}
you will see that the $matches array will hold 4 values: index 0 will hold the
entire match and index 1=a, index 2=b and index 3=c.
Now to make a group (the stuff between the parenthesis) optional, you can add
a question mark after it like this:

Code: Select all

if(preg_match('/(.)(.)(.)?/', 'ab', $matches)) { // the third group is optional
  print_r($matches);
}
But in my IRC-regex I sometimes use a question mark followed by a colon right
after a group's opening parenthesis. This will cause the regex engine to NOT
add that group to the $matches array. To understand what I mean by that, run
the following snippet:

Code: Select all

if(preg_match('/(.)(?:.)(.)/', 'abc', $matches)) {
  print_r($matches);
}
As you will notice, it has caused the 2nd character to be left out of the
$matches array.
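
To see both effects side by side, here is a small sketch built from the two snippets above. The variable names are only for illustration. One thing worth knowing: by default PHP does not add a trailing optional group to the array at all when it matched nothing, so check for it before reading it.

```php
<?php
// Optional trailing group that matches nothing: PHP leaves it out.
preg_match('/(.)(.)(.)?/', 'ab', $optional);
// $optional holds 'ab', 'a', 'b', so index 3 is not set.
$third = isset($optional[3]) ? $optional[3] : '';

// Non-capturing group: the middle character is matched but not stored.
preg_match('/(.)(?:.)(.)/', 'abc', $nonCapturing);
// $nonCapturing holds 'abc', 'a', 'c'.

print_r($optional);
print_r($nonCapturing);
```

This is also why the JOIN line in the output earlier has no index 5: the optional message group did not take part in the match.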

Now, to get back to the IRC-regex, here's a brief explanation. Note that I
used << and >> in my explanation to indicate the groups/matches.

Code: Select all

:([^!]+)!         // a ':' followed by << one or more non-'!' chars >> followed by a '!'
(\S+)\s+          // << one or more non-white spaces >> followed by one or more white spaces
(\S+)\s+          // the same as the above
:?(\S+)\s*        // an optional ':' followed by << one or more non-white spaces >> followed by
                  // zero or more white spaces
(?:[:+-]+(.*))?   // see below
 
 
// The last group I'll explain over a couple of lines:
(?:               // start group, but do not store in $matches!!! 
  [:+-]+          // one or more of the following chars: ':', '+' or '-'
                  // because of the preceding '?:' these will not be grouped
  (.*)            // << zero or more chars of any type >>
)?                // end of the group, AND this group is optional!!!
dab wrote:Edit: Also, I read it's possible to make the arrays use names for the index. Indices such as ['nick'] ['host'] etc. Would you mind showing me how to do that? I haven't looked into it yet, as I wanted to get a working regex working before I worried about it :P
As Geert already pointed out, you can do that by grouping your matches like this: (?P<name>group), which may look a bit confusing, so I'll give you a little demo:

Code: Select all

$regex = '{
  :          (?P<NICK>    ([^!]+) ) !
             (?P<HOST>    (\S+)   ) \s+
             (?P<ACTION>  (\S+)   ) \s+
  :?         (?P<CHANNEL> (\S+)   ) \s*
  (?: [:+-]+ (?P<MESSAGE> (.*)    ) )?
}x';
 
foreach($tests as $t) {
  if(preg_match($regex, $t, $matches)) {
    print "ACTION=" . $matches['ACTION'] . "\n";
  } 
}
Remember that the new lines and white spaces are ignored when constructing a regex like { ... }x

HTH


[1] http://www.regular-expressions.info/possessive.html
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: REGEX of IRC raw messages

Post by GeertDD »

Good explanation, prometheuzz!

One remark I would like to make about named captures: the parentheses of a named group are themselves capturing parentheses (of course). So you should not wrap the contents in another pair of parentheses; that only results in a redundant extra (numeric) capture.

Example: (?P<HOST>(\S+)) becomes (?P<HOST>\S+)
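
Applied to the whole regex, that could look like the sketch below. It is the named-capture regex from the post above with the inner parentheses dropped. The array_filter step is an extra assumption on my part: preg_match always returns a numeric copy of each named group, so the sketch filters on string keys to keep only the named entries.

```php
<?php
$regex = '{
  :          (?P<NICK>    [^!]+ ) !
             (?P<HOST>    \S+   ) \s+
             (?P<ACTION>  \S+   ) \s+
  :?         (?P<CHANNEL> \S+   ) \s*
  (?: [:+-]+ (?P<MESSAGE> .*    ) )?
}x';

$line = ':Nick!Host@name.here PRIVMSG #channel :Good morning world';

$named = array();
if (preg_match($regex, $line, $matches)) {
  // $matches holds every group twice (by name and by number);
  // keep only the entries whose key is a string.
  $named = array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
  print_r($named);
}
```

ARRAY_FILTER_USE_KEY needs PHP 5.6 or later; on older versions a plain foreach over $matches with is_string($key) does the same job.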
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: REGEX of IRC raw messages

Post by prometheuzz »

GeertDD wrote:Good explanation, prometheuzz!
Thanks!

GeertDD wrote:One remark I would like to make about the named captures is that the parentheses used, are capturing parentheses (of course). So, you should not rewrap things inside them. Those only result in yet another double (numeric) capture.

Example: (?P<HOST>(\S+)) becomes (?P<HOST>\S+)
Ah, yes, that makes sense (I didn't test that last regex very thoroughly).
Thanks Geert.
dab
Forum Newbie
Posts: 7
Joined: Tue May 20, 2008 1:09 am

Re: REGEX of IRC raw messages

Post by dab »

Ok. I spent some time playing with your regex. I placed it in my IRC bot's code, and noticed a decrease of 30% in processor usage! That's Excellent news!

I noticed that using smilies as my first word ie( : D) would make the : disappear. So I fixed it by removing the + outside the [:+-]. Hopefully that might do something. Although... now that I think about it, it will ruin the modes for when you add and subtract modes. Hmm Any idea how to fix that? :+- are removed completely when used first in a message such as:

:::+-+-:+-+Hello world
That turns out to be
hello world.

Hmm hopefully you could provide an answer. :D

Again, thank you guys so much for your time, and knowledge. :)
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: REGEX of IRC raw messages

Post by prometheuzz »

dab wrote:Ok. I spent some time playing with your regex. I placed it in my IRC bot's code, and noticed a decrease of 30% in processor usage! That's Excellent news!

I noticed that using smilies as my first word ie( : D) would make the : disappear. So I fixed it by removing the + outside the [:+-]. Hopefully that might do something. Although... now that I think about it, it will ruin the modes for when you add and subtract modes. Hmm Any idea how to fix that? :+- are removed completely when used first in a message such as:

:::+-+-:+-+Hello world
That turns out to be
hello world.

Hmm hopefully you could provide an answer. :D

Again, thank you guys so much for your time, and knowledge. :)
No problem. Try replacing the last line of the regex with this one:

Code: Select all

(?: (?::|[+-]+) (?P<MESSAGE> .*  ) )?
If that does not result in the desired output, you will need to give some examples of what you mean exactly.

HTH
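
To check it against the smiley case, here is a rough sketch with that line dropped into the named-capture regex. The parse_irc_line() helper and the sample lines are made up for this example (I am assuming the smiley was typed as ":D"), so adjust them to whatever your bot really receives.

```php
<?php
// Hypothetical helper (not part of the bot): parse one raw IRC line
// into named parts, or return null if the line does not match.
function parse_irc_line($line)
{
  $regex = '{
    :          (?P<NICK>    [^!]+ ) !
               (?P<HOST>    \S+   ) \s+
               (?P<ACTION>  \S+   ) \s+
    :?         (?P<CHANNEL> \S+   ) \s*
    (?: (?::|[+-]+) (?P<MESSAGE> .*  ) )?
  }x';

  if (!preg_match($regex, $line, $m)) {
    return null;
  }

  return array(
    'NICK'    => $m['NICK'],
    'HOST'    => $m['HOST'],
    'ACTION'  => $m['ACTION'],
    'CHANNEL' => $m['CHANNEL'],
    // The message group is optional, so it may be missing entirely.
    'MESSAGE' => isset($m['MESSAGE']) ? $m['MESSAGE'] : '',
  );
}

// Only the single ':' separator is stripped, so the smiley survives.
$smiley = parse_irc_line(':Nick!Host@name.here PRIVMSG #channel ::D');
// A run of mode signs is still stripped, as before.
$mode   = parse_irc_line(':Nick!Host@name.here MODE #channel +-modes To Apply');
// No message part at all.
$join   = parse_irc_line(':Nick!Host@name.here JOIN :#channel');

print $smiley['MESSAGE'] . "\n";
print $mode['MESSAGE'] . "\n";
var_dump($join['MESSAGE']);
```

The difference from the old line is that (?::|[+-]+) takes either exactly one ':' or a run of '+' and '-' characters, instead of swallowing any mix of all three.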
dab
Forum Newbie
Posts: 7
Joined: Tue May 20, 2008 1:09 am

Re: REGEX of IRC raw messages

Post by dab »

Great! That worked wonderfully!

Thank you guys so much. We sped up the bots, and I learned a bit more about Regex. I'd say this mission was a success.

Thanks. :D
deshuz747
Forum Newbie
Posts: 1
Joined: Thu May 22, 2008 9:35 am

Re: REGEX of IRC raw messages

Post by deshuz747 »

Has anybody made a regex that recognizes, from the user agent, which OS a request comes from? For example, this user agent:

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_2; en-us) AppleWebKit/525.13 (KHTML, like Gecko) Version/3.1 Safari/525.13

I need a regular expression for that kind of user agent.

Please help me.