
REGEX of IRC raw messages

Posted: Tue May 20, 2008 1:18 am
by dab
Hey there. I tried a quick search of this site (as well as two days of searching Google and testing things myself) to figure out how to regex an IRC message.

Before I ask you lovely community what regex I might be able to use, I would like to know one thing.
Is there a way to return everything between certain symbols?

:Nick!Host@name.here PRIVMSG #channel :Good morning world
I would LIKE to easily break that above string into the pieces:
Of course, the message portion would be an unnumbered list of items. I'm hoping something like this IS possible. I've tried this regex, but it returns the first portion before the message:

Code:

:[\d\w\s_-^`]+![\d\w\s_-]+@[\d\w\s_\.-]+ [\d\w\s#]+[:\d\w]+? [:+-]
Using this code (Sorry it's compressed, I typed it for my IRC bot)

Code:

 
preg_match( "/:[\d\w\s_\-\^`]+![\d\w\s_\-\^`]+@[\d\w\s_\.-]+ [\d\w\s#]+[:\d\w]+? [:+-]/", implode(' ',$ex), $aTemp);
print_r($aTemp);
 
So, if it's possible to return the above list with a regex, what could I use? I'd like it to match:
:Nick!Host@name.here JOIN :#channel
:NickServ!Host@name.here NOTICE Nick :Password accepted - you are now recognized.
:Nick!Host@name.here MODE #channel +-modes To Apply
:Nick!Host@name.here TOPIC #nick :Message for new Topic

I appreciate your time. This has taken up much of the little time I have. ;) If you have any questions, concerns, or sanity concerns in general, just ask :P
Thank you.

Edit: Oh snap, got ahead of myself. Added a few more things xD

Re: REGEX of IRC raw messages

Posted: Tue May 20, 2008 2:31 am
by prometheuzz
Something like this perhaps?

Code:

#!/usr/bin/php
<?php
$tests = array(
  ":Nick!Host@name.here PRIVMSG #channel :Good morning world",
  ":Nick!Host@name.here JOIN :#channel",
  ":NickServ!Host@name.here NOTICE Nick :Password accepted - you are now recognized.",
  ":Nick!Host@name.here MODE #channel +-modes To Apply",
  ":Nick!Host@name.here TOPIC #nick :Message for new Topic",
);
 
$regex = '{
  :([^!]++)!
  ([^\s]++)\s++
  ([^\s]++)\s++
  :?+([^\s]++)\s*+
  (?:[:+-]++(.*+))? 
}x';
 
foreach($tests as $t) {
  if(preg_match($regex, $t, $matches)) {
    print_r($matches);
  } else {
    print "No match for: $t\n";
  }
}
?>
The last optional match (the message) will have to be split if you want each word as a separate item.
Note that index 0 of the $matches array just holds the complete matched string, of course. The "real" matches start at index 1.
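
A minimal sketch of that splitting step, assuming the message has already been captured. The sample string is made up and simply stands in for $matches[5]:

```php
<?php
// Made-up sample value standing in for $matches[5] from the regex above.
$message = "Good  morning world";

// Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces
// that leading or trailing whitespace would otherwise produce.
$words = preg_split('/\s+/', $message, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

explode(' ', $message) would also work when words are always separated by exactly one space; preg_split is just a bit more forgiving.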

Re: REGEX of IRC raw messages

Posted: Tue May 20, 2008 8:52 am
by dab
Cheers! I haven't had a chance to test it yet, but I'm going to assume it works. :P

While I'm away, could you, or anyone for that matter explain what each little bit does? I'm still learning regex (If you can't already tell)

Thanks so much for this though :D

Re: REGEX of IRC raw messages

Posted: Tue May 20, 2008 9:00 am
by prometheuzz
dab wrote:Cheers! I haven't had a chance to test it yet, but I'm going to assume it works. :P

While I'm away, could you, or anyone for that matter explain what each little bit does? I'm still learning regex (If you can't already tell)

Thanks so much for this though :D
You know what, here's the output after running my code:

Code:

Array
(
    [0] => :Nick!Host@name.here PRIVMSG #channel :Good morning world
    [1] => Nick
    [2] => Host@name.here
    [3] => PRIVMSG
    [4] => #channel
    [5] => Good morning world
)
Array
(
    [0] => :Nick!Host@name.here JOIN :#channel
    [1] => Nick
    [2] => Host@name.here
    [3] => JOIN
    [4] => #channel
)
Array
(
    [0] => :NickServ!Host@name.here NOTICE Nick :Password accepted - you are now recognized.
    [1] => NickServ
    [2] => Host@name.here
    [3] => NOTICE
    [4] => Nick
    [5] => Password accepted - you are now recognized.
)
Array
(
    [0] => :Nick!Host@name.here MODE #channel +-modes To Apply
    [1] => Nick
    [2] => Host@name.here
    [3] => MODE
    [4] => #channel
    [5] => modes To Apply
)
Array
(
    [0] => :Nick!Host@name.here TOPIC #nick :Message for new Topic
    [1] => Nick
    [2] => Host@name.here
    [3] => TOPIC
    [4] => #nick
    [5] => Message for new Topic
)
Once you confirm it's correct, I'll explain a bit about it; otherwise I'd be explaining stuff you're not going to use.

Re: REGEX of IRC raw messages

Posted: Tue May 20, 2008 9:02 am
by GeertDD
@ prometheuzz: small tip. [^\s] can be written as \S. It makes things just a tad more readable in my opinion.

Re: REGEX of IRC raw messages

Posted: Tue May 20, 2008 9:27 am
by prometheuzz
GeertDD wrote:@ prometheuzz: small tip. [^\s] can be written as \S. It makes things just a tad more readable in my opinion.
Yeah, thanks, I keep forgetting those \D, \S and \W negation-classes.
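
A quick way to convince yourself that the two spellings behave the same, sketched on one of the sample lines from this thread (the variable names are only for illustration):

```php
<?php
$line = ":Nick!Host@name.here PRIVMSG #channel :Good morning world";

// Collect every run of non-whitespace characters, once per spelling.
preg_match_all('/[^\s]+/', $line, $long);
preg_match_all('/\S+/', $line, $short);

// Both patterns should yield the same list of tokens.
var_dump($long[0] === $short[0]);
```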

Re: REGEX of IRC raw messages

Posted: Tue May 20, 2008 5:45 pm
by dab
aha, even after testing it myself, it works wonderfully. The words per array index was just something I assumed you had to do. This works great! :D

The main reason I'd hoped you'd explain it was so I could fix it myself if it didn't quite work the way I needed ;)

Anywho, I'd love to see how this regex is broken down.

Edit: Also, I read it's possible to make the arrays use names for the index. Indices such as ['nick'] ['host'] etc. Would you mind showing me how to do that? I haven't looked into it yet, as I wanted to get a working regex working before I worried about it :P

Re: REGEX of IRC raw messages

Posted: Tue May 20, 2008 11:41 pm
by GeertDD
Use the (?P<nick>\S++) pattern for named captures.

Re: REGEX of IRC raw messages

Posted: Wed May 21, 2008 4:22 am
by prometheuzz
dab wrote:aha, even after testing it myself, it works wonderfully. The words per array index was just something I assumed you had to do. This works great! :D

The main reason I'd hoped you'd explain it was so I could fix it myself if it didn't quite work the way I needed ;)

Anywho, I'd love to see how this regex is broken down.
Good to hear it. Here are some details:
First, I used a { ... }x notation to construct the regex. The x modifier makes
the engine ignore whitespace and newline characters in your regex (outside
character classes), which lets you write a regex over multiple lines. This is
especially handy when creating larger regexes, otherwise you would get one large
and ugly monster!
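
To illustrate, here is a small sketch comparing a one-line pattern with the same pattern written in x mode. It is a shortened version of the IRC regex, and the variable names are made up for this example:

```php
<?php
$line = ":Nick!Host@name.here PRIVMSG #channel :Good morning world";

// The same pattern twice: once on a single line, once spread out in x mode.
$compact  = '/:([^!]+)!(\S+)\s+(\S+)/';
$extended = '{
  : ([^!]+) !    # nick: everything between the leading colon and the bang
  (\S+) \s+      # user and host
  (\S+)          # command, e.g. PRIVMSG
}x';

preg_match($compact, $line, $a);
preg_match($extended, $line, $b);

var_dump($a === $b); // both should capture the same pieces
```

Two things to keep in mind in x mode: a literal space has to be written as \s, "\ " or [ ], and an unescaped # outside a character class starts a comment, so a literal # (as in channel names) needs to be written as \# or [#].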

Also, I used quite a few possessive quantifiers in my regex for performance
reasons (and because otherwise Geert would become angry with me! ;)). For
simplicity I will not go into them, but I encourage you to do some reading on
them yourself [1].
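
For the curious, a tiny sketch of what "possessive" means in practice. This is not taken from the IRC regex, just a made-up pattern chosen to show the difference in backtracking:

```php
<?php
$subject = '12345';

// Greedy: \d+ first grabs all five digits, then gives one back so the
// literal 5 can still match.
$greedy = preg_match('/^\d+5$/', $subject);

// Possessive: \d++ grabs all five digits and never gives any back,
// so there is nothing left for the literal 5 and the match fails.
$possessive = preg_match('/^\d++5$/', $subject);

var_dump($greedy, $possessive); // int(1), int(0)
```

In the IRC regex the possessive versions match the same strings as the plain ones; they only tell the engine not to bother retrying shorter matches.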

As Geert pointed out: [^\s] (which means any character except a white space
character) can be replaced by the shorter \S

So, the (slightly) simpler regex (without the possessive quantifiers and
\S instead of [^\s]) now looks like this:

Code:

$regex = '{
  :([^!]+)!
  (\S+)\s+
  (\S+)\s+
  :?(\S+)\s*
  (?:[:+-]+(.*))?
}x';
(test this new regex, you will see it produces the same output)

You see me use quite a few parentheses. These are used to "group" characters
together and store them in the $matches array. After running the following
example:

Code:

if(preg_match('/(.)(.)(.)/', 'abc', $matches)) {
  print_r($matches);
}
you will see that the $matches array will hold 4 values: index 0 will hold the
entire match and index 1=a, index 2=b and index 3=c.
Now to make a group (the stuff between the parentheses) optional, you can add
a question mark after it like this:

Code:

if(preg_match('/(.)(.)(.)?/', 'ab', $matches)) { // the third group is optional
  print_r($matches);
}
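
One detail worth knowing here, shown as a small sketch: when a trailing optional group does not take part in the match, preg_match simply leaves its index out of $matches. That is why the JOIN line in the output above has no index 5. The isset() check is just one possible way to handle it:

```php
<?php
preg_match('/(.)(.)(.)?/', 'ab', $matches);

// Only indexes 0, 1 and 2 exist; the optional third group matched nothing.
print_r($matches);

// Guard against the missing index before using it.
$third = isset($matches[3]) ? $matches[3] : null;
var_dump($third); // NULL
```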
But in my IRC-regex I sometimes use a question mark followed by a colon right
after the opening parenthesis of a group. This will cause the regex engine to
NOT add that group to the $matches array. To understand what I mean by that,
run the following snippet:

Code:

if(preg_match('/(.)(?:.)(.)/', 'abc', $matches)) {
  print_r($matches);
}
As you can see, the 2nd character has been left out of the $matches
array.

Now, to get back to the IRC-regex, here's a brief explanation. Note that I
used << and >> in my explanation to indicate the groups/matches.

Code:

:([^!]+)!         // a ':' followed by << one or more non-'!' chars >> followed by a '!'
(\S+)\s+          // << one or more non-white spaces >> followed by one or more white spaces
(\S+)\s+          // the same as the above
:?(\S+)\s*        // an optional ':' followed by << one or more non-white spaces >> followed by 
                  // zero or more white spaces
(?:[:+-]+(.*))?   // see below
 
 
// The last group I'll explain over a couple of lines:
(?:               // start group, but do not store in $matches!!! 
  [:+-]+          // one or more of the following chars: ':', '+' or '-'
                  // because of the preceding '?:' these will not be grouped
  (.*)            // << zero or more chars of any type >>
)?                // end of the group, AND this group is optional!!!
dab wrote:Edit: Also, I read it's possible to make the arrays use names for the index. Indices such as ['nick'] ['host'] etc. Would you mind showing me how to do that? I haven't looked into it yet, as I wanted to get a working regex working before I worried about it :P
As Geert already pointed out, you can do that by grouping your matches like this: (?P<name>group), which may look a bit confusing, so I'll give you a little demo:

Code:

$regex = '{
  :          (?P<NICK>    ([^!]+) ) !
             (?P<HOST>    (\S+)   ) \s+
             (?P<ACTION>  (\S+)   ) \s+
  :?         (?P<CHANNEL> (\S+)   ) \s*
  (?: [:+-]+ (?P<MESSAGE> (.*)    ) )?
}x';
 
foreach($tests as $t) {
  if(preg_match($regex, $t, $matches)) {
    print "ACTION=" . $matches['ACTION'] . "\n";
  } 
}
Remember that the new lines and white spaces are ignored when constructing a regex like { ... }x

HTH


[1] http://www.regular-expressions.info/possessive.html

Re: REGEX of IRC raw messages

Posted: Wed May 21, 2008 7:15 am
by GeertDD
Good explanation, prometheuzz!

One remark about the named captures: the parentheses of a named group are themselves capturing parentheses (of course). So you should not wrap the contents in another pair; that only results in a duplicate (numeric) capture.

Example: (?P<HOST>(\S+)) becomes (?P<HOST>\S+)

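Applied to the demo regex, that could look roughly like the sketch below. Note that $matches still contains a numeric copy of every named group, so the loop that keeps only the string keys is an optional extra, and $named is a made-up variable name:

```php
<?php
$regex = '{
  :          (?P<NICK>    [^!]+ ) !
             (?P<HOST>    \S+   ) \s+
             (?P<ACTION>  \S+   ) \s+
  :?         (?P<CHANNEL> \S+   ) \s*
  (?: [:+-]+ (?P<MESSAGE> .*    ) )?
}x';

$line = ":Nick!Host@name.here TOPIC #nick :Message for new Topic";

$named = array();
if (preg_match($regex, $line, $matches)) {
  // Keep only the named entries and drop the numeric duplicates.
  foreach ($matches as $key => $value) {
    if (is_string($key)) {
      $named[$key] = $value;
    }
  }
}

print_r($named);
```
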
Re: REGEX of IRC raw messages

Posted: Wed May 21, 2008 7:23 am
by prometheuzz
GeertDD wrote:Good explanation, prometheuzz!
Thanks!

GeertDD wrote:One remark about the named captures: the parentheses of a named group are themselves capturing parentheses (of course). So you should not wrap the contents in another pair; that only results in a duplicate (numeric) capture.

Example: (?P<HOST>(\S+)) becomes (?P<HOST>\S+)
Ah, yes, that makes sense (I didn't test that last regex very thoroughly).
Thanks Geert.

Re: REGEX of IRC raw messages

Posted: Thu May 22, 2008 1:47 am
by dab
Ok. I spent some time playing with your regex. I placed it in my IRC bot's code, and noticed a decrease of 30% in processor usage! That's Excellent news!

I noticed that using a smiley as my first word, i.e. ( : D), would make the : disappear. So I fixed it by removing the + outside the [:+-]. Hopefully that might do something. Although... now that I think about it, it will ruin the modes for when you add and subtract modes. Hmm. Any idea how to fix that? :+- are removed completely when used first in a message such as:

:::+-+-:+-+Hello world
That turns out to be
hello world.

Hmm hopefully you could provide an answer. :D

Again, thank you guys so much for your time, and knowledge. :)

Re: REGEX of IRC raw messages

Posted: Thu May 22, 2008 2:18 am
by prometheuzz
dab wrote:Ok. I spent some time playing with your regex. I placed it in my IRC bot's code, and noticed a decrease of 30% in processor usage! That's Excellent news!

I noticed that using a smiley as my first word, i.e. ( : D), would make the : disappear. So I fixed it by removing the + outside the [:+-]. Hopefully that might do something. Although... now that I think about it, it will ruin the modes for when you add and subtract modes. Hmm. Any idea how to fix that? :+- are removed completely when used first in a message such as:

:::+-+-:+-+Hello world
That turns out to be
hello world.

Hmm hopefully you could provide an answer. :D

Again, thank you guys so much for your time, and knowledge. :)
No problem. Try replacing the last line of the regex with this one:

Code:

(?: (?::|[+-]+) (?P<MESSAGE> .*  ) )?
If that does not result in the desired output, you will need to give some examples of what you mean exactly.
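
For reference, a sketch of the complete named-capture regex with that last line swapped in. The helper irc_message() and the two test lines are made up for illustration (one message starting with a smiley, one MODE line):

```php
<?php
// Hypothetical helper: returns the MESSAGE capture, or null if there is none.
function irc_message($line) {
  $regex = '{
    :          (?P<NICK>    [^!]+ ) !
               (?P<HOST>    \S+   ) \s+
               (?P<ACTION>  \S+   ) \s+
    :?         (?P<CHANNEL> \S+   ) \s*
    (?: (?::|[+-]+) (?P<MESSAGE> .* ) )?
  }x';
  if (!preg_match($regex, $line, $matches)) {
    return null;
  }
  return isset($matches['MESSAGE']) ? $matches['MESSAGE'] : null;
}

// Only the single separating ':' is stripped, so the smiley survives.
echo irc_message(":Nick!Host@name.here PRIVMSG #channel ::D hello"), "\n";

// A run of + and - is still stripped in front of the modes.
echo irc_message(":Nick!Host@name.here MODE #channel +-modes To Apply"), "\n";
```

The idea is that the separator is now either exactly one ':' or a run of '+' and '-', instead of any mix of the three.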

HTH

Re: REGEX of IRC raw messages

Posted: Thu May 22, 2008 9:10 am
by dab
Great! That worked wonderfully!

Thank you guys so much. We sped up the bots, and I learned a bit more about Regex. I'd say this mission was a success.

Thanks. :D

Re: REGEX of IRC raw messages

Posted: Thu May 22, 2008 9:39 am
by deshuz747
Has anybody made a regex that recognizes, from the user agent, which OS a request comes from? For example, this user agent:

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_2; en-us) AppleWebKit/525.13 (KHTML, like Gecko) Version/3.1 Safari/525.13

i need a regular expression for that kind of user agent

please help me