Nested regex, 0 or more times

Posted: Fri Mar 13, 2009 6:25 am
by batfastad
Hi everyone

I'm writing a regular expression to parse page numbers and page ranges.
A user can enter page numbers in the following formats:
45
45,47
45-48
45,47-52
45-48,56,59-62,98-102 etc...
So either a single page, a range of pages, pages separated by a comma or a combination.

And I've written the following regex which matches the first 3 cases

Code:

[0-9]{1,3}(-|,)?[0-9]{0,3}
But I'm having trouble extending this to deal with the more complex examples.
Obviously what I could do is just add the (-|,)?[0-9]{0,3} section multiple times after the initial regex, to deal with the multiple cases in the 4th and 5th examples.

But I'm hoping there's an easier way I'm missing out on. Is it valid to nest regular expression statements?

Just tried something like this...

Code:

[0-9]{1,3}((-|,){1}[0-9]{1,3})*
So I've put the 2nd half of the regex within brackets () and applied the zero-or-more quantifier * to that entire bracketed section.
It seems to work in Regex Coach (excellent utility BTW), but I wasn't sure if I was missing something that could improve/shorten the regex?
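
For anyone who wants to check this outside Regex Coach, here is a minimal PHP sketch. The sample inputs are simply the formats listed at the top of this post, and the variable names are made up for illustration. It prints the text that the whole pattern matched for each input.

```php
<?php
// Pattern from the post: a number, then zero or more (separator + number) groups.
$pattern = '/[0-9]{1,3}((-|,){1}[0-9]{1,3})*/';

// Sample inputs copied from the formats listed in the question.
$samples = array('45', '45,47', '45-48', '45,47-52', '45-48,56,59-62,98-102');

$matched = array();
foreach ($samples as $input) {
    // $m[0] holds the text matched by the whole pattern.
    if (preg_match($pattern, $input, $m)) {
        $matched[$input] = $m[0];
        echo $input, ' => ', $m[0], "\n";
    }
}
```

If the group-plus-quantifier idea works as described, each input should come back whole.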

Cheers, B

Re: Nested regex, 0 or more times

Posted: Fri Mar 13, 2009 7:22 am
by mintedjo
I'm sure you've already thought about these but I'll mention them anyway.
Your regex also matches "45-46-47-48,extraletters", because without anchors it only has to find a match somewhere in the string.
You can use '^' at the start and '$' at the end to make sure any additional characters at the start or end are not allowed.
To make it shorter you could use '\d' instead of '[0-9]'.
In '(-|,){1}' the '{1}' is completely pointless. Without it the engine just matches it once anyway.
If you don't need to capture the comma or hyphen then '(-|,)' is the same as a character set containing '-' and ',', that is: '[-,]'
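
To illustrate the points above, here is a rough PHP comparison of the original pattern against a tidied version. The tidied pattern is just one way of applying these suggestions, and the variable names are hypothetical.

```php
<?php
// Original pattern, unanchored: it only has to find a match somewhere in the input.
$loose = '/[0-9]{1,3}((-|,){1}[0-9]{1,3})*/';

// Tidied pattern: anchored, \d for digits, a character class for the separator.
$strict = '/^\d{1,3}([-,]\d{1,3})*$/';

$input = '45-46-47-48,extraletters';

$looseResult  = preg_match($loose, $input);       // finds the leading "45-46-47-48"
$strictResult = preg_match($strict, $input);      // trailing text makes this fail
$okResult     = preg_match($strict, '45,47-52');  // a well-formed list still passes

var_dump($looseResult, $strictResult, $okResult);
```

Note that the tidied pattern still accepts chained hyphens such as "41-42-43".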

Do you want to allow strings like "41-42-43"?
If not you will need to change it a bit. But if they are acceptable then I don't see any problems with your regex.

I'm guessing that, as it's user input, you are counting on the regex to validate it; otherwise you would just use split, wouldn't you? Once on commas, and then split those strings on hyphens.

Re: Nested regex, 0 or more times

Posted: Fri Mar 13, 2009 8:01 am
by batfastad
Ah yes, you're quite right! I've changed it to:

Code:

^\d{1,3}((-|,)\d{1,3})*$
But is there a way to stop 45-48-69 with this regex?

EDIT: I've found a way to stop that by testing against another regex with preg_match_all()

Code:

-\d{1,3}-
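
As a sketch, the two checks could be wrapped up together like this. The function name `isValidPageString` is made up for illustration, and it uses preg_match() rather than preg_match_all() for the second test, since it only needs to know whether a chained range occurs at all.

```php
<?php
// Hypothetical helper combining the two checks described above.
function isValidPageString($input)
{
    // Step 1: overall shape, numbers separated by '-' or ','.
    if (!preg_match('/^\d{1,3}((-|,)\d{1,3})*$/', $input)) {
        return false;
    }
    // Step 2: reject a number with a hyphen on both sides, e.g. "45-48-69".
    if (preg_match('/-\d{1,3}-/', $input)) {
        return false;
    }
    return true;
}

var_dump(isValidPageString('45-48,56,59-62'));
var_dump(isValidPageString('45-48-69'));
```
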
Also here's my PHP code which converts that comma/hyphen separated page string into an array of the page numbers the user has chosen:

Code:

$pages = array();

if ( strlen($var) > 0 and is_numeric($var) and strpos($var, '-') === FALSE and strpos($var, ',') === FALSE) {
    // SINGLE PAGE
    $pages[] = (int) $var;

} elseif ( strlen($var) > 0 and (strpos($var, '-') !== FALSE or strpos($var, ',') !== FALSE)) {
    // MULTIPLE PAGES: split on commas, then expand each range

    $var_exp = explode(',', $var);

    // LOOP THROUGH ARRAY
    foreach ($var_exp as $val) {
        if ( substr_count($val, '-') == 1) {
            // RANGE, e.g. "45-48": add every page from start to end
            $val_exp = explode('-', $val);
            $p = (int) $val_exp[0];
            $last = (int) $val_exp[1];
            while ($p <= $last) {
                $pages[] = $p;
                $p++;
            }
        } elseif ( substr_count($val, '-') == 0) {
            // SINGLE PAGE WITHIN THE LIST
            $pages[] = (int) $val;
        } else {
            // ERROR - INVALID RANGE
        }
    }
    // Sort the expanded page numbers, not the raw comma-separated chunks
    sort($pages);
} else {
    // ERROR
}
Seems to be working ok so far but needs some major improving which I'm working on. Pretty terrible code. Hope this helps someone out though!
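
One possible tidy-up, offered as a sketch rather than a finished replacement: the expansion step can live in its own function. The name `expandPages` is hypothetical, and the function assumes the string has already been validated. It also drops duplicate pages, which the code above does not do.

```php
<?php
// Hypothetical helper: expands an already-validated page string into page numbers.
function expandPages($input)
{
    $pages = array();
    foreach (explode(',', $input) as $part) {
        // A single page "56" gives one bound; a range "45-48" gives two.
        $bounds = explode('-', $part);
        $start  = (int) $bounds[0];
        $end    = (int) $bounds[count($bounds) - 1];

        // A reversed range such as "48-45" adds nothing, as in the loop above.
        for ($p = $start; $p <= $end; $p++) {
            $pages[] = $p;
        }
    }
    // Remove duplicates, then sort numerically (sort() also reindexes the array).
    $pages = array_unique($pages);
    sort($pages);
    return $pages;
}

print_r(expandPages('45-48,56,59-62'));
```
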

Cheers, Ben

Re: Nested regex, 0 or more times

Posted: Sun Mar 15, 2009 6:20 am
by prometheuzz
batfastad wrote:Ah yes, you're quite right! I've changed it to:

Code:

^\d{1,3}((-|,)\d{1,3})*$
But is there a way to stop 45-48-69 with this regex?
...
Yes, something like this (untested!):

Code:

^\d+(-\d+)?(,\d+(-\d+)?)*$
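
Since the pattern is marked untested, here is a small PHP sketch that runs it against a few inputs. The function name `matchesPageList` and the sample strings are made up for illustration.

```php
<?php
// Hypothetical wrapper around the suggested pattern.
function matchesPageList($input)
{
    return preg_match('/^\d+(-\d+)?(,\d+(-\d+)?)*$/', $input) === 1;
}

// Sample inputs: the formats from the original question plus some bad ones.
$inputs = array('45', '45,47', '45-48', '45,47-52', '45-48,56,59-62,98-102',
                '45-48-69', '45,', '-45');

foreach ($inputs as $input) {
    echo $input, ': ', matchesPageList($input) ? 'valid' : 'invalid', "\n";
}
```

Note that \d+ accepts numbers of any length, whereas the earlier patterns limited pages to 1 to 3 digits with \d{1,3}.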

Re: Nested regex, 0 or more times

Posted: Mon Mar 16, 2009 5:19 am
by batfastad
Seems to work perfectly!!
Thanks for the help, will be trying to analyse that to see how it works

Cheers, B

Re: Nested regex, 0 or more times

Posted: Mon Mar 16, 2009 8:09 am
by prometheuzz
batfastad wrote:Seems to work perfectly!!
Thanks for the help, will be trying to analyse that to see how it works

Cheers, B
No problem B.

Think about your problem like this:
- you have a list of valid atoms.
- a list can have one or more atoms in it.
- the atoms are separated by a comma.

Combining these three observations, we would have this:

Code:

pattern : atom
        | atom,atom
        | atom,atom,atom
        | ...
in other words:

Code:

pattern : atom(,atom)*
Now, your atoms can be either a number or a range, where a range equals "number-number". So, that would look like:

Code:

atom : number
     | number-number

in short this is:

Code:

atom : number(-number)?
And to combine everything:

Code:

pattern : number(-number)?(,number(-number)?)*
and in regex, this is:

Code:

\d+(-\d+)?(,\d+(-\d+)?)*
That's all there is to it!
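
The same derivation can be mirrored in PHP by building the pattern from named pieces. This is only a sketch: the variable names are made up, and the `^`/`$` anchors and `/` delimiters are added so that preg_match() checks the whole input.

```php
<?php
// Build the pattern from named pieces, mirroring the grammar above.
$number  = '\d+';
$atom    = $number . '(-' . $number . ')?';   // atom    : number(-number)?
$pattern = $atom . '(,' . $atom . ')*';       // pattern : atom(,atom)*

// Add delimiters and anchors so the whole input has to match.
$regex = '/^' . $pattern . '$/';

echo $pattern, "\n";   // prints \d+(-\d+)?(,\d+(-\d+)?)*

var_dump(preg_match($regex, '45-48,56,59-62'));
var_dump(preg_match($regex, '45-48-69'));
```

Keeping the pieces separate makes it easy to change one rule, for example swapping `\d+` for `\d{1,3}`, without retyping the whole expression.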