Nested regex, 0 or more times

batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

Nested regex, 0 or more times

Post by batfastad »

Hi everyone

I'm writing a regular expression to parse page numbers and page ranges.
A user can enter page numbers in the following formats:
45
45,47
45-48
45,47-52
45-48,56,59-62,98-102 etc...
So either a single page, a range of pages, pages separated by a comma or a combination.

I've written the following regex, which matches the first three cases:

Code:

[0-9]{1,3}(-|,)?[0-9]{0,3}
But I'm having trouble extending this to deal with the more complex examples.
Obviously what I could do is just add the (-|,)?[0-9]{0,3} section multiple times after the initial regex, to deal with the multiple cases in the 4th and 5th examples.

But I'm hoping there's an easier way I'm missing out on. Is it valid to nest regular expression statements?

Just tried something like this...

Code:

[0-9]{1,3}((-|,){1}[0-9]{1,3})*
So I've put the second half of the regex inside brackets () and applied the zero-or-more quantifier * to that entire bracketed section.
It seems to work in Regex Coach (excellent utility BTW), but I wasn't sure if I was missing something that could improve/shorten the regex?

Cheers, B
mintedjo
Forum Contributor
Posts: 153
Joined: Wed Nov 19, 2008 6:23 am

Re: Nested regex, 0 or more times

Post by mintedjo »

I'm sure you've already thought about these but I'll mention them anyway.
Your regex also matches "45-46-47-48,extraletters".
You can use '^' at the start and '$' at the end to make sure any additional characters at the start or end are not allowed.
To make it shorter you could use '\d' instead of '[0-9]'.
In '(-|,){1}' the '{1}' is completely pointless. Without it the engine just matches it once anyway.
If you don't need to capture the comma or hyphen then '(-|,)' is the same as a character class containing '-' and ',', that is: '[-,]'

Do you want to allow strings like "41-42-43"?
If not you will need to change it a bit. But if they are acceptable then I don't see any problems with your regex.

I'm guessing as it's user input you are counting on the regex to validate the input, otherwise you would just use split wouldn't you? Once on commas and then split those strings on hyphens.
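
To illustrate the points about anchors and the character class, here is a minimal PHP sketch. The variable names are made up for the example, and the '/' delimiters are only there because preg_match() requires them.

```php
<?php
// Illustrative only: variable names are not from the original posts.
$unanchored = '/[0-9]{1,3}((-|,){1}[0-9]{1,3})*/';
$anchored   = '/^\d{1,3}([-,]\d{1,3})*$/';

$input = '45-46-47-48,extraletters';

// Without anchors the engine only has to find a matching substring.
var_dump(preg_match($unanchored, $input)); // int(1)

// With ^ and $ the whole string has to fit the pattern.
var_dump(preg_match($anchored, $input));   // int(0)

// The anchored version still accepts chained hyphens such as "41-42-43".
var_dump(preg_match($anchored, '41-42-43')); // int(1)
```

The last call shows why anchoring alone does not settle the "41-42-43" question raised above.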
batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

Re: Nested regex, 0 or more times

Post by batfastad »

Ah yes, you're quite right! I've changed it to:

Code:

^\d{1,3}((-|,)\d{1,3})*$
But is there a way to stop 45-48-69 with this regex?

EDIT: I've found a way to stop that by testing against another regex with preg_match_all()

Code:

-\d{1,3}-
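
A rough sketch of how the two checks might be combined, assuming the first regex is tested with preg_match() and the second with preg_match_all(). The function name is hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper combining the two checks described in this post.
function passes_two_step_check($input)
{
    // Step 1: overall shape (1-3 digit numbers separated by hyphens or commas).
    if (preg_match('/^\d{1,3}((-|,)\d{1,3})*$/', $input) !== 1) {
        return false;
    }
    // Step 2: reject chained ranges such as 45-48-69,
    // i.e. any number that has a hyphen on both sides.
    return preg_match_all('/-\d{1,3}-/', $input, $matches) === 0;
}

var_dump(passes_two_step_check('45-48,56,59-62')); // bool(true)
var_dump(passes_two_step_check('45-48-69'));       // bool(false)
```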
Also here's my PHP code which converts that comma/hyphen separated page string into an array of the page numbers the user has chosen:

Code:

$pages = array();

if (strlen($var) > 0 and is_numeric($var) and strpos($var, '-') === FALSE and strpos($var, ',') === FALSE) {
    // SINGLE PAGE
    $pages[] = (int) $var;

} elseif (strlen($var) > 0 and (strpos($var, '-') !== FALSE or strpos($var, ',') !== FALSE)) {
    // MULTIPLE PAGES

    $var_exp = explode(',', $var);

    // LOOP THROUGH ARRAY
    foreach ($var_exp as $val) {
        if (substr_count($val, '-') == 1) {
            // RANGE, E.G. 45-48: ADD EVERY PAGE FROM START TO END
            $val_exp = explode('-', $val);
            $p = (int) $val_exp[0];
            while ($p <= (int) $val_exp[1]) {
                $pages[] = $p;
                $p++;
            }
        } elseif (substr_count($val, '-') == 0) {
            // SINGLE PAGE WITHIN THE LIST
            $pages[] = (int) $val;
        } else {
            // ERROR - INVALID RANGE
        }
    }
    // SORT THE EXPANDED PAGE NUMBERS NUMERICALLY
    sort($pages);
} else {
    // ERROR
}
Seems to be working ok so far but needs some major improving which I'm working on. Pretty terrible code. Hope this helps someone out though!

Cheers, Ben
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Nested regex, 0 or more times

Post by prometheuzz »

batfastad wrote: Ah yes, you're quite right! I've changed it to:

Code:

^\d{1,3}((-|,)\d{1,3})*$
But is there a way to stop 45-48-69 with this regex?
...
Yes, something like this (untested!):

Code:

^\d+(-\d+)?(,\d+(-\d+)?)*$
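
Since the pattern is posted as untested, here is a small PHP sketch that runs it against the examples from this thread plus a few malformed strings. The helper name is hypothetical. Note that \d+ accepts numbers of any length; swap in \d{1,3} to keep the original three-digit limit.

```php
<?php
// Hypothetical helper; the pattern is the one posted above.
function is_valid_page_list($input)
{
    return preg_match('/^\d+(-\d+)?(,\d+(-\d+)?)*$/', $input) === 1;
}

$samples = array(
    '45', '45,47', '45-48', '45,47-52', '45-48,56,59-62,98-102', // expected: valid
    '45-48-69', '45,', '-45', '',                                // expected: invalid
);

foreach ($samples as $s) {
    printf("%-24s %s\n", "'" . $s . "'", is_valid_page_list($s) ? 'valid' : 'invalid');
}
```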
batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

Re: Nested regex, 0 or more times

Post by batfastad »

Seems to work perfectly!!
Thanks for the help, will be trying to analyse that to see how it works

Cheers, B
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Nested regex, 0 or more times

Post by prometheuzz »

batfastad wrote: Seems to work perfectly!!
Thanks for the help, will be trying to analyse that to see how it works

Cheers, B
No problem B.

Think about your problem like this:
- you have a list of valid atoms.
- a list can have one or more atoms in it.
- the atoms are separated by a comma.

Combining these three observations, we would have this:

Code:

pattern : atom
        | atom,atom
        | atom,atom,atom
        | ...
in other words:

Code:

pattern : atom(,atom)*
Now, your atoms can either be a number or a range where a range equals "number-number". So, that would look like:

Code:

atom : number
     | number-number

In short, this is:

Code:

atom : number(-number)?
And to combine everything:

Code:

pattern : number(-number)?(,number(-number)?)*
and in regex, this is:

Code:

\d+(-\d+)?(,\d+(-\d+)?)*
That's all there is to it!
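
The same derivation can be sketched in PHP by building the pattern from its pieces. The variable names simply mirror the grammar above and are illustrative. The ^ and $ anchors and the '/' delimiters are an addition here, on the assumption that the pattern is used to validate a whole string with preg_match().

```php
<?php
// Variable names mirror the grammar above; they are illustrative only.
$number  = '\d+';
$atom    = $number . '(-' . $number . ')?';   // number(-number)?
$pattern = $atom . '(,' . $atom . ')*';       // atom(,atom)*

// Assumption: anchors are added so the entire input must match.
$regex = '/^' . $pattern . '$/';

echo $pattern, "\n"; // \d+(-\d+)?(,\d+(-\d+)?)*

var_dump(preg_match($regex, '45-48,56,59-62,98-102')); // int(1)
var_dump(preg_match($regex, '45-48-69'));              // int(0)
```

Building the pattern this way keeps each grammar rule visible, so changing what counts as a number (for example \d{1,3}) only touches one line.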