Stuck on regex. Could use some help.

northstar7
Forum Newbie
Posts: 6
Joined: Tue Mar 23, 2010 7:15 pm


Post by northstar7 »

I'm trying to develop a regular expression that will parse the following strings.

$text ="[view:news_events_location_listing=page_9=2-miles=items:5]";
$text ="[view:news_events_location_listing=page_9=2-miles]";

The regex has to parse the two strings and load them into an array as follows:

Array [0][0] is [view:news_events_location_listing=page_9=2-miles=items:5]
Array [1][0] is news_events_location_listing
Array [2][0] is page_9
Array [3][0] is 2-miles
Array [4][0] is 5

This works fine for the first string:

preg_match_all("/\[view:([^=\]]+)=?([^=\]]+)?=?([^\]]*)?=items:=?([0-9])\]/i",$text, $match);

But I can't figure out how to write it so that it will produce the same output from the shorter $text string. I want to skip the Array [4][0] entirely in the case of the shorter string so I get this:

Array [0][0] is [view:news_events_location_listing=page_9=2-miles]
Array [1][0] is news_events_location_listing
Array [2][0] is page_9
Array [3][0] is 2-miles

I'd be very grateful for any help on this.

Thanx
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Stuck on regex. Could use some help.

Post by ridgerunner »

Try this one:


preg_match_all('/\[view:([^=\]]+)=([^=\]]+)=([^=\]]+)(?:=items:([0-9]+))?\]/i', $contents, $matches);
It matches the very limited test data you have provided. My hunch is that your actual data has more variability which will need to be accounted for.
:)
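
A quick way to check the regex above against both sample strings from the question. This is only a minimal sketch: the variable names are made up for illustration, and it uses preg_match (one match per string) alongside preg_match_all to show how each reports the optional fourth group.

```php
<?php
// Regex from the post above; the two sample strings come from the original question.
$re = '/\[view:([^=\]]+)=([^=\]]+)=([^=\]]+)(?:=items:([0-9]+))?\]/i';

$long  = '[view:news_events_location_listing=page_9=2-miles=items:5]';
$short = '[view:news_events_location_listing=page_9=2-miles]';

// preg_match drops trailing groups that did not take part in the match,
// so the short string yields indexes 0 through 3 only.
preg_match($re, $long, $m_long);
preg_match($re, $short, $m_short);

// preg_match_all (default PREG_PATTERN_ORDER) keeps a slot for every group,
// so the unmatched group 4 shows up as an empty string instead.
preg_match_all($re, $short, $all);

print_r($m_long);
print_r($m_short);
print_r($all);
```

So if the goal is to have no fourth entry at all for the shorter string, preg_match may be the closer fit, provided each input holds a single tag. With preg_match_all, testing $all[4][0] against the empty string does the same job.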
northstar7
Forum Newbie
Posts: 6
Joined: Tue Mar 23, 2010 7:15 pm

Re: Stuck on regex. Could use some help.

Post by northstar7 »

Hi. That's really fantastic. Thank you very much.

Actually, I think the incoming strings will be as uniform as I outlined.

I see you changed from this:

preg_match_all("/\[view:([^=\]]+)=?([^=\]]+)?=?([^\]]*)?=items:=?([0-9])\]/i",$text, $match);

to this:

preg_match_all('/\[view:([^=\]]+)=([^=\]]+)=([^=\]]+)(?:=items:([0-9]+))?\]/i', $text, $match);

I'd be grateful if you could explain the effective difference between what I had:

=items:=?([0-9])\]

and what you wrote:

:=items:([0-9]+))?\]

Thanks a lot.
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Stuck on regex. Could use some help.

Post by ridgerunner »

northstar7 wrote:... I'd be grateful if you could explain the effective difference between what I had:

=items:=?([0-9])\]

and what you wrote:

:=items:([0-9]+))?\]

Thanks a lot.
Actually, you cut off the beginning of the non-capturing group, the '(?', which I added around the last section to make it optional. To explain, here is the whole thing in free-spacing long form with comments...


$re = '/
    \[view:       # match opening literal text
    ( [^=\]]+ )   # capture "news_events_location_listing" into group 1
    =             # match literal =
    ( [^=\]]+ )   # capture "page_9" into group 2
    =             # match literal =
    ( [^=\]]+ )   # capture "2-miles" into group 3
    (?:           # begin non-capture group (to apply ? quantifier)
      =items:     # match literal beginning of items part
      ( [0-9]+ )  # capture items digit(s) into group 4
    )?            # end non-capture group and make it optional
    \]            # match closing literal text
    /ix';
preg_match_all($re, $contents, $matches);
Note that everything in the regex is required to match except for the '=items:5' towards the end. This part is made optional with the addition of the '?' zero-or-one quantifier immediately following the non-capturing parenthesis - i.e. '(?: ... )?'.

Your regex, '=items:=?([0-9])\]', has the '?' quantifier applied only to the second equals sign, which makes that one character optional. Everything else in that sub-expression is still required to match. Do you see the difference now?

Hope this helps :)
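
To see the difference in practice, the two patterns can be run side by side. This is a rough sketch, not part of the original posts: the variable names and the short '[view:a=b=c=items:12]' string are invented for the test.

```php
<?php
// Original tail: "=items:" and exactly one digit are mandatory;
// only the second "=" is optional.
$original = '/\[view:([^=\]]+)=?([^=\]]+)?=?([^\]]*)?=items:=?([0-9])\]/i';
// Revised tail: the whole "=items:N" unit sits inside (?: ... )? and may be absent.
$revised  = '/\[view:([^=\]]+)=([^=\]]+)=([^=\]]+)(?:=items:([0-9]+))?\]/i';

$short     = '[view:news_events_location_listing=page_9=2-miles]';
$twoDigits = '[view:a=b=c=items:12]'; // made-up input with a two-digit count

$origShort = preg_match($original, $short);          // no "=items:" present, so no match
$revShort  = preg_match($revised, $short);           // optional group is skipped
$origTwo   = preg_match($original, $twoDigits);      // [0-9] allows one digit only
$revTwo    = preg_match($revised, $twoDigits, $m);   // [0-9]+ takes both digits

var_dump($origShort, $revShort, $origTwo, $revTwo, $m[4]);
```

The second pair of calls shows the other change in the revised pattern: [0-9]+ accepts counts of more than one digit, which the single [0-9] does not.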
northstar7
Forum Newbie
Posts: 6
Joined: Tue Mar 23, 2010 7:15 pm

Re: Stuck on regex. Could use some help.

Post by northstar7 »

Yes, that helps a lot. I've looked at a lot of online tutorials and guides, but you provided the clearest explanation of what's going on in a regex that I've seen.

You should write a book.

Thanks again
northstar7
Forum Newbie
Posts: 6
Joined: Tue Mar 23, 2010 7:15 pm

Re: Stuck on regex. Could use some help.

Post by northstar7 »

It's surprising to me, but not to you, I'm sure, that the people I've been working with on this regex just emailed me about it.

To quote your earlier post:

"My hunch is that your actual data has more variability which will need to be accounted for."

And that's exactly what they said! I tried not to flame in my reply but I did ask how they expected a regex to work if they didn't fully define the range of input. It turns out that they want all of the backreferences to be optional -- not too bad -- but all I said about the incoming strings being uniform was completely wrong.

When I get the new information about the new strings I'll use your free-spacing long form and see how close I can get to a comprehensive regex. I have a suspicion I'll be back on here looking for your advice!

Thanks again
northstar7
Forum Newbie
Posts: 6
Joined: Tue Mar 23, 2010 7:15 pm

Re: Stuck on regex. Could use some help.

Post by northstar7 »

Hi. I don't know if anyone is still reading this thread, but I've got one last item to fit in.

Specifically, I have the use cases for this Regex:

1. [view:view_name]

2. [view:view_name=view_display]

3. [view:view_name=arguments]

4. [view:view_name=items:2]

5. [view:view_name=arguments=items:2]

6. [view:view_name=view_display=items:2]

7. [view:view_name=view_display=arguments]

8. [view:view_name=view_display=arguments=items:2]

For each of them I need to capture each phrase after the first "[view:" into a backreference. The different phrases are separated by = signs. As before, for the items:2 phrase I need to backcapture (is that a word?) only the numeral.

I have been playing around with it and I came up with

\[view:([^=\]]+)?=?([^items=\]]+)??=?([^items=\]]+)?(?:=items:([0-9]+))?\]

The ^items needs to be added so that it's not captured when there are only two or three phrases to be captured. Unfortunately, using it the way I have it knocks out several of the phrases I want to keep. For example, this won't work with case 3, 5, 6, 7.

I thought that ?!items:^=\] might work, but it doesn't.

Do you have any thoughts about how I can do this?

Thanks again.
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Stuck on regex. Could use some help.

Post by ridgerunner »

It's getting a bit too complicated to handle with a single regex. Here's how I would handle it...


<?php
// regex to split data string
$re = '/
^\[view:view_name  # split on either starting literal text
|                  # or...
=                  # an equals sign param separator
|                  # or...
\]$                # the ending literal text
/x';

// test data (array of strings)
$data = array(
    '[view:view_name]',
    '[view:view_name=view_display]',
    '[view:view_name=arguments]',
    '[view:view_name=items:2]',
    '[view:view_name=arguments=items:2]',
    '[view:view_name=view_display=items:2]',
    '[view:view_name=view_display=arguments]',
    '[view:view_name=view_display=arguments=items:2]');
$ndata = count($data);
$results = array();
for ($i = 0; $i < $ndata; $i++) {
    // handle each input data string by splitting up
    $results = preg_split($re, $data[$i], -1, PREG_SPLIT_NO_EMPTY);
    $nresults = count($results);
    echo(sprintf("\nData set number %d has %d parameters:\n", $i + 1, $nresults));
    for ($j = 0; $j < $nresults; $j++) {
        // handle each parameter within this data string
        echo(sprintf("  param[%d] = \"%s\"", $j + 1, $results[$j]));
        if (preg_match('/^items:(\d+)$/', $results[$j], $matches)) {
            echo(sprintf(" (Note: this param has count = %d)\n", $matches[1]));
        } else {
            echo("\n");
        }
    }
}
?>
Hope this helps! :)
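
For anyone who wants the pieces in variables rather than echoed, here is the same split applied to a single string. It is a minimal sketch that reuses the pattern from the script above; $parts and $count are names chosen just for this example.

```php
<?php
// Same split pattern as the script above (free-spacing mode).
$re = '/
^\[view:view_name  # starting literal text
|                  # or...
=                  # an equals sign param separator
|                  # or...
\]$                # the ending literal text
/x';

// PREG_SPLIT_NO_EMPTY discards the empty pieces produced at the start and
// end of the string and between adjacent delimiters.
$parts = preg_split($re, '[view:view_name=arguments=items:2]', -1, PREG_SPLIT_NO_EMPTY);

// Look for an "items:N" piece and keep only the number.
$count = null;
foreach ($parts as $part) {
    if (preg_match('/^items:(\d+)$/', $part, $m)) {
        $count = (int) $m[1];
    }
}

print_r($parts);
var_dump($count);
```

The split leaves "items:2" intact as one piece, and the capture group in the second regex is what isolates the numeral.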
Last edited by ridgerunner on Sat Apr 17, 2010 5:43 pm, edited 1 time in total.
northstar7
Forum Newbie
Posts: 6
Joined: Tue Mar 23, 2010 7:15 pm

Re: Stuck on regex. Could use some help.

Post by northstar7 »

Hi, Ridgerunner. As usual, thanks for all your help.

I ran your script and got the following result (after throwing in a <br/>).

Data set number 1 has 0 parameters:
Data set number 2 has 1 parameters:
param[1] = "view_display" Data set number 3 has 1 parameters:
param[1] = "arguments" Data set number 4 has 1 parameters:
param[1] = "items:2" (Note: this param has count = 2) Data set number 5 has 2 parameters:
param[1] = "arguments" param[2] = "items:2" (Note: this param has count = 2) Data set number 6 has 2 parameters:
param[1] = "view_display" param[2] = "items:2" (Note: this param has count = 2) Data set number 7 has 2 parameters:
param[1] = "view_display" param[2] = "arguments" Data set number 8 has 3 parameters:
param[1] = "view_display" param[2] = "arguments" param[3] = "items:2" (Note: this param has count = 2)

It looks like just what I need except for one sticking point: in lines 5, 6, 7 and 9 of the output the result includes "items:2" when the desired result is "2". I looked at your code and I still have no idea how to stop the script from picking up "items:".

Sorry to keep coming back to you on this.

Thanks
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Stuck on regex. Could use some help.

Post by ridgerunner »

The previous script was designed to be run from the command line, but I guess you are running this through a web server. Here is another version that strips out the "items:" part (if it's there). The output is now formatted as HTML, so it should look OK in a browser... :)


<?php
// regex to split data string
$re = '/
^\[view:view_name  # split on either starting literal text
|                  # or...
=(?:items:)?       # an equals sign (with "items:" if there)
|                  # or...
\]$                # the ending literal text
/x';

// test data (array of strings)
$data = array(
    '[view:view_name]',
    '[view:view_name=view_display]',
    '[view:view_name=arguments]',
    '[view:view_name=items:2]',
    '[view:view_name=arguments=items:2]',
    '[view:view_name=view_display=items:2]',
    '[view:view_name=view_display=arguments]',
    '[view:view_name=view_display=arguments=items:2]');
$ndata = count($data);
$results = array();
echo("<html><head><title>test.php</title>\n" .
    "<style type=\"text/css\" media=\"all\">\n" .
    "\tbody {margin: 2em; color:#333; background:#DDB; font-family: monospace;}\n" .
    "\tdd {white-space: pre;}\n" .
    "</style></head><body>\n");
echo(sprintf("<h1>test.php - %d data sets</h1>\n", $ndata));
for ($i = 0; $i < $ndata; $i++) {
    // handle each input data string by splitting up
    $results = preg_split($re, $data[$i], -1, PREG_SPLIT_NO_EMPTY);
    $nresults = count($results);
    echo(sprintf("<dl>\t<dt>Data set number %d has %d parameters:</dt>\n", $i + 1, $nresults));
    echo(sprintf(    "\t<dd>string   = \"%s\"</dd>\n", $data[$i]));
    for ($j = 0; $j < $nresults; $j++) {
        // handle each parameter within this data string
        echo(sprintf("\t<dd>param[%d] = \"%s\"</dd>\n", $j + 1, $results[$j]));
    }
    echo("</dl>\n");
}
echo("</body></html>\n");
?>
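
Two things to watch with the split approach: the pattern hard-codes the literal text view_name, and once "items:" is stripped a lone "2" can no longer be told apart from an arguments value by position. One possible variation is sketched below. The function name parse_view_tag and its return shape are invented for illustration, and it assumes PHP 7.1 or later for the type hints.

```php
<?php
// Hypothetical helper (not from the thread): parse one [view:...] tag into
// its view name, the remaining "="-separated phrases, and the items count.
function parse_view_tag(string $tag): ?array
{
    // Strip the "[view:" and "]" wrapper; give up if the shape is wrong.
    if (!preg_match('/^\[view:([^\]]+)\]$/', $tag, $m)) {
        return null;
    }

    $parts = explode('=', $m[1]);
    $name  = array_shift($parts);   // first phrase is always the view name

    // If the last phrase is "items:N", keep only the number.
    $items = null;
    if ($parts && preg_match('/^items:(\d+)$/', end($parts), $im)) {
        $items = (int) $im[1];
        array_pop($parts);
    }

    return array('name' => $name, 'params' => $parts, 'items' => $items);
}

print_r(parse_view_tag('[view:view_name=view_display=arguments=items:2]'));
print_r(parse_view_tag('[view:view_name=items:2]'));
print_r(parse_view_tag('[view:view_name]'));
```

The remaining phrases are returned by position only. With a single phrase (cases 2 and 3 in the list) nothing in the string itself says whether it is a view_display or an arguments value, so that decision still has to be made by the calling code.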