matching dynamic text

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

SidewinderX
Forum Contributor
Posts: 407
Joined: Fri Jul 16, 2004 9:04 pm
Location: NY

matching dynamic text

Post by SidewinderX »

I'm trying to match text that can change. The data I want to match is inside an HTML tag:

Code: Select all

<span id="Stats_lbl1">5 days 12 hours 4 minutes</span>
This regex matches the text between the opening and closing tags:

Code: Select all

<?php
$content = '<span id="Stats_lbl1">5 days 12 hours 4 minutes</span>';
preg_match('#(?<=<span id="Stats_lbl1">).*?(?=</span>)#', $content, $stats);
?>
The problem is that I need the individual numbers so I can perform arithmetic on them. *Something* like the regex below works for the case above:

Code: Select all

preg_match('#(?<=<span id="Stats_lbl1">)(\d+)(.*?)(\d+)(.*?)(\d+)(.*?)(?=</span>)#', $content, $stats);
However, that assumes there are always three numbers separated by three strings, when in reality there are three cases that can happen:

Code: Select all

//Case 1
$content = '<span id="Stats_lbl1">x days y hours z minutes</span>';
//Case 2
$content = '<span id="Stats_lbl1">y hours z minutes</span>';
//Case 3
$content = '<span id="Stats_lbl1">z minutes</span>';
I'd like a regular expression that can account for all three possibilities. Moreover, [if possible, but not necessary] I'd like each number to keep a constant index.

Case One should end up like:
$stats[0] = x;
$stats[1] = y;
$stats[2] = z;

Case Two:
$stats[0] = 0;
$stats[1] = y;
$stats[2] = z;

Case Three:
$stats[0] = 0;
$stats[1] = 0;
$stats[2] = z;

That way, index 0 will always be the number of days, index 1 the number of hours, and index 2 the number of minutes. Of course, that is the ideal situation, but I wouldn't mind checking the size of the array instead [if 3 => case 1; if 2 => case 2; if 1 => case 3].

Thanks
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: matching dynamic text

Post by prometheuzz »

"Named capture", the (?P<...> ... ) part, is a handy tool in this case. Although it makes your regex longer, it also makes it clearer, IMO.
Note that the x modifier in '{ ... }x' makes the regex engine ignore whitespace and newline characters inside the pattern, which lets you align your regex nicely.

Code: Select all

#!/usr/bin/php
<?php
$tests = array(
    '<span id="Stats_lbl1">5 days 12 hours 4 minutes</span>',
    '<span id="Stats_lbl1">x days y hours z minutes</span>',
    '<span id="Stats_lbl1">y hours z minutes</span>',
    '<span id="Stats_lbl1">z minutes</span>'
);
foreach($tests as $t) {
    $regex = '{
                [^>]+>
                ((?P<DAY> [^\s]+ ) \s+ days    )? \s*
                ((?P<HRS> [^\s]+ ) \s+ hours   )? \s*
                ((?P<MIN> [^\s]+ ) \s+ minutes )? \s* 
                .*
              }x';
    if(preg_match($regex, $t, $matches)) {
        // Optional groups that took no part in the match are empty or unset.
        $days    = !empty($matches['DAY']) ? $matches['DAY'] : 0;
        $hours   = !empty($matches['HRS']) ? $matches['HRS'] : 0;
        $minutes = !empty($matches['MIN']) ? $matches['MIN'] : 0;
        print "test=$t\n  days=$days\n  hours=$hours\n  minutes=$minutes\n\n";
    }
}
 
/* output:
 
test=<span id="Stats_lbl1">5 days 12 hours 4 minutes</span>
  days=5
  hours=12
  minutes=4
 
test=<span id="Stats_lbl1">x days y hours z minutes</span>
  days=x
  hours=y
  minutes=z
 
test=<span id="Stats_lbl1">y hours z minutes</span>
  days=0
  hours=y
  minutes=z
 
test=<span id="Stats_lbl1">z minutes</span>
  days=0
  hours=0
  minutes=z
 
*/
?>
HTH.
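
For reference, here is a minimal sketch of the same named-capture idea. It assumes the values are always plain digits (so `\d+` stands in for `[^\s]+`) and that the span id stays `Stats_lbl1`. The function name `parse_stats` is made up for illustration. Optional groups that take no part in the match come back as empty strings or are left out of `$matches` altogether, so the sketch tests them with `empty()` instead of indexing them directly.

```php
<?php
// Hypothetical helper (not from the thread): returns days, hours and
// minutes as integers, or null when the span is not found.
function parse_stats($html) {
    // x modifier: whitespace inside the pattern is ignored, so literal
    // whitespace in the input has to be written as \s.
    $regex = '{
        <span \s+ id="Stats_lbl1"> \s*
        (?: (?P<DAY> \d+ ) \s+ days    \s* )?
        (?: (?P<HRS> \d+ ) \s+ hours   \s* )?
        (?: (?P<MIN> \d+ ) \s+ minutes \s* )?
        </span>
    }x';
    if (!preg_match($regex, $html, $m)) {
        return null;
    }
    // empty() covers both an empty string and a missing key.
    return array(
        'days'    => empty($m['DAY']) ? 0 : (int) $m['DAY'],
        'hours'   => empty($m['HRS']) ? 0 : (int) $m['HRS'],
        'minutes' => empty($m['MIN']) ? 0 : (int) $m['MIN'],
    );
}

// Usage: a missing unit defaults to 0.
print_r(parse_stats('<span id="Stats_lbl1">12 hours 4 minutes</span>'));
```

The outer `(?: ... )` groups are non-capturing, so only the three named values show up in `$m` next to the full match.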

Re: matching dynamic text

Post by SidewinderX »

Thanks! That works great. However the entire page content is stored in $matches[0] which is a little more overhead than I would like. If someone knows how to fix that before I can decipher the above regex and attempt to fix it myself, I would appreciate it. I'll post my finished regex, if I get one to work :)

Thanks again

Re: matching dynamic text

Post by prometheuzz »

SidewinderX wrote: Thanks! That works great. However the entire page content is stored in $matches[0] which is a little more overhead than I would like. If someone knows how to fix that before I can decipher the above regex and attempt to fix it myself, I would appreciate it. I'll post my finished regex, if I get one to work :)

Thanks again
You're welcome.
I've only just started learning PHP's regex functions, so I don't know how to avoid storing the complete match in $matches[0]. I'll look into it. If I manage to find something, I'll post back.

Re: matching dynamic text

Post by prometheuzz »

SidewinderX wrote: Thanks! That works great. However the entire page content is stored in $matches[0] which is a little more overhead than I would like
...
Well, that was easier than I'd thought. It turns out that preg_match does not need to match the entire text (I was under the impression that it did). So, you can do something like this:

Code: Select all

#!/usr/bin/php
<?php
$tests = array(
    'some text not to store <span id="Stats_lbl1">5 DAYS 12 hours 4 minutes</span> some more text not to store ',
    'some text not to store <span id="Stats_lbl1">x days y Hours z minutes</span> some more text not to store ',
    'some text not to store <span id="Stats_lbl1">y hours z minutes</span> some more text not to store ',
    'some text not to store <span id="Stats_lbl1">z minutEs</span> some more text not to store '
);
foreach($tests as $t) {
    $regex = '{
            (?: ([^\s>]+) \s+ days      \s+ )?
            (?: ([^\s>]+) \s+ hours     \s+ )?
                ([^\s>]+) \s+ minutes
        }xi';
    if(preg_match($regex, $t, $matches)) {
        $days    = $matches[1] ? $matches[1] : 0;
        $hours   = $matches[2] ? $matches[2] : 0;
        $minutes = $matches[3] ? $matches[3] : 0;
        print "text = $t\n  entire match = $matches[0]\n  days = $days\n  hours = $hours\n  minutes = $minutes\n\n";
    }
}
 
/* output:
 
text = some text not to store <span id="Stats_lbl1">5 DAYS 12 hours 4 minutes</span> some more text not to store 
  entire match = 5 DAYS 12 hours 4 minutes
  days = 5
  hours = 12
  minutes = 4
 
text = some text not to store <span id="Stats_lbl1">x days y Hours z minutes</span> some more text not to store 
  entire match = x days y Hours z minutes
  days = x
  hours = y
  minutes = z
 
text = some text not to store <span id="Stats_lbl1">y hours z minutes</span> some more text not to store 
  entire match = y hours z minutes
  days = 0
  hours = y
  minutes = z
 
text = some text not to store <span id="Stats_lbl1">z minutEs</span> some more text not to store 
  entire match = z minutEs
  days = 0
  hours = 0
  minutes = z
 
*/
?>
Note that besides the x-flag, I used the i-flag, which makes the match case-insensitive.

HTH.
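
Since the original goal was to do arithmetic on the values, here is a rough sketch that builds on the pattern above. It assumes the values are digits only (`\d+` in place of `[^\s>]+`), and the names `extract_stats` and `total_minutes` are hypothetical. Because the minutes group is required and comes last, the two optional groups are always present in `$m`, as empty strings when they took no part in the match, and casting an empty string to int gives 0.

```php
<?php
// Hypothetical helper: returns array(days, hours, minutes) with fixed
// indexes, or null when no "<n> minutes" text is found.
function extract_stats($text) {
    $regex = '{
        (?: (\d+) \s+ days  \s+ )?
        (?: (\d+) \s+ hours \s+ )?
            (\d+) \s+ minutes
    }xi';
    if (!preg_match($regex, $text, $m)) {
        return null;
    }
    // (int) "" is 0, so missing units default to zero.
    return array((int) $m[1], (int) $m[2], (int) $m[3]);
}

// Hypothetical helper: collapses the three values into minutes.
function total_minutes(array $stats) {
    list($days, $hours, $minutes) = $stats;
    return ($days * 24 + $hours) * 60 + $minutes;
}

$page  = 'some text <span id="Stats_lbl1">5 days 12 hours 4 minutes</span> more text';
$stats = extract_stats($page);
if ($stats !== null) {
    print "total minutes = " . total_minutes($stats) . "\n";  // 7924
}
```

Index 0 is always days, index 1 hours and index 2 minutes, which is the fixed layout asked for in the first post.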