faster => straight text or text expression?
Moderator: General Moderators
faster => straight text or text expression?
I'm working on a project that scans email looking for key words. As we have a lot of key words I was wondering which is faster -- individual key words or a key word expression.
Here is an example:
\bcompensate\b
\bcompensation\b
or
\bcompensat[e|ion]\b
Thanks
Here is an example:
\bcompensate\b
\bcompensation\b
or
\bcompensat[e|ion]\b
Thanks
- raghavan20
- DevNet Resident
- Posts: 1451
- Joined: Sat Jun 11, 2005 6:57 am
- Location: London, UK
- Contact:
please run the test and report the result and I am interested to know the result.bryanchar wrote:Since we are scanning 200,000 per day real time I wanted to make sure we squeezed every last advantage from the valuation of the expression.
Sorry about the logic error. More like \bcompensat(e|ion)\b.
Thanks
I have a doubt when you put
\bsomeword(one|two)\b
will the regex work like this...
1. first searches for someword
2. if found, then looks for trailing word, one or two
or
1. it by nature, splits the word into \bsomewordone\b and \bsomewordtwo\b
2. searches for first match string, then for second match string and so on...
if the second way is how regex works, then I think you should use your own method of looping that would overhead for the regex to split it up....
Since the regex engine is a black box to me I won't have a way to know the outcome at a detailed level. I was hoping that there was a general observation about this issue so we could deliver our expressions file optimized to get the best performance.
Is the question any different than if we were looking for \basset(s)?\b
To me they represent the same question. Is \basset\b and \bassets\b faster than \basset(s)?\b.
Thank for your thoughts.
Is the question any different than if we were looking for \basset(s)?\b
To me they represent the same question. Is \basset\b and \bassets\b faster than \basset(s)?\b.
Thank for your thoughts.
- feyd
- Neighborhood Spidermoddy
- Posts: 31559
- Joined: Mon Mar 29, 2004 3:24 pm
- Location: Bothell, Washington, USA
the only way to really test is with a large volume of text. So I'll do some tests here:
Code: Select all
<?php
set_time_limit(0);
$file = 'DavidCopperfield.txt';
$content = file_get_contents($file);
$iterations = 100;
$times = array();
$sets = array(
array('#\badvertis(ement|ing)\b#i'),
array('#\badvertis(?:ement|ing)\b#i'),
array('#\badvertisement\b#i','#\badvertising\b#i'),
array('#\badvertisements?\b#i'),
array('#\badvertisement\b#i','#\badvertisements\b#i'),
);
$i = 0;
$start = 0;
$stop = 0;
$count = 0;
foreach($sets as $set => $patterns)
{
$times[$set] = array();
foreach($patterns as $pattern)
{
$times[$set][$pattern] = array();
$times[$set][$pattern]['count'] = 0;
$times[$set][$pattern]['time'] = 0;
$times[$set][$pattern]['matches'] = null;
for($i = 0; $i < $iterations; $i++)
{
$start = microtime(true);
$count = preg_match_all($pattern,$content,$matches);
$stop = microtime(true);
$times[$set][$pattern]['count'] = $count;
$times[$set][$pattern]['time'] = bcadd($times[$set][$pattern]['time'],$stop - $start,20);
$times[$set][$pattern]['matches'] = $matches;
unset($matches);
}
}
}
var_export($times);
?>Code: Select all
array (
0 =>
array (
'#\\badvertis(ement|ing)\\b#i' =>
array (
'count' => 8,
'time' => '6.69298410415649300000',
'matches' =>
array (
0 =>
array (
0 => 'advertisement',
1 => 'advertisement',
2 => 'advertisement',
3 => 'advertising',
4 => 'Advertising',
5 => 'advertisement',
6 => 'advertisement',
7 => 'advertisement',
),
1 =>
array (
0 => 'ement',
1 => 'ement',
2 => 'ement',
3 => 'ing',
4 => 'ing',
5 => 'ement',
6 => 'ement',
7 => 'ement',
),
),
),
),
1 =>
array (
'#\\badvertis(?:ement|ing)\\b#i' =>
array (
'count' => 8,
'time' => '6.62231302261352600000',
'matches' =>
array (
0 =>
array (
0 => 'advertisement',
1 => 'advertisement',
2 => 'advertisement',
3 => 'advertising',
4 => 'Advertising',
5 => 'advertisement',
6 => 'advertisement',
7 => 'advertisement',
),
),
),
),
2 =>
array (
'#\\badvertisement\\b#i' =>
array (
'count' => 6,
'time' => '6.63905668258666700000',
'matches' =>
array (
0 =>
array (
0 => 'advertisement',
1 => 'advertisement',
2 => 'advertisement',
3 => 'advertisement',
4 => 'advertisement',
5 => 'advertisement',
),
),
),
'#\\badvertising\\b#i' =>
array (
'count' => 2,
'time' => '6.57100462913513500000',
'matches' =>
array (
0 =>
array (
0 => 'advertising',
1 => 'Advertising',
),
),
),
),
3 =>
array (
'#\\badvertisements?\\b#i' =>
array (
'count' => 7,
'time' => '6.63881516456603600000',
'matches' =>
array (
0 =>
array (
0 => 'advertisement',
1 => 'advertisements',
2 => 'advertisement',
3 => 'advertisement',
4 => 'advertisement',
5 => 'advertisement',
6 => 'advertisement',
),
),
),
),
4 =>
array (
'#\\badvertisement\\b#i' =>
array (
'count' => 6,
'time' => '6.57747244834900200000',
'matches' =>
array (
0 =>
array (
0 => 'advertisement',
1 => 'advertisement',
2 => 'advertisement',
3 => 'advertisement',
4 => 'advertisement',
5 => 'advertisement',
),
),
),
'#\\badvertisements\\b#i' =>
array (
'count' => 1,
'time' => '6.57750916481017900000',
'matches' =>
array (
0 =>
array (
0 => 'advertisements',
),
),
),
),
)- raghavan20
- DevNet Resident
- Posts: 1451
- Joined: Sat Jun 11, 2005 6:57 am
- Location: London, UK
- Contact:
that is a good example...I need a clarification...
what is the difference between
and
what is the difference between
Code: Select all
/Foo(?:bar)+/Code: Select all
/Foo(bar)+/- feyd
- Neighborhood Spidermoddy
- Posts: 31559
- Joined: Mon Mar 29, 2004 3:24 pm
- Location: Bothell, Washington, USA
the first tells regex to not remember the value found inside, it's simply a grouping, while the second tells regex to remember the found match.raghavan20 wrote:that is a good example...I need a clarification...
what is the difference betweenandCode: Select all
/Foo(?:bar)+/Code: Select all
/Foo(bar)+/
The difference is actually significant. The results show that two separate regex took 13 seconds (there are two elements in the third array (2)) to execute while a single regex form only took 7 seconds. Running over 2MB of data, 100 times, the real difference isn't too bad, but every little bit of time shaved it important when you're talking about lots and lots of matching.bryanchar wrote:Sorry to ask this but what does the data indicate? The numbers seem to indicate no statistical difference. Am I interpreting this correctly?
Additionally, telling the regex engine to remember the group which was found was slower than not remembering. This may be attributed to the time requirements to allocate the additional array entries. But again, every little bit of time saved is important.
Now, another thing to test is whether combining all of the keywords together into a single regex is more efficient than separately. I would bet it is. The difference between compact regex (combining words where possible) as well may yield a benefit, but could also hurt as the regex engine has to pay attention to more.