Page 1 of 1

faster => straight text or text expression?

Posted: Sat Feb 11, 2006 10:35 am
by bryanchar
I'm working on a project that scans email looking for key words. As we have a lot of key words I was wondering which is faster -- individual key words or a key word expression.

Here is an example:

\bcompensate\b
\bcompensation\b

or

\bcompensat[e|ion]\b

Thanks

Posted: Sat Feb 11, 2006 10:42 am
by feyd
ignoring your logic errors in your expression, test it. Your performance may vary from that of which we've seen. However, I would guess that the expression will win, but only slightly unless you are using ereg_* in which case, I would bet expression wins hands down.

Posted: Sat Feb 11, 2006 11:04 am
by bryanchar
Since we are scanning 200,000 per day real time I wanted to make sure we squeezed every last advantage from the valuation of the expression.

Sorry about the logic error. More like \bcompensat(e|ion)\b.

Thanks

Posted: Sat Feb 11, 2006 12:47 pm
by raghavan20
bryanchar wrote:Since we are scanning 200,000 per day real time I wanted to make sure we squeezed every last advantage from the valuation of the expression.

Sorry about the logic error. More like \bcompensat(e|ion)\b.

Thanks
please run the test and report the result and I am interested to know the result.
I have a doubt when you put
\bsomeword(one|two)\b
will the regex work like this...
1. first searches for someword
2. if found, then looks for trailing word, one or two

or

1. it by nature, splits the word into \bsomewordone\b and \bsomewordtwo\b
2. searches for first match string, then for second match string and so on...
if the second way is how regex works, then I think you should use your own method of looping that would overhead for the regex to split it up....

Posted: Sat Feb 11, 2006 12:58 pm
by feyd
it sorta of breaks it into your second set raghavan20. Although instead of searching for one then the other, it allows both to happen at any rate.

Posted: Sat Feb 11, 2006 4:53 pm
by bryanchar
Since the regex engine is a black box to me I won't have a way to know the outcome at a detailed level. I was hoping that there was a general observation about this issue so we could deliver our expressions file optimized to get the best performance.

Is the question any different than if we were looking for \basset(s)?\b

To me they represent the same question. Is \basset\b and \bassets\b faster than \basset(s)?\b.

Thank for your thoughts.

Posted: Sat Feb 11, 2006 5:27 pm
by feyd
the only way to really test is with a large volume of text. So I'll do some tests here:

Code: Select all

<?php

set_time_limit(0);

$file = 'DavidCopperfield.txt';

$content = file_get_contents($file);

$iterations = 100;

$times = array();

$sets = array(
	array('#\badvertis(ement|ing)\b#i'),
	array('#\badvertis(?:ement|ing)\b#i'),
	array('#\badvertisement\b#i','#\badvertising\b#i'),
	array('#\badvertisements?\b#i'),
	array('#\badvertisement\b#i','#\badvertisements\b#i'),
);

$i = 0;
$start = 0;
$stop = 0;
$count = 0;
foreach($sets as $set => $patterns)
{
	$times[$set] = array();
	foreach($patterns as $pattern)
	{
		$times[$set][$pattern] = array();
		$times[$set][$pattern]['count'] = 0;
		$times[$set][$pattern]['time'] = 0;
		$times[$set][$pattern]['matches'] = null;
		for($i = 0; $i < $iterations; $i++)
		{
			$start = microtime(true);
			$count = preg_match_all($pattern,$content,$matches);
			$stop = microtime(true);
			$times[$set][$pattern]['count'] = $count;
			$times[$set][$pattern]['time'] = bcadd($times[$set][$pattern]['time'],$stop - $start,20);
			$times[$set][$pattern]['matches'] = $matches;
			unset($matches);
		}
	}
}

var_export($times);

?>

Code: Select all

array (
  0 => 
  array (
    '#\\badvertis(ement|ing)\\b#i' => 
    array (
      'count' => 8,
      'time' => '6.69298410415649300000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertisement',
          1 => 'advertisement',
          2 => 'advertisement',
          3 => 'advertising',
          4 => 'Advertising',
          5 => 'advertisement',
          6 => 'advertisement',
          7 => 'advertisement',
        ),
        1 => 
        array (
          0 => 'ement',
          1 => 'ement',
          2 => 'ement',
          3 => 'ing',
          4 => 'ing',
          5 => 'ement',
          6 => 'ement',
          7 => 'ement',
        ),
      ),
    ),
  ),
  1 => 
  array (
    '#\\badvertis(?:ement|ing)\\b#i' => 
    array (
      'count' => 8,
      'time' => '6.62231302261352600000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertisement',
          1 => 'advertisement',
          2 => 'advertisement',
          3 => 'advertising',
          4 => 'Advertising',
          5 => 'advertisement',
          6 => 'advertisement',
          7 => 'advertisement',
        ),
      ),
    ),
  ),
  2 => 
  array (
    '#\\badvertisement\\b#i' => 
    array (
      'count' => 6,
      'time' => '6.63905668258666700000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertisement',
          1 => 'advertisement',
          2 => 'advertisement',
          3 => 'advertisement',
          4 => 'advertisement',
          5 => 'advertisement',
        ),
      ),
    ),
    '#\\badvertising\\b#i' => 
    array (
      'count' => 2,
      'time' => '6.57100462913513500000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertising',
          1 => 'Advertising',
        ),
      ),
    ),
  ),
  3 => 
  array (
    '#\\badvertisements?\\b#i' => 
    array (
      'count' => 7,
      'time' => '6.63881516456603600000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertisement',
          1 => 'advertisements',
          2 => 'advertisement',
          3 => 'advertisement',
          4 => 'advertisement',
          5 => 'advertisement',
          6 => 'advertisement',
        ),
      ),
    ),
  ),
  4 => 
  array (
    '#\\badvertisement\\b#i' => 
    array (
      'count' => 6,
      'time' => '6.57747244834900200000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertisement',
          1 => 'advertisement',
          2 => 'advertisement',
          3 => 'advertisement',
          4 => 'advertisement',
          5 => 'advertisement',
        ),
      ),
    ),
    '#\\badvertisements\\b#i' => 
    array (
      'count' => 1,
      'time' => '6.57750916481017900000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertisements',
        ),
      ),
    ),
  ),
)

Posted: Sat Feb 11, 2006 7:27 pm
by raghavan20
that is a good example...I need a clarification...
what is the difference between

Code: Select all

/Foo(?:bar)+/
and

Code: Select all

/Foo(bar)+/

Posted: Sat Feb 11, 2006 7:45 pm
by bryanchar
Sorry to ask this but what does the data indicate? The numbers seem to indicate no statistical difference. Am I interpreting this correctly?

Posted: Sat Feb 11, 2006 8:21 pm
by feyd
raghavan20 wrote:that is a good example...I need a clarification...
what is the difference between

Code: Select all

/Foo(?:bar)+/
and

Code: Select all

/Foo(bar)+/
the first tells regex to not remember the value found inside, it's simply a grouping, while the second tells regex to remember the found match.
bryanchar wrote:Sorry to ask this but what does the data indicate? The numbers seem to indicate no statistical difference. Am I interpreting this correctly?
The difference is actually significant. The results show that two separate regex took 13 seconds (there are two elements in the third array (2)) to execute while a single regex form only took 7 seconds. Running over 2MB of data, 100 times, the real difference isn't too bad, but every little bit of time shaved it important when you're talking about lots and lots of matching.

Additionally, telling the regex engine to remember the group which was found was slower than not remembering. This may be attributed to the time requirements to allocate the additional array entries. But again, every little bit of time saved is important.

Now, another thing to test is whether combining all of the keywords together into a single regex is more efficient than separately. I would bet it is. The difference between compact regex (combining words where possible) as well may yield a benefit, but could also hurt as the regex engine has to pay attention to more.

Posted: Sat Feb 11, 2006 8:42 pm
by bryanchar
This has been fascinating. Our lexicon has over 470 keywords. While it makes it harder to read I'm going to combine keywords where it makes sense. I'm certain this effort will have a positive impact on processing performance.

Thanks so much for entertaining a newbie question!!

Bryan C.