faster => straight text or text expression?

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

Post Reply
bryanchar
Forum Newbie
Posts: 5
Joined: Sat Feb 11, 2006 10:30 am

faster => straight text or text expression?

Post by bryanchar »

I'm working on a project that scans email looking for key words. As we have a lot of key words I was wondering which is faster -- individual key words or a key word expression.

Here is an example:

\bcompensate\b
\bcompensation\b

or

\bcompensat[e|ion]\b

Thanks
User avatar
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

ignoring your logic errors in your expression, test it. Your performance may vary from that of which we've seen. However, I would guess that the expression will win, but only slightly unless you are using ereg_* in which case, I would bet expression wins hands down.
bryanchar
Forum Newbie
Posts: 5
Joined: Sat Feb 11, 2006 10:30 am

Post by bryanchar »

Since we are scanning 200,000 per day real time I wanted to make sure we squeezed every last advantage from the valuation of the expression.

Sorry about the logic error. More like \bcompensat(e|ion)\b.

Thanks
User avatar
raghavan20
DevNet Resident
Posts: 1451
Joined: Sat Jun 11, 2005 6:57 am
Location: London, UK
Contact:

Post by raghavan20 »

bryanchar wrote:Since we are scanning 200,000 per day real time I wanted to make sure we squeezed every last advantage from the valuation of the expression.

Sorry about the logic error. More like \bcompensat(e|ion)\b.

Thanks
please run the test and report the result and I am interested to know the result.
I have a doubt when you put
\bsomeword(one|two)\b
will the regex work like this...
1. first searches for someword
2. if found, then looks for trailing word, one or two

or

1. it by nature, splits the word into \bsomewordone\b and \bsomewordtwo\b
2. searches for first match string, then for second match string and so on...
if the second way is how regex works, then I think you should use your own method of looping that would overhead for the regex to split it up....
User avatar
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

it sorta of breaks it into your second set raghavan20. Although instead of searching for one then the other, it allows both to happen at any rate.
bryanchar
Forum Newbie
Posts: 5
Joined: Sat Feb 11, 2006 10:30 am

Post by bryanchar »

Since the regex engine is a black box to me I won't have a way to know the outcome at a detailed level. I was hoping that there was a general observation about this issue so we could deliver our expressions file optimized to get the best performance.

Is the question any different than if we were looking for \basset(s)?\b

To me they represent the same question. Is \basset\b and \bassets\b faster than \basset(s)?\b.

Thank for your thoughts.
User avatar
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

the only way to really test is with a large volume of text. So I'll do some tests here:

Code: Select all

<?php

set_time_limit(0);

$file = 'DavidCopperfield.txt';

$content = file_get_contents($file);

$iterations = 100;

$times = array();

$sets = array(
	array('#\badvertis(ement|ing)\b#i'),
	array('#\badvertis(?:ement|ing)\b#i'),
	array('#\badvertisement\b#i','#\badvertising\b#i'),
	array('#\badvertisements?\b#i'),
	array('#\badvertisement\b#i','#\badvertisements\b#i'),
);

$i = 0;
$start = 0;
$stop = 0;
$count = 0;
foreach($sets as $set => $patterns)
{
	$times[$set] = array();
	foreach($patterns as $pattern)
	{
		$times[$set][$pattern] = array();
		$times[$set][$pattern]['count'] = 0;
		$times[$set][$pattern]['time'] = 0;
		$times[$set][$pattern]['matches'] = null;
		for($i = 0; $i < $iterations; $i++)
		{
			$start = microtime(true);
			$count = preg_match_all($pattern,$content,$matches);
			$stop = microtime(true);
			$times[$set][$pattern]['count'] = $count;
			$times[$set][$pattern]['time'] = bcadd($times[$set][$pattern]['time'],$stop - $start,20);
			$times[$set][$pattern]['matches'] = $matches;
			unset($matches);
		}
	}
}

var_export($times);

?>

Code: Select all

array (
  0 => 
  array (
    '#\\badvertis(ement|ing)\\b#i' => 
    array (
      'count' => 8,
      'time' => '6.69298410415649300000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertisement',
          1 => 'advertisement',
          2 => 'advertisement',
          3 => 'advertising',
          4 => 'Advertising',
          5 => 'advertisement',
          6 => 'advertisement',
          7 => 'advertisement',
        ),
        1 => 
        array (
          0 => 'ement',
          1 => 'ement',
          2 => 'ement',
          3 => 'ing',
          4 => 'ing',
          5 => 'ement',
          6 => 'ement',
          7 => 'ement',
        ),
      ),
    ),
  ),
  1 => 
  array (
    '#\\badvertis(?:ement|ing)\\b#i' => 
    array (
      'count' => 8,
      'time' => '6.62231302261352600000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertisement',
          1 => 'advertisement',
          2 => 'advertisement',
          3 => 'advertising',
          4 => 'Advertising',
          5 => 'advertisement',
          6 => 'advertisement',
          7 => 'advertisement',
        ),
      ),
    ),
  ),
  2 => 
  array (
    '#\\badvertisement\\b#i' => 
    array (
      'count' => 6,
      'time' => '6.63905668258666700000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertisement',
          1 => 'advertisement',
          2 => 'advertisement',
          3 => 'advertisement',
          4 => 'advertisement',
          5 => 'advertisement',
        ),
      ),
    ),
    '#\\badvertising\\b#i' => 
    array (
      'count' => 2,
      'time' => '6.57100462913513500000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertising',
          1 => 'Advertising',
        ),
      ),
    ),
  ),
  3 => 
  array (
    '#\\badvertisements?\\b#i' => 
    array (
      'count' => 7,
      'time' => '6.63881516456603600000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertisement',
          1 => 'advertisements',
          2 => 'advertisement',
          3 => 'advertisement',
          4 => 'advertisement',
          5 => 'advertisement',
          6 => 'advertisement',
        ),
      ),
    ),
  ),
  4 => 
  array (
    '#\\badvertisement\\b#i' => 
    array (
      'count' => 6,
      'time' => '6.57747244834900200000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertisement',
          1 => 'advertisement',
          2 => 'advertisement',
          3 => 'advertisement',
          4 => 'advertisement',
          5 => 'advertisement',
        ),
      ),
    ),
    '#\\badvertisements\\b#i' => 
    array (
      'count' => 1,
      'time' => '6.57750916481017900000',
      'matches' => 
      array (
        0 => 
        array (
          0 => 'advertisements',
        ),
      ),
    ),
  ),
)
User avatar
raghavan20
DevNet Resident
Posts: 1451
Joined: Sat Jun 11, 2005 6:57 am
Location: London, UK
Contact:

Post by raghavan20 »

that is a good example...I need a clarification...
what is the difference between

Code: Select all

/Foo(?:bar)+/
and

Code: Select all

/Foo(bar)+/
bryanchar
Forum Newbie
Posts: 5
Joined: Sat Feb 11, 2006 10:30 am

Post by bryanchar »

Sorry to ask this but what does the data indicate? The numbers seem to indicate no statistical difference. Am I interpreting this correctly?
User avatar
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

raghavan20 wrote:that is a good example...I need a clarification...
what is the difference between

Code: Select all

/Foo(?:bar)+/
and

Code: Select all

/Foo(bar)+/
the first tells regex to not remember the value found inside, it's simply a grouping, while the second tells regex to remember the found match.
bryanchar wrote:Sorry to ask this but what does the data indicate? The numbers seem to indicate no statistical difference. Am I interpreting this correctly?
The difference is actually significant. The results show that two separate regex took 13 seconds (there are two elements in the third array (2)) to execute while a single regex form only took 7 seconds. Running over 2MB of data, 100 times, the real difference isn't too bad, but every little bit of time shaved it important when you're talking about lots and lots of matching.

Additionally, telling the regex engine to remember the group which was found was slower than not remembering. This may be attributed to the time requirements to allocate the additional array entries. But again, every little bit of time saved is important.

Now, another thing to test is whether combining all of the keywords together into a single regex is more efficient than separately. I would bet it is. The difference between compact regex (combining words where possible) as well may yield a benefit, but could also hurt as the regex engine has to pay attention to more.
bryanchar
Forum Newbie
Posts: 5
Joined: Sat Feb 11, 2006 10:30 am

Post by bryanchar »

This has been fascinating. Our lexicon has over 470 keywords. While it makes it harder to read I'm going to combine keywords where it makes sense. I'm certain this effort will have a positive impact on processing performance.

Thanks so much for entertaining a newbie question!!

Bryan C.
Post Reply