Regular expression performance: perl vs. PHP

Posted: Thu Aug 28, 2003 4:35 pm
by tridude
I've got an application using regular expressions that lets users query some giant log files and sends the lines matching their search criteria to the browser. I wanted to use PHP for the whole thing, but have found that using perl gives much, much faster results.

My question: has anyone else seen a noticeable performance difference between PHP and perl when it comes to using regular expressions? If it makes a difference, I'm not using non-greedy matching in my perl regular expressions.

Thanks in advance for your reply.

Posted: Thu Aug 28, 2003 6:42 pm
by BDKR
I've not been in a position to see any potential differences, but I've heard from a number of different 'notables' that Perl's string handling is much faster. Perhaps you can post your PHP code and some of us can see if there is an area where it can be sped up? Maybe something you overlooked?

Cheers,
BDKR

Posted: Thu Aug 28, 2003 9:27 pm
by tridude
Here's a sample of the log file(s) being searched:

Code: Select all

2-11-2003 mgm20021231.zip o ghptemp
2-11-2003 mgm20021231.zip i helter
2-11-2003 hend20030131.zip i ratkinso
2-11-2003 mlc20021231.zip o litmus
2-11-2003 mlc20021231.zip i ghptemp
2-11-2003 MARTINCURRIE20021231.zip o imaintrl
2-11-2003 Marlborough20021231.zip o imaintrl
2-11-2003 LeggMasonInvestors20021231.zip o joe

Here's a stripped-down version of the PHP regex code:

Code: Select all

<?php
// set file to read
$filename = "FILE.log";

// open file
$fh = fopen ($filename, "r") or die("Could not open file");

// read file
while (!feof($fh)) {
    $data = fgets($fh);
    if (preg_match("/akotecha/", $data)) {
        echo $data;
    }
}
// close file
fclose ($fh);
?>

And in perl, I'm using this:

Code: Select all

#!/usr/bin/perl

my $file = "FTP.log";

open FH, $file;

while (<FH>) {
    print if /akotecha/;
}

The PHP script takes about .822 seconds to return 400 lines, while the perl script can do it in .181.

I'm guessing there might be a better way of doing the PHP script, as there's most likely some overhead associated with checking for the end of the file after each line is read. However, I haven't figured out how to eliminate this. Maybe this really isn't a regex issue at all...
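
One possible way to drop the separate feof() check, offered as a rough sketch rather than a measured fix: fgets() returns false at end of file, so its return value can drive the loop directly. The helper name grep_file() and the temporary sample file are made up for illustration, and the type declarations assume a current PHP (7 or later), not the PHP of 2003.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the thread):
// return every line of $filename that matches the regex $pattern.
function grep_file(string $filename, string $pattern): array
{
    $fh = fopen($filename, 'r');
    if ($fh === false) {
        throw new RuntimeException("Could not open $filename");
    }

    $matches = [];
    // fgets() returns false at end of file, so no feof() call is needed.
    while (($line = fgets($fh)) !== false) {
        if (preg_match($pattern, $line)) {
            $matches[] = $line;
        }
    }

    fclose($fh);
    return $matches;
}

// Usage with a small temporary file standing in for the real log.
$tmp = tempnam(sys_get_temp_dir(), 'log');
file_put_contents($tmp, "2-11-2003 a.zip o ghptemp\n2-11-2003 b.zip i akotecha\n");
echo implode('', grep_file($tmp, '/akotecha/'));
unlink($tmp);
```

Whether this is noticeably faster than the feof() version would have to be timed against the real log files.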

Posted: Fri Aug 29, 2003 7:12 am
by BDKR
This should work too. I can't say with certainty that there will be a speed up, but it's worth a try.

Code: Select all

<?php
// Open the file into an array
$fh = file("FILE.log");

// Read file
foreach($fh as $data) {
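    // stristr() returns false when the needle is not found
    if (stristr($data, 'akotecha') !== false) {
        // file() keeps each line's trailing newline, so none is added here
        echo $data;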
    }
}

?>

The way you were using preg_match() looked as though you were simply checking whether a string was contained in a larger string. That said, I used stristr(). I hope I didn't read that portion of the code wrong. :cry:

We also have the line count down to something closer to your Perl example, although that alone shouldn't make much of a difference.

I won't be back in town until Monday, but I want to hear how this works out.

Cheers,
BDKR

Posted: Fri Aug 29, 2003 1:01 pm
by m3rajk
umm


php's preg functions actually use perl.

do you mean posix v perl?????

if so perl is much more robust, and some people claim it's faster, others that posix is faster. all i know for sure is that perl is capable of much more. php itself has no regexp. it either calls perl regexp or posix, depending on the function you use

Posted: Fri Aug 29, 2003 4:35 pm
by Rook
my 2 cents....

Actually... Perl and PHP do not use the same engine for regular expressions. Perl has its own. PHP uses the PCRE engine, which of course is based on Perl's engine. Lots of other languages use this PCRE engine, and the two have a LOT of similarities, but they are in fact two different regex engines.

Perl vs. PCRE

Perl can definitely be faster than the PCRE engine, in certain circumstances, because of engine specific optimizations.

For instance:
.* vs. (?:.)*
These are logically identical, but the .* is much faster in a PHP script because of PCRE's simple quantifier optimization. Perl is optimized for both forms, so there is no performance difference there.
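
A rough way to check this on a given PHP build is sketched below. The time_pattern() helper and the iteration count are made up for illustration, and the sample line is taken from the log excerpt earlier in the thread. Whether a gap shows up at all will depend on the PCRE version in use, so treat the numbers as local measurements only.

```php
<?php
// Hypothetical helper: run preg_match() $iterations times for one pattern
// and return the elapsed wall-clock time in seconds.
function time_pattern(string $pattern, string $subject, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject);
    }
    return microtime(true) - $start;
}

$line = '2-11-2003 LeggMasonInvestors20021231.zip o joe';

// The two logically identical forms from the post above.
foreach (['/^.*$/', '/^(?:.)*$/'] as $pattern) {
    printf("%-12s %.4f s\n", $pattern, time_pattern($pattern, $line, 100000));
}
```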

- Rook.

Posted: Fri Aug 29, 2003 7:04 pm
by JAM
Agree with Rook.
Another note worth mentioning: using posix, pcre or perl regexes to search for a fixed string is a waste of resources. PHP's string functions (as BDKR mentioned) are the best fit for that.

I looked into four or five different benchmarks of them, and all show the following order:
// Faster
PHP's stringfunctions
PCRE
Posix
// Slower
...tho Perl wasn't mentioned, so I can't place it. A guess would be at least ahead of PCRE, because it's more direct.
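
For anyone who wants to rerun that kind of comparison, here is a minimal sketch, not one of the benchmarks referred to above. The time_call() helper and the iteration count are made up for illustration. The POSIX ereg functions are left out because they no longer exist in PHP 7 and later, and the closure call overhead is included in every timing, so only the relative order means anything.

```php
<?php
// Hypothetical helper: call $fn $iterations times and return elapsed seconds.
function time_call(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$line   = '2-11-2003 mgm20021231.zip i akotecha';
$needle = 'akotecha';
// preg_quote() escapes any regex metacharacters in the literal needle.
$regex  = '/' . preg_quote($needle, '/') . '/';

// Each candidate answers the same question: does $line contain $needle?
$candidates = [
    'strpos'     => function () use ($line, $needle) { return strpos($line, $needle) !== false; },
    'strstr'     => function () use ($line, $needle) { return strstr($line, $needle) !== false; },
    'preg_match' => function () use ($line, $regex) { return preg_match($regex, $line) === 1; },
];

foreach ($candidates as $name => $fn) {
    printf("%-10s %.4f s\n", $name, time_call($fn, 100000));
}
```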

Posted: Sat Aug 30, 2003 7:20 pm
by m3rajk
i didn't realize php was using a perl like engine. i thought it was calling to the local perl engine.

my bad.

but i don't understand how making an exec call to use perl would be faster than the native pcre since it then has to get a child process

Posted: Mon Sep 01, 2003 8:07 am
by BDKR
I still want to hear how it came out. Was it faster or not?

Cheers,
BDKR

Posted: Tue Sep 02, 2003 5:11 pm
by tridude
Using the code suggested by BDKR it was actually slightly slower (maybe due to the fact that it had to suck all the data into an array first), averaging about .880 seconds. Replacing the case-insensitive stristr() function with its case-sensitive counterpart, strstr(), brought the time down to about .755 seconds. Anyway, I'd like to avoid the use of an array, as the log files are on the order of 4-6 MB each and, depending on the query, up to 3 of them will need to be searched.

By substituting the preg_match() function with the strstr() function in my original code, I was able to get the time down to about .522 seconds. Given the flexibility that PHP allows compared to perl in my situation, that's a difference I can live with.
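
The final code was not posted, so the following is only a guess at its general shape: a line-by-line read with strstr() in place of preg_match(), extended to several log files without loading any of them into an array. The search_logs() name is made up for illustration, and the generator syntax assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: stream each log file line by line and yield the
// lines that contain $needle, so no file is held in memory as a whole.
function search_logs(array $filenames, string $needle): Generator
{
    foreach ($filenames as $filename) {
        $fh = fopen($filename, 'r');
        if ($fh === false) {
            continue; // this sketch simply skips files it cannot open
        }
        while (($line = fgets($fh)) !== false) {
            // strstr() is case-sensitive and returns false when not found
            if (strstr($line, $needle) !== false) {
                yield $line;
            }
        }
        fclose($fh);
    }
}
```

It could be used along the lines of `foreach (search_logs(['FILE.log'], 'akotecha') as $line) { echo $line; }`, with the file list built from whatever the query requires.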

Posted: Tue Sep 02, 2003 10:31 pm
by BDKR
tridude wrote: Using the code suggested by BDKR it was actually slightly slower (maybe due to the fact that it had to suck all the data into an array first), averaging about .880 seconds. Replacing the case-insensitive stristr() function with its case-sensitive counterpart, strstr(), brought the time down to about .755 seconds. Anyway, I'd like to avoid the use of an array, as the log files are on the order of 4-6 MB each and, depending on the query, up to 3 of them will need to be searched.

I agree that it was most likely slower for the reason you stated. I suspected as much going in; however, I knew that a parsing of the file was going to have to take place somewhere.

tridude wrote: By substituting the preg_match() function with the strstr() function in my original code, I was able to get the time down to about .522 seconds. Given the flexibility that PHP allows compared to perl in my situation, that's a difference I can live with.

Good stuff!

I'll say one thing for sure. I was surprised at how few lines it took to do this in Perl. I'm not a Perl Monk, so I suspect I'll see more things like this from Perl.

I wasn't surprised by the speed though. :roll:

And on the topic of speed, there is a lot of talk about improvements to the engine that could very well have a good effect on this kind of thing. I suspect that the Zend engines, both 1 and 2, have plenty of areas where they can improve the performance overall. Check this link here...

http://php.weblogs.com/discuss/msgReader$2870

There was more talk about the low level memory manager. Maybe with more stuff like this going on, PHP will close the gap with Perl where performance is concerned.

At least it's still faster than Ruby and Python. And in many cases, Java as well.

Cheers,
BDKR

Posted: Tue Sep 02, 2003 11:37 pm
by Stoker
In my own little geeky world in my head I have some basic rules when it comes to regex..

1. Never use regex if you don't need it; use strpos() or str_replace() and such (which was the solution here)
2. Never ever use Posix regex, and please join the crusade to get it taken out of PHP :wink:
3. If you are going to use a fairly complex regex and it must be efficient, consult forums and/or read O'Reilly's regex book from beginning to end. It offers an extensive understanding of how the perl regex engine works, e.g. why a 6K regex is much faster at validating an email address than a 4K regex even if they do the same job... I don't remember much of that book now, but it's in my head and I look stuff up when wondering about specifics..