Performance of PHP4 objects vs procedural and C++.

Not for 'how-to' coding questions but PHP theory instead, this forum is here for those of us who wish to learn about design aspects of programming with PHP.

Moderator: General Moderators

User avatar
Buddha443556
Forum Regular
Posts: 873
Joined: Fri Mar 19, 2004 1:51 pm

Post by Buddha443556 »

PHP Stabilisation Delayed A Week

:( Hopefully Thursday there will be better news.
User avatar
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Post by Christopher »

Heavy wrote:My parser is quite general.
Because of this I need to do lots of checking/data juggling. I've been wondering whether I should optimize it with the shotgun or the scalpel, ie scrap the design initiative or just find small pieces could be improved.
I can't say I look carefully at the code, but glancing at it, the first couple of things I noticed is that it is a very big class with lots of options that could be handled differently, and that there is unnecessary use of properties where locals would suffice, and there are loops within loops so it is not clear exactly what the algorithm is. My general response is: Yikes!

If you really want some refactoring help then perhaps post an example script using these classes along with data to parse. For a well coded lexer, take a look at the SImpleTest source.
User avatar
Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden
Contact:

Post by Heavy »

Yikes!

Yeah, I know. I was reluctant to letting anyone see the code :D.
The code works but it isn't cleaned up yet, where "clean up" means learn something new and spend another month with the most boring kind of programming tasks.

I think the hardest part is to carry the data around, which leads to the question; do I have to? Which might lead to a huge redesign that I am not very happy to do any time soon.

The "design" is what came up after several weeks of struggling with the logic. It isn't written from a clear definition and I think that's the biggest mistake I made - not knowing in which direction we were going. And the first version of this giant baby was only 300 lines with the inner loop around 60 lines or so. Nothing huge, but rather neat. But then came the special cases and the impact of enabling the flexibility I needed...

Maybe I will get back to the subject later. I was merely wondering. As the parser is working as it is I am concentrating on other parts of the project first.

Thanks
User avatar
Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden
Contact:

Post by Heavy »

Anyone considering writing a parser should take a moment meditating on the fact that it won't be good until you've rewritten it five times.

This time I'm getting closer.

I have not decided yet whether to stick with PHP or move the parser out to C++, but I have made some experiments with C++ that are really promising.

I started to write my own regexp engine. It took me a week effectively (I'm not a fast code designer.) although it's not really finished yet.

But this it does:
Supports subpatterns, quantifiers, character classes and special entities like \d \D \w \W \s \S.
It can re-run a test on a new string without reparsing the pattern, and that is a key feature, very likely to significantly boost performance in a use case like a parser where the same test occurs very often, but on different strings.

Here's the figures on my p4 laptop with 2,6 GHz CPU and 333 MHz RAM:
A regexp pattern object takes ~30 milliseconds to instantiate and parse the pattern into small matcher objects.

Here's my test pattern:
<(\w+)\s*((\w+)="([^"]*)"\s*)+/>

Since my parser I will write would slice that test up in smaller pieces (there's a context switch when going into or out from the tag, the attribute and so on) it will practically NEVER test such a long pattern. But anyway,
My benchmark uses that pattern to test following string:
<test class="frenetic" onmouseover="alert(1);" src="gui/smiley.png"/>

And it reports a match in only 30 microseconds, which I think is way better than I expected for a first attempt at writing a regexp library.

I will add the possibilty to use non greedy quantifiers, which will of course waste a significant amount of cycles when used. But I would only use it in rare cases.

30 µs per match means 33 333 matches per second, which would probably be a good thing inside a parser.

I haven't started working on the parser yet, but I was thinking of the same code design as the regexp library. I'll publish the code and make a link here when done.
User avatar
BDKR
DevNet Resident
Posts: 1207
Joined: Sat Jun 08, 2002 1:24 pm
Location: Florida
Contact:

Post by BDKR »

Awesome stuff. Never written a parser myself, but I think it's fascinating stuff (if I didn't have turbocharged cars on the brain, I'd prolly have some time to burn on it too).

Keep us posted.

:wink:
User avatar
BDKR
DevNet Resident
Posts: 1207
Joined: Sat Jun 08, 2002 1:24 pm
Location: Florida
Contact:

Post by BDKR »

Someother things occurred to me too. This is just kind of a brain dump for the most part.

If PHP was mutli-threading capable ( like say Python :twisted: ) the associated overhead of dealing with your huge arrays
wouldn't be so bad. Now of course, the reason it's bad is because you're using a "Shlemiel the painter's algorithm". LOL!!!
I first read about this algo on Joel Spolsky's site and it cracked me up.

From http://www.joelonsoftware.com/articles/ ... 00319.html
Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. "That's pretty good!" says his boss, "you're a fast worker!" and pays him a kopeck.

The next day Shlemiel only gets 150 yards done. "Well, that's not nearly as good as yesterday, but you're still a fast worker. 150 yards is respectable," and pays him a kopeck.

The next day Shlemiel paints 30 yards of the road. "Only 30!" shouts his boss. "That's unacceptable! On the first day you did ten times that much work! What's going on?"

"I can't help it," says Shlemiel. "Every day I get farther and farther away from the paint can!"
Now it occurs to me that perhaps we can increase the number of Shlemiel's via threading. Break up the array into smaller portions and have a Shlemiel for each. The Shlemiel that finds the wanted data first reports back thereby terminating the search in the other threads. There is a name for this algorithm somewhere I'm sure. LOL...

So of course, this kicks PHP out of the box as it's not multi-threaded. Oh well....

Anyway, just some rambling and I thought I would throw it in.

Cheers,
BDKR
User avatar
Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden
Contact:

Post by Heavy »

Thanks! ;)

What I did before was creating a copy of each token found in the string, and put it in the Shlemiel array along with its info, like:

Code: Select all

"ab"
gets parsed to

array(
   array(
      TOKEN => 'LITERAL_A',
      CHUNK => 'a',
      intPos => 0,
   ),
   array(
      TOKEN => 'LITERAL_B',
      CHUNK => 'b',
      intPos => 1,
   ),
)

and so on.
That ain't too pretty... If it was just about checking for correct syntax, it would be somewhat OK. But I want to do more than that, and I do more than that which is why it is so slow today I think... I am talking about parsing the string, getting references to the data therein and start juggling the data around. Moving chunks into new places. It was a bit hard with that storage technique.

In the next parser I will instead try to keep a reference to the data in the input string via offsets and chunk lengths and not move or copy them around.

Code: Select all

"ab" gets parsed to

array(
   array(
      TOKEN => 'LITERAL_A',
      offset => 0,
      length => 1,
   ),
   array(
      TOKEN => 'LITERAL_B',
      offset => 1,
      length => 1,
   ),
)
please look at the inner arrays as they were small DOM-objects. I know I am representing it as a PHP array here. We'll see if I have reason to do otherwise. If I use C++ I'll probably try to make some tricks to get the most important benefits both from the vector and the linked list.
Anyway, If I post syntax checking like to move data around, it would be by means of taking a node in the DOM-tree and jacking it in in a new place, ie just moving a reference or a pointer if I do this in C++.

Since the last parser catastrophy I am obsessed by getting as much as possible done without touching strings, because it seems strings are the bad guys. Carrying along integers is a lot lighter task.

OK, I see many weak spots in my presentation here, but I guess trying to finish the design and implement it would nail the flying papers down ;)
User avatar
Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden
Contact:

Post by Heavy »

Here are some figures from my regexp testing:

Code: Select all

jonas@jw ~/development/proj/cppregexp/build $ ./cppregexp
Pattern: (\d+)
 Number of capturing subpatterns: 2
Subject "<img src="gui/smiley.png"/>" does not match.
Time was: 0.0142948 ms per iteration and 7147.4 ms in total, using 500000 iterations.

Subject "<test class="frenetic" onmouseover="alert(1);" src="gui/smiley.png"/>" matches.stopped at offset "43".
Match 0:1
Match 1:1
Time was: 0.0216862 ms per iteration and 10843.1 ms in total, using 500000 iterations.

Subject "wwedeeeeeeeeeeeeeeeeeeeeeeeew123" matches.stopped at offset "32".
Match 0:123
Match 1:123
Time was: 0.015756 ms per iteration and 7877.98 ms in total, using 500000 iterations.

Subject "123qqqwwwwwwwwwwwwwwwwwwwwwwww" matches.stopped at offset "3".
Match 0:123
Match 1:123
Time was: 0.00208739 ms per iteration and 1043.69 ms in total, using 500000 iterations.

Subject "123" matches.stopped at offset "3".
Match 0:123
Match 1:123
Time was: 0.00180163 ms per iteration and 900.816 ms in total, using 500000 iterations.

Subject "xxx sohuoisvv wefoiu 123xxwefoiuwe f9u8h weef9x" matches.stopped at offset "24".
Match 0:123
Match 1:123
Time was: 0.012342 ms per iteration and 6171 ms in total, using 500000 iterations.




**********************************************************
Pattern: ^(\d+)
 Number of capturing subpatterns: 2
Subject "<img src="gui/smiley.png"/>" does not match.
Time was: 0.00126005 ms per iteration and 630.023 ms in total, using 500000 iterations.

Subject "<test class="frenetic" onmouseover="alert(1);" src="gui/smiley.png"/>" does not match.
Time was: 0.00128246 ms per iteration and 641.232 ms in total, using 500000 iterations.

Subject "wwedeeeeeeeeeeeeeeeeeeeeeeeew123" does not match.
Time was: 0.0012806 ms per iteration and 640.3 ms in total, using 500000 iterations.

Subject "123qqqwwwwwwwwwwwwwwwwwwwwwwww" matches.stopped at offset "3".
Match 0:123
Match 1:123
Time was: 0.00217044 ms per iteration and 1085.22 ms in total, using 500000 iterations.

Subject "123" matches.stopped at offset "3".
Match 0:123
Match 1:123
Time was: 0.00181529 ms per iteration and 907.642 ms in total, using 500000 iterations.

Subject "xxx sohuoisvv wefoiu 123xxwefoiuwe f9u8h weef9x" does not match.
Time was: 0.00115548 ms per iteration and 577.741 ms in total, using 500000 iterations.




**********************************************************
Pattern: (\d+)$
 Number of capturing subpatterns: 2
Subject "<img src="gui/smiley.png"/>" does not match.
Time was: 0.0135828 ms per iteration and 6791.42 ms in total, using 500000 iterations.

Subject "<test class="frenetic" onmouseover="alert(1);" src="gui/smiley.png"/>" does not match.
Time was: 0.0217067 ms per iteration and 10853.3 ms in total, using 500000 iterations.

Subject "wwedeeeeeeeeeeeeeeeeeeeeeeeew123" matches.stopped at offset "32".
Match 0:123
Match 1:123
Time was: 0.0158357 ms per iteration and 7917.84 ms in total, using 500000 iterations.

Subject "123qqqwwwwwwwwwwwwwwwwwwwwwwww" does not match.
Time was: 0.00214752 ms per iteration and 1073.76 ms in total, using 500000 iterations.

Subject "123" matches.stopped at offset "3".
Match 0:123
Match 1:123
Time was: 0.00178046 ms per iteration and 890.232 ms in total, using 500000 iterations.

Subject "xxx sohuoisvv wefoiu 123xxwefoiuwe f9u8h weef9x" does not match.
Time was: 0.0121033 ms per iteration and 6051.64 ms in total, using 500000 iterations.




**********************************************************
Pattern: ^(\d+)$
 Number of capturing subpatterns: 2
Subject "<img src="gui/smiley.png"/>" does not match.
Time was: 0.00122196 ms per iteration and 610.98 ms in total, using 500000 iterations.

Subject "<test class="frenetic" onmouseover="alert(1);" src="gui/smiley.png"/>" does not match.
Time was: 0.00130277 ms per iteration and 651.385 ms in total, using 500000 iterations.

Subject "wwedeeeeeeeeeeeeeeeeeeeeeeeew123" does not match.
Time was: 0.00124044 ms per iteration and 620.222 ms in total, using 500000 iterations.

Subject "123qqqwwwwwwwwwwwwwwwwwwwwwwww" does not match.
Time was: 0.0021179 ms per iteration and 1058.95 ms in total, using 500000 iterations.

Subject "123" matches.stopped at offset "3".
Match 0:123
Match 1:123
Time was: 0.00189618 ms per iteration and 948.091 ms in total, using 500000 iterations.

Subject "xxx sohuoisvv wefoiu 123xxwefoiuwe f9u8h weef9x" does not match.
Time was: 0.00124018 ms per iteration and 620.089 ms in total, using 500000 iterations.
The patterns are trivial, but I like the results anyway :)
User avatar
BDKR
DevNet Resident
Posts: 1207
Joined: Sat Jun 08, 2002 1:24 pm
Location: Florida
Contact:

Post by BDKR »

Dude!!! You're awesome! Make sure we all get to read whatever you write up on this.
User avatar
Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden
Contact:

Post by Heavy »

Hehe!
Thanks!

I should emphazise that the times do NOT include instantiation of the regexp object and parsing the pattern.

The test looks like this:

Code: Select all

#include <iostream>
#include <cstdlib>
#include <fstream>
#include <string>
#include <sstream>
#include "matcherbase.h"
#include "matcher_subpattern.h"
#include "regexp.h"
#include "time.h"
using namespace std;
using namespace cppregexp;
#define COUNT 5
int main(int argc, char *argv[])
{
	Regexp * r = 0;
	Time t;
	double ela = 0;
// 	Regexp regexp(string("<(\\w+)\\s*((\\w+)=\"([^\"]*)\"\\s*)*/>"));
	list<Regexp *> pl;
	pl.push_back( new Regexp(string("(\\d+)")));
	pl.push_back( new Regexp(string("^(\\d+)")));
	pl.push_back( new Regexp(string("(\\d+)$")));
	pl.push_back( new Regexp(string("^(\\d+)$")));
	
	list<string> ls;
// 	ls.push_back("<img src=\"gui/smiley.png\"/>");
// 	ls.push_back("<test class=\"frenetic\" onmouseover=\"alert(1);\" src=\"gui/smiley.png\"/>");
//	ls.push_back("<br />");
//	ls.push_back("www");
	ls.push_back("wwedeeeeeeeeeeeeeeeeeeeeeeeew123");
	ls.push_back("1wwedeeeeeeeeeeeeeeeeeeeeeeeew123");
// 	ls.push_back("123qqqwwwwwwwwwwwwwwwwwwwwwwww");
// 	ls.push_back("123");
// 	ls.push_back("xxx sohuoisvv wefoiu 123xxwefoiuwe f9u8h weef9x");
//	ls.push_back("wwwa");
//	ls.push_back("<imgwwwwwwwwwwwwwwwwwwwwwwwwwwwww/>");
//	ls.push_back("<b/>");
//	ls.push_back("</>");
	bool mmm = false;
	
	for (list<Regexp *>::iterator rit = pl.begin(); rit != pl.end(); ++rit){
		if (rit != pl.begin()) cerr << "\n\n\n**********************************************************" << endl;
		
		cerr << "Pattern: " << (*rit)->pattern() << "\n Number of capturing subpatterns: " << (*rit)->subPatternCount() << endl; 
		for (list<string>::iterator it = ls.begin(); it != ls.end(); ++it){
			ela = 0;
			for (int a = 0 ; a < COUNT; a++){
				t.start();
				
				mmm = (*rit)->matches(&(*it));
				ela += t.elapsed_ms();
			}
			if (mmm){
				cerr << "Subject \"" << *it << "\" matches." << 
				"stopped at offset \"" << (*rit)->subPattern(0)->offset() << "\"." << endl;
				for (int b = 0; b < (*rit)->subPatternCount(); b++){
					cerr << "Match " << b << ":" << (*rit)->match(b) << endl;
				}
			}else
				cerr << "Subject \"" << *it << "\" does not match." << endl;
	
			cerr << "Time was: " << ((float)ela/COUNT) << " ms per iteration and " << ela << " ms in total, using " << COUNT << " iterations.\n" << endl;
		}
	}
  return EXIT_SUCCESS;
}
Also, I like to say that I use naming that is not meaningless inside the library! :D
User avatar
Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden
Contact:

Post by Heavy »

Latest hilarious news on the performance:

Code: Select all

jonas@jw ~/development/proj/cppregexp/build $ LD_LIBRARY_PATH=.:/home/jonas/development/installed/lib  ./benchmarkcppregexp
Pattern: ^(\w+)="([^"]+)"
 Number of capturing subpatterns: 3
Subject "<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>eneon</title>
</head>
<body  onload="onLoad()" id="body">

<img />

<span id="test">test</span>
</body>
</html>
" does not match.
Time was: 0.000248508 ms per iteration and 2.48508 ms in total, using 10000 iterations.
The pattern is already parsed and compiled when we enter the test loop so such timings are not included.

I started competing with preg_match and was truly disappointed when my timings per iteration were about four times slower than with preg_match.

Then I made my own string class and a stringchecker class to inspect string data without needing to make any copies of the data before checking it. and Voila!!! Now it is between 3 to 10 times faster than preg_match depending on the pattern and subject.
I also have a multimatch mode similar to preg_match_all, where my version is approx 50% slower last time I tested it.

When the pattern is not super-trivial, like ^\d+ I have a huge advandage over preg_match as I can reuse the pattern without recompiling it.
But I am sure preg has better code...

Now take a look at the time above and compare it with the similar situation in the earlier result:

Code: Select all

Pattern: ^(\d+)
 Number of capturing subpatterns: 2
Subject "<img src="gui/smiley.png"/>" does not match.
Time was: 0.00126005 ms per iteration and 630.023 ms in total, using 500000 iterations.
Here we can see that the time it takes to return a negative match was improved by 5 times.

It is very important that returning a false match using must-match-start (^[pattern]) is really fast, because as my bigger parser advances over the document, it will test the wrong pattern almost every time the document position is tested. Only a few percent of the testing will not fail immediately.

The regexp library is finished now and as a bonus, I've learned to code some python as I use scons for build system.

Now I will dive into speeding up the big parser. I have great hope!
User avatar
Buddha443556
Forum Regular
Posts: 873
Joined: Fri Mar 19, 2004 1:51 pm

Post by Buddha443556 »

What version of PHP are you using in regards to preg_match for comparison?
User avatar
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

You genius! Image
User avatar
Heavy
Forum Contributor
Posts: 478
Joined: Sun Sep 22, 2002 7:36 am
Location: Viksjöfors, Hälsingland, Sweden
Contact:

Post by Heavy »

Buddha443556 wrote:What version of PHP are you using in regards to preg_match for comparison?
phpinfo() says:
PHP Version 5.1.1-gentoo
PCRE Library Version 6.2 01-Aug-2005

The test is trying to be fair.
The data is preloaded from a file with file_get_contents() and stored in a $string, just like in the C++ version. The php version of the test looks like this:

Code: Select all

<?php

header ("Content-type: text/plain; charset=iso-8859-1;");

function parseTime ($arr){
	return (float) $arr['sec'] + $arr['usec']/1000000;
}

$str = file_get_contents("/home/jonas/development/proj/uniparser/testdata/xhtml2.html");
$count = 10000	;
$time[0] = gettimeofday();
while($a++ < $count ){
	// preg_match("%\d+%", $str, $result);
	preg_match("%^(\\w+)="([^"]+)"%", $str, $result);
	// preg_match_all("%<(\\w+)\\s*((\\w+)="([^"]*)"\\s*)*/? >%", $str, $result);
}
$time[1] = gettimeofday();
$msec = ((parseTime($time[1]) - parseTime($time[0])) * 1000);

print_r($result);
echo "Time was: " . ($msec/$count) . "ms\nTotal time: $msec\n";
?>
I have no idea how PHP5.1 passes strings to that function, but it seems there is no unnecessary copying going on. (Don't know.)
I had to access the string data in memory directly to be able to compete with preg_match at all, and that's knowing that preg_match also has to reparse the pattern on every call (how impressive!!!)

On the other hand, a pattern like ^\d+ isn't very much to parse, so when I compare with that, my version isn't as much faster as when doing for example ^(\w+)="([^"]+)" which is rather invalid to test for anyway. (matching an XML-attribute at the start och a file...)

But it makes me feel good that even if I write complex patterns I am sure it'll be rather fast to report a failure.
The big parser utilizing this little regexp thingy will probably only try matching beginning of strings, as the parser tests all entities and must find a match to advance. Syntax is not loose. I can define complete languages, and everything MUST match perfectly. Thus, making a ^[pattern] test on substrings as the parser advances over the document is very relevant.
Post Reply