Please take a look at this Class and comment on the API

Not for 'how-to' coding questions but PHP theory instead, this forum is here for those of us who wish to learn about design aspects of programming with PHP.

Moderator: General Moderators

Post Reply
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Please take a look at this Class and comment on the API

Post by Ambush Commander »

This is basically my testing harness page. I'm starting off to make an HTML parser, sorta like Tidy, but not really (it doesn't try to fix your errors, it just converts <> to &lt;&gt; when it knows something is wrong).

I haven't actually started implementing the class yet, but I've got this test harness which maps out the API of the expected new class. Can you guys comment on it (the API)? And yes, I'm thinking up of more tests to use on the class.

Code: Select all

<?php

require_once ( 'TWP_Parser_HTML.php' );
require_once ( 'PHPUnit.php' );

class TWP_Parser_HTML_Test extends PHPUnit_TestCase
{
    
    var $object;
    
    // constructor of the test suite
    function ParagraphTest($name) {
       $this->PHPUnit_TestCase($name);
    }
    
    function setUp() {
        $this->object = new TWP_Parser_HTML;
        $this->object->setOption('multiline',false);
        $this->object->setOption('strip_invalid', 0);
    }
    
    function tearDown() {
        unset($object);
    }
    
    function testSanitize() {
        $input = '<html>';
        $this->object->setRaw($var);
        $output = $this->object->getClean();
        $expected = '<html>';
        $this->assertTrue($output === $expected, "\n\tInput: '$input'\n\tOutput: '$output'\n\tExpects: '$expected'\n\t>");
    }
    
    function testCloseTags() {
        $input = '<b>This is bold text';
        $this->object->setRaw($var);
        $output = $this->object->getClean();
        $expected = '<b>This is bold text</b>';
        $this->assertTrue($output === $expected, "\n\tInput: '$input'\n\tOutput: '$output'\n\tExpects: '$expected'\n\t>");
    }
    
    function testNesting() {
        $input = '</i></b><b><i></b></i><span></i>';
        $this->object->setRaw($var);
        $output = $this->object->getClean();
        $expected = '<i><b><b><i></i></b><i><span><i></span>';
        $this->assertTrue($output === $expected, "\n\tInput: '$input'\n\tOutput: '$output'\n\tExpects: '$expected'\n\t>");
    }
    
    function testInvalidAttribute() {
        $input = '<a href="#" onClick="javascript:alert(\'foo!\');">Ohm</a>';
        $this->object->setRaw($var);
        $output = $this->object->getClean();
        $expected = '<a href="#" onClick="javascript:alert(\'foo!\');">Ohm</a>';
        $this->assertTrue($output === $expected, "Scripting onClick\n\tInput: '$input'\n\tOutput: '$output'\n\tExpects: '$expected'\n\t>");
        
        $input = '<div style="position:fixed;">Foobar</div>';
        $this->object->setRaw($var);
        $output = $this->object->getClean();
        $expected = '<div style="position:fixed;">Foobar</div>';
        $this->assertTrue($output === $expected, "Layout Breaking position CSS\n\tInput: '$input'\n\tOutput: '$output'\n\tExpects: '$expected'\n\t>");
    }
    
}

$suite  = new PHPUnit_TestSuite("TWP_Parser_HTML_Test");
$result = PHPUnit::run($suite);

header('Content-type: text/plain');

echo $result -> toString();



?>
McGruff
DevNet Master
Posts: 2893
Joined: Thu Jan 30, 2003 8:26 pm
Location: Glasgow, Scotland

Re: Please take a look at this Class and comment on the API

Post by McGruff »

I don't claim to be an experienced tester but for what it's worth here's my thoughts, for better or worse.
Ambush Commander wrote:I haven't actually started implementing the class yet
Excellent! I must admit I don't always follow TDD perfectly (write failing test then write some code) but it is a very nice way to work. Test-first helps to stop you linking the tests too closely to the implementation rather than, as they should be, minimal constraints on all possible solutions. If you've just written a class your mind can be too full of the solution you chose and you start writing tests for that. I find that's an easy trap to fall into. It's not all bad: this will at least verify the code you just wrote but, crucially, what they don't do is provide a proper critique for later refactorings. Another solution could be valid but still fail tests which are implementation-specific.

Interface - how about:

Code: Select all

$this->object->getClean($suspect)
Simpler the better.


If unclosed tags are automatically closed, perhaps testSanitise should assert that the unclosed <html> tag is closed? In general I'd be inclined just to discard any unclosed tags since you don't really know where the author of the text meant to close them - except for tags like html and body. A new test: testDiscardUnclosedTagsExceptHtmlAndBody?

If you stick with closing all unclosed tags, will you need: testWithMultipleUnClosedTags?

"testNesting" asserts that overlapping tags <b><i></b></i> will be resolved into <b><i></i></b>. Perhaps they should all just be discarded? You wouldn't want to <span><div></div></span>. Instead of discarding, you could get clever with some rules to resolve these issues on a tag-by-tag basis but that will add a lot of complexity under the bonnet - although not for the user.

I noticed you've got set up options (multiline etc) planned for the class. Instantiating the object (maybe give it a more explanatory name like "validator" or whatever rather than $this->object) in setUp means that these are fixed for all tests rather than being able to explore different options in different test methods. Maybe that's what you want.

TWP_Parser_Html doesn't immediately tell me what the class does. It's hard to get good names but very important. A thesaurus can be handy. I wouldn't claim that I always succeed but I spend a lot of time trying to find good names. I'll tolerate unwieldy length much more than obscurity.

You probably don't need to assert type when comparing strings unless you really need to differentiate between, say, a null and an empty string. At the same time I think you do have to pay a bit more attention to type than normal in tests. In the testing framework I use (SimpleTest) assertTrue doesn't compare type - there's an assertIdentical for that. There are other type-insensitive methods. I'm assuming phpUnit is similar - it may not be. Type doesn't make a difference here (one half of the equation is always a string) but generally you don't want to restrict the solution any more than you have to and therefore shouldn't assert type unless you really need to. Another example (again in SimpleTest) if you assertEqual two hashes, the key order doesn't matter. With assertIdentical, if two arrays have the same key=>value pairs but in different orders, the comparison returns false. Numerical arrays always have to have the same order for SimpleTest assertEqual to return true (I hacked together an assertIdenticalArrayValues method for that).

SimpleTest has a webtester which can enter values in form fields and submit the form. With this, you might be able to submit a few carefully chosen examples to a web xhtml validator etc, and check if your tidied html validates, if that would be useful. I can't remember if you can submit strings for validation as well as urls. At the least, with finished pages, you could take the grunge out of validating an entire website each time you change the design. Out of courtesy, you wouldn't want to be clogging up their bandwidth with thousands of tests every hour, which would be quite possible to do with automated tests.

One final ad for SimpleTest: afaik it's the only testing framework with mock objects. This allows you to use testing more as a design tool. Applications can be written top down, mocking out neighbouring objects as you go. The mocks define interfaces which you can come back and fill in later.

Fowler's take on mocks v stubs: Mocks aren't Stubs. I'm more of a mock-ist.
Last edited by McGruff on Sat Aug 06, 2005 11:26 am, edited 1 time in total.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I don't claim to be an experienced tester but for what it's worth here's my thoughts, for better or worse.
Neither am I :). Thanks for such a comprehensive comment!
Excellent! I must admit I don't always follow TDD perfectly (write failing test then write some code) but it is a very nice way to work.
This is the first time I'm actually using it, but I've read about it and it does sound very cool.
Test-first helps to stop you linking the tests too closely to the implementation rather than, as they should be, minimal constraints on all possible solutions. If you've just written a class your mind can be too full of the solution you chose and you start writing tests for that. I find that's an easy trap to fall into. It's not all bad: this will at least verify the code you just wrote but, crucially, what they don't do is provide a proper critique for later refactorings. Another solution could be valid but still fail tests which are implementation-specific.
See, the funny thing is I did have an HTML parser that I used (and was stack-based), but it was really simplistic and didn't renest tags, so I decided I wanted to wrap the logic in a class rather in just a function. So I do have a little idea about how I want to implement it. Hopefully my tests are NPOV.
Interface - how about:

Code: Select all

$this->object->getClean($suspect)
Simpler the better.
That probably would work. Good idea.
If unclosed tags are automatically closed, perhaps testSanitise should assert that the unclosed <html> tag is closed? In general I'd be inclined just to discard any unclosed tags since you don't really know where the author of the text meant to close them - except for tags like html and body. A new test: testDiscardUnclosedTagsExceptHtmlAndBody?
Actually, it was to eliminate a tag that I didn't want to keep. Hmm... I should expand that test. I white-list certain tags and allow them to be used.
If you stick with closing all unclosed tags, will you need: testWithMultipleUnClosedTags?

"testNesting" asserts that overlapping tags <b><i></b></i> will be resolved into <b><i></i></b>. Perhaps they should all just be discarded? You wouldn't want to <span><div></div></span>. Instead of discarding, you could get clever with some rules to resolve these issues on a tag-by-tag basis but that will add a lot of complexity under the bonnet - although not for the user.
Actually, that was part of the whole idea. You can't nest block-level elements in P tags, TABLE tags can't contain B tags unless they're inside a TR and TD tag. I've been studying the HTML 4.1 specification recently and the bulk of the programming, I think, is the planned checkContext($tag,$tag_tree) which makes sure a tag that is being opened is allowed in that context (the same can be said for text).

Tidy, from what I've seen, tries to be smart about misnested tags. I don't want this to be smart, so I'm just going to have it give a high priority to opening tags, and then kill off or add closing tags as necessary.
I noticed you've got set up options (multiline etc) planned for the class. Instantiating the object (maybe give it a more explanatory name like "validator" or whatever rather than $this->object) in setUp means that these are fixed for all tests rather than being able to explore different options in different test methods. Maybe that's what you want.
Never thought of it that way: I was just using the options to narrow down the functioning of it while the other options could be easily extended... yeah I'm going to have to document and test those.
TWP_Parser_Html doesn't immediately tell me what the class does. It's hard to get good names but very important. A thesaurus can be handy. I wouldn't claim that I always succeed but I spend a lot of time trying to find good names. I'll tolerate unwieldy length much more than obscurity.
What can I say? That's a very good idea.
You probably don't need to assert type when comparing strings unless you really need to differentiate between, say, a null and an empty string. At the same time I think you do have to pay a bit more attention to type than normal in tests. In the testing framework I use (SimpleTest) assertTrue doesn't compare type - there's an assertIdentical for that. There are other type-insensitive methods. I'm assuming phpUnit is similar - it may not be. Type doesn't make a difference here (one half of the equation is always a string) but generally you don't want to restrict the solution any more than you have to and therefore shouldn't assert type unless you really need to. Another example (again in SimpleTest) if you assertEqual two hashes, the key order doesn't matter. With assertIdentical, if two arrays have the same key=>value pairs but in different orders, the comparison returns false. Numerical arrays always have to have the same order for SimpleTest assertEqual to return true (I hacked together an assertIdenticalArrayValues method for that).
Err... assertTrue simply asserts that the expression is true (no comparison). But I think I get the jist of what you're saying.
SimpleTest has a webtester which can enter values in form fields and submit the form. With this, you might be able to submit a few carefully chosen examples to a web xhtml validator etc, and check if your tidied html validates, if that would be useful. I can't remember if you can submit strings for validation as well as urls. At the least, with finished pages, you could take the grunge out of validating an entire website each time you change the design. Out of courtesy, you wouldn't want to be clogging up their bandwidth with thousands of tests every hour, which would be quite possible to do with automated tests.
XHTML validation. Good idea. (you know, I've always wondered why they never released the validator as open code that could be operated locally).
One final ad for SimpleTest: afaik it's the only testing framework with mock objects. This allows you to use testing more as a design tool. Applications can be written top down, mocking out neighbouring objects as you go. The mocks define interfaces which you can come back and fill in later.
I'll take a look at SimpleTest. I've visited http://www.lastcraft.com before, but when it came time to find a harness I defaulted to PEAR.

I'll try to think about everything you've said and incorporate it into my tests. It's become apparent to me that my tests are way to vague :oops: Thanks for the response!
McGruff
DevNet Master
Posts: 2893
Joined: Thu Jan 30, 2003 8:26 pm
Location: Glasgow, Scotland

Post by McGruff »

Ambush Commander wrote:Err... assertTrue simply asserts that the expression is true (no comparison).
assertTrue is checking a (type-sensitive) comparison though.

Code: Select all

null === '' (false)
null == '' (true)
OK you know that. If you could have null-type input, and that generates '' output (or vice versa), a testWithNullInput using the === operator will fail. Now, that might be what you want but odds on it wouldn't matter in which case you would use a == comparison to remove the type constraint. I'm just being picky but I think the type comparison might be redundant. It's really a more general point about not constraining the solutions any more than you have to since that could interfere with later refactoring.

That's one I forgot: testWithEmptyStringInput or testWithNullInput - if either are a possibility.

In SimpleTest I'd maybe do this to save a bit of writing:

Code: Select all

// using SimpleTest assertEqual method
foreach($this->samples as $sample_pair) {
    $this->assertEqual($object->getClean($sample_pair['input']), $sample_pair['expected']);
}
Now you just have to add to the $sample_pairs array values to create new tests (define in constructor).

I'm not entirely sure if putting assertions in a loop is a good thing. With just one test, you lose the method names which help to explain exactly what is being tested. It migth be OK for very simple tests but I'd drop it immediately if clarity is at risk. You also lose your custom error messages. However the failure message generated by assertEqual will print the two strings being compared so you can still see where the problem lies.
It's become apparent to me that my tests are way to vague
I'm just nitpicking really. Looks good. Nice to see someone testing.
Last edited by McGruff on Sat Aug 06, 2005 11:23 am, edited 1 time in total.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

OK you know that. If you could have null-type input, and that generates '' output (or vice versa), a testWithNullInput using the === operator will fail. Now, that might be what you want but odds on it wouldn't matter in which case you would use a == comparison to remove the type constraint. I'm just being picky but I think the type comparison might be redundant. It's really a more general point about not constraining the solutions any more than you have to since that could interfere with later refactoring.
Hmm... I guess so... although I expect my output to always be strings, right?
In SimpleTest I'd maybe do this to save a bit of writing:

Code: Select all

// using SimpleTest assertEqual method
foreach($this->samples as $sample_pair) {
    $this->assertEqual($object->getClean($sample_pair['input']), $sample_pair['expected']);
}
Now you just have to add to the $sample_pairs array values to create new tests (define in constructor).

I'm not entirely sure if putting assertions in a loop is a good thing. With just one test, you lose the method names which help to explain exactly what is being tested. It migth be OK for very simple tests but I'd drop it immediately if clarity is at risk. You also lose your custom error messages. However the failure message generated by assertEqual will print the two strings being compared so you can still see where the problem lies.
I'll try that. I can think of some instances where this would be a good idea.

I'm porting it to simple test, and I think I'm going to enjoy using it.

BTW, what's your stance on invalid input? Should I output empty strings? Type cast them to strings? Throw error messages?
McGruff
DevNet Master
Posts: 2893
Joined: Thu Jan 30, 2003 8:26 pm
Location: Glasgow, Scotland

Post by McGruff »

Ambush Commander wrote:BTW, what's your stance on invalid input? Should I output empty strings? Type cast them to strings? Throw error messages?
Do you mean if it receives something other than a string?

I suppose don't create bad input for the class in the first place. The discipline of unit-testing will help to ensure other parts of the app can't pass on anything which will make it choke. That avoids having to add any code to deal with bad input. But it also assumes that you have control of the context. If it was a class in an open source library you might want to do more - or maybe normal php errors will be enough.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

That avoids having to add any code to deal with bad input. But it also assumes that you have control of the context. If it was a class in an open source library you might want to do more - or maybe normal php errors will be enough.
Exactly.

* "normal php errors" - Fatal? Alert? Error?
* "You might want to do more" - What exactly is this "more" you speak of?

I'd like to open-source it (that would be really cool), but yes, controlling the context is important.

Simpletest is really great! Concerning your concerns about the for loop, I added a little functionality to the class, in case anyone is interested...

Code: Select all

class HtmlReporterGrouper extends HtmlReporter
{
    function _getCss() {
        return parent::_getCss() . '
        DIV.group { border-top:1px solid #000;border-bottom:1px solid #000;margin:.5em 0 .5em 2em;}
        H2.groupname { font-weight: 900; font-size:1.4em;margin:0;border-bottom:1px solid #000;background:#DFDFDF;}';
    }
}

class UnitTestCaseGrouper extends UnitTestCase {
    
    var $toggleGroupStatus = false;
    var $reporter;
    
    function UnitTestCaseGrouper() {
        $this->reporter = new HtmlReporterGrouper();
    }
    
    function tearDown() {
        if ($this->toggleGroupStatus) {
            $this->toggleGroup();
        }
    }
    
    function _toggleGroup($togglename = null) {
        if ($this->toggleGroupStatus) {
            $this->toggleGroupStatus = false;
            echo '</div>';
        } else {
            $this->toggleGroupStatus = true;
            if (isset($togglename)) {
                echo '<div class="group"><h2 class="groupname">'.$togglename.'</h2>';
            } else {
                echo '<div class="group">';
            }
        }
    }
    
}
Used like (these are the rest of the tests I've been working on...

Code: Select all

class Test_SingleLineInput extends UnitTestCaseGrouper 
{
    
    ////////////////////////////////////////////////
    //Construction
    
    var $reporter;
    var $parser;
    
    function Test_SingleLineInput() {
       $this->UnitTestCase('Test_SingleLineInput');
       
       if (method_exists($this,'UnitTestCaseGrouper')) {
            $this->UnitTestCaseGrouper();
       } else {
            $this->reporter = new HtmlReporter();
       }
       
       $this->parser = new TWP_Parser_HTMLEscapingSanitizer;
       $this->parser->setOption('multiline',false);
    }
    
    //Compatibility
    function toggleGroup($togglename) {
        if (method_exists($this,'_toggleGroup')) {
            $this->_toggleGroup($togglename);
        }
    }
    
    ////////////////////////////////////////////////
    //Tests
    
    function testInvalidInput() {
        
        $input = '';
        $output = $this->parser->parse($input);
        $expected = '';
        $this->assertIdentical($output,$expected,'Return string empty for string empty input');
        
        $output = $this->parser->parse(null);
        $expected = '';
        $this->assertIdentical($output,$expected,'Return string empty for null input');
        $this->assertNoErrors('Handle without E_STRICT errors');
        
    }
    
    function testSanitize() {
        
        $this->toggleGroup('Test Invalid Tags');
        
        $input_array = array('<html>','<body>','<style>','<asdf>','<Log Out>'
        ,'<be>','<array style="booby:asfd;">','<style type="text/css" boom="boom">'
        ,'<243b>','<form>','<++>','<%@A5yased sfd>','<>');
        
        foreach($input_array as $input) {
            $expected = htmlentities($input, ENT_COMPAT);
            $output = $this->parser->parse($input);
            $this->assertEqual($output, $expected);
        }
        
        $this->toggleGroup();
    }
    
    function testCloseTags() {
        
        $input    = '<b>This is bold text';
        $output   = $this->parser->parse($input);
        $expected = '<b>This is bold text</b>';
        $this->assertTrue($output == $expected, 'Insert ending-tag B at end');
        
        $input    = '<i><b>Bang bang!</i>';
        $output   = $this->parser->parse($input);
        $expected = '<b>This is bold text</b>';
        $this->assertTrue($output == $expected, 'Insert ending-tag B before ending-tag I');
        
    }
    
    function testNesting() {
        $input    = '<span><div></div></span>';
        $output   = $this->parser->parse($input);
        $expected = '<span><div></div></span>';
        $this->assertEqual($output,$expected,'Escape block-level tagset in inline tagset');
        
        $input = '</i></b><b><i></b></i><span></i>';
        $output = $this->parser->parse($input);
        $expected = '<i><b><b><i></i></b><i><span><i></span>';
        $this->assertTrue($output == $expected, 'Gauntlet!');
    }
    
    function testInvalidAttribute() {
        $input = '<a href="#" onClick="javascript:alert(\'foo!\');">Ohm</a>';
        $output = $this->parser->parse($input);
        $expected = '<a href="#" onClick="javascript:alert(\'foo!\');">Ohm</a>';
        $this->assertTrue($output == $expected, 'Escape tag with onClick');
        
        $input = '<div style="position:fixed;">Foobar</div>';
        $output = $this->parser->parse($input);
        $expected = '<div style="position:fixed;">Foobar</div>';
        $this->assertTrue($output == $expected, 'Escape tag with CSS position:fixed');
    }
    
    function testGauntlet() {
        
    }
    
}

$test = &new Test_SingleLineInput;
$test->run($test->reporter);
I'm still working on it, but your suggestions helped a lot.
McGruff
DevNet Master
Posts: 2893
Joined: Thu Jan 30, 2003 8:26 pm
Location: Glasgow, Scotland

Post by McGruff »

Ambush Commander wrote: * "normal php errors" - Fatal? Alert? Error?
* "You might want to do more" - What exactly is this "more" you speak of?
I meant the standard of errors where eg a string gets passed into a foreach loop.

If you have full control over the context in which a class is used, you don't need to do so much input checking. A library class might have to be more robust since it could be used in a wider variety of contexts. Just a general point. I'm not really sure what your requirements are exactly.

If you're TDD-ing, make sure you don't add any unecessary constraints to the test case. One of the good things about TDD is that it tells you when to stop (the test are running green). It's easy to get carried away and add unecessary requirements which you think you might need later. Better to add them later, if you need them.

Again, just a general point. There's probably not a lot to worry about in this particular case.
fastfingertips
Forum Contributor
Posts: 242
Joined: Sun Dec 28, 2003 1:40 am
Contact:

Post by fastfingertips »

This is to much from me.

Why didn't you use a XML parser? Is easier and much more faster.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Because the HTML specification contains constraints that an XML parser does not recognize (if you've ever read the doctype you'll notice that it's not comprehensive).

Plus, I never figured out how to use the XML parser.
Post Reply