Page 1 of 1

Can't wrap my head around it!

Posted: Sat Jun 23, 2007 9:55 pm
by Ambush Commander
Have you ever written some code that, outwardly, works, but you can't figure out why it works? It just does?

I just finished working on some code that is like that. I couldn't wrap my head around the problem, so I decided to go off what I knew worked, extended it a little, twiddled around with the variables, and now it works. It shouldn't work, but it does. And I'm worried there's some deep logic bug that my testing hasn't poked at yet.

Argh.

Posted: Sat Jun 23, 2007 10:02 pm
by superdezign
Haha, that happened to me with my own code. I was implementing a Registry pattern (which has simplified so much!) and ended up changing a lot of things, and it wasn't working. I deleted a line, got an error, put the line back, and suddenly it was working again. No idea what I did.

As for your case though, that happens to me whenever I read anything on tokenization. I just don't get it. I wrote my own though and it works. You may not have the option of rewriting it, but if you can, give it a shot.

I don't learn coding by example. Just can't do it. There's too much debugging going from scratch to a complete script for a code snippet to do it for me.

Posted: Sat Jun 23, 2007 10:02 pm
by Benjamin
I know exactly what you mean. I think the fact that your worried about a logic bug means that there may have been something you meant to do/add/check etc, but forgot. It will come back to you.

Posted: Sun Jun 24, 2007 9:48 am
by Oren
Then why don't you post the code here so we can help you?

Posted: Sun Jun 24, 2007 10:52 am
by Ambush Commander
I think I solved it; there was a little thing that I had forgotten to take care of and it resulted in an infinite loop during an edge case. But I'm afraid that you guys probably wouldn't understand it.

Posted: Sun Jun 24, 2007 11:17 am
by Oren
Ambush Commander wrote:But I'm afraid that you guys probably wouldn't understand it.
Explain? (we are too dumb? :P )

Posted: Sun Jun 24, 2007 11:47 am
by Ambush Commander
It's just really specific. I'll try anyway though.

The system I'm dealing with is a parser for a stream of tokens. The tokens are HTML tags and text, and the parser's job is to make these tokens well-formed. The basic principle is easy enough: keep track of the current nesting and make sure all the closing tag tokens line up with the nesting.

My next job was to implement auto-paragraphing. The first implementation I did was extremely invasive, directly inspecting all of the tokens and, when it noticed a text token with a double newline in it, it inserted a processed version of that text with paragraph tags into the result array, and moved on. The trouble with this approach was that there was no way to implement any other functionality without re-processing the result array.

I tried to genericize the approach and make it work with multiple "Injector"s I called it. One of the primary things I had to do was make sure any tokens the injector returned where injected back into the source token stream, so that other formatters could have a whack at it. Here's where I got confused: how do I make sure that a formatter doesn't re-process a token it's already processed?

Here's the code:

Code: Select all

/**
 * Takes tokens makes them well-formed (balance end tags, etc.)
 */
class HTMLPurifier_Strategy_MakeWellFormed extends HTMLPurifier_Strategy
{
    
    function execute($tokens, $config, &$context) {
        
        $definition = $config->getHTMLDefinition();
        $generator = new HTMLPurifier_Generator();
        
        $current_nesting = array();
        $context->register('CurrentNesting', $current_nesting);
        
        $tokens_index = null;
        $context->register('InputIndex', $tokens_index);
        $context->register('InputTokens', $tokens);
        
        $result = array();
        $context->register('OutputTokens', $result);
        
        $escape_invalid_tags = $config->get('Core', 'EscapeInvalidTags');
        
        // -- begin INJECTOR --
        // factor this stuff out to its own class
        
        $injector = array();
        $injector_skip = array();
        
        if ($config->get('Core', 'AutoParagraph')) {
            $injector[] = new HTMLPurifier_Injector_AutoParagraph();
            // decrement happens first, so set to one so we start at zero
            $injector_skip[] = 1;
        }
        
        if ($config->get('Core', 'AutoLinkify')) {
            $injector[] = new HTMLPurifier_Injector_Linkify();
            $injector_skip[] = 1;
        }
        
        // array index of the injector that resulted in an array
        // substitution. This enables processTokens() to know which
        // injectors are affected by the added tokens and which are
        // not (namely, the ones after the current injector are not
        // affected)
        $current_injector = false;
        
        $context->register('Injector', $injector);
        $context->register('CurrentInjector', $current_injector);
        
        // number of tokens to skip + 1
        // before processing, this gets decremented: if it equals zero,
        // it means the injector is active and is processing tokens, if
        // it is greater than zero, then it is inactive, presumably having
        // been the source of the tokens
        $context->register('InjectorSkip', $injector_skip);
        
        // -- end INJECTOR --
        
        for ($tokens_index = 0; isset($tokens[$tokens_index]); $tokens_index++) {
            
            // if all goes well, this token will be passed through unharmed
            $token = $tokens[$tokens_index];
            
            foreach ($injector as $i => $x) {
                if ($injector_skip[$i] > 0) $injector_skip[$i]--;
            }
            
            // quick-check: if it's not a tag, no need to process
            if (empty( $token->is_tag )) {
                
                // duplicated with handleStart
                if ($token->type === 'text') {
                     foreach ($injector as $i => $x) {
                         if (!$injector_skip[$i]) {
                             $x->handleText($token, $config, $context);
                         }
                         if (is_array($token)) {
                             $current_injector = $i;
                             break;
                         }
                     }
                }
                
                $this->processToken($token, $config, $context);
                continue;
            }
            
            $info = $definition->info[$token->name]->child;
            
            // test if it claims to be a start tag but is empty
            if ($info->type == 'empty' && $token->type == 'start') {
                $result[] = new HTMLPurifier_Token_Empty($token->name, $token->attr);
                continue;
            }
            
            // test if it claims to be empty but really is a start tag
            if ($info->type != 'empty' && $token->type == 'empty' ) {
                $result[] = new HTMLPurifier_Token_Start($token->name, $token->attr);
                $result[] = new HTMLPurifier_Token_End($token->name);
                continue;
            }
            
            // automatically insert empty tags
            if ($token->type == 'empty') {
                $result[] = $token;
                continue;
            }
            
            // start tags have precedence, so they get passed through...
            if ($token->type == 'start') {
                
                // ...unless they also have to close their parent
                if (!empty($current_nesting)) {
                    
                    $parent = array_pop($current_nesting);
                    $parent_info = $definition->info[$parent->name];
                    
                    // this can be replaced with a more general algorithm:
                    // if the token is not allowed by the parent, auto-close
                    // the parent
                    if (!isset($parent_info->child->elements[$token->name])) {
                        // close the parent, then append the token
                        $result[] = new HTMLPurifier_Token_End($parent->name);
                        $result[] = $token;
                        $current_nesting[] = $token;
                        continue;
                    }
                    
                    $current_nesting[] = $parent; // undo the pop
                }
                
                // injectors
                foreach ($injector as $i => $x) {
                    if (!$injector_skip[$i]) {
                        $x->handleStart($token, $config, $context);
                    }
                    if (is_array($token)) {
                        $current_injector = $i;
                        break;
                    }
                }
                
                $this->processToken($token, $config, $context);
                continue;
            }
            
            // sanity check: we should be dealing with a closing tag
            if ($token->type != 'end') continue;
            
            // make sure that we have something open
            if (empty($current_nesting)) {
                if ($escape_invalid_tags) {
                    $result[] = new HTMLPurifier_Token_Text(
                        $generator->generateFromToken($token, $config, $context)
                    );
                }
                continue;
            }
            
            // first, check for the simplest case: everything closes neatly
            $current_parent = array_pop($current_nesting);
            if ($current_parent->name == $token->name) {
                $result[] = $token;
                continue;
            }
            
            // okay, so we're trying to close the wrong tag
            
            // undo the pop previous pop
            $current_nesting[] = $current_parent;
            
            // scroll back the entire nest, trying to find our tag.
            // (feature could be to specify how far you'd like to go)
            $size = count($current_nesting);
            // -2 because -1 is the last element, but we already checked that
            $skipped_tags = false;
            for ($i = $size - 2; $i >= 0; $i--) {
                if ($current_nesting[$i]->name == $token->name) {
                    // current nesting is modified
                    $skipped_tags = array_splice($current_nesting, $i);
                    break;
                }
            }
            
            // we still didn't find the tag, so remove
            if ($skipped_tags === false) {
                if ($escape_invalid_tags) {
                    $result[] = new HTMLPurifier_Token_Text(
                        $generator->generateFromToken($token, $config, $context)
                    );
                }
                continue;
            }
            
            // okay, we found it, close all the skipped tags
            // note that skipped tags contains the element we need closed
            $size = count($skipped_tags);
            for ($i = $size - 1; $i >= 0; $i--) {
                $result[] = new HTMLPurifier_Token_End($skipped_tags[$i]->name);
            }
            
        }
        
        // we're at the end now, fix all still unclosed tags
        // not using processToken() because at this point we don't
        // care about current nesting
        if (!empty($current_nesting)) {
            $size = count($current_nesting);
            for ($i = $size - 1; $i >= 0; $i--) {
                $result[] =
                    new HTMLPurifier_Token_End($current_nesting[$i]->name);
            }
        }
        
        $context->destroy('CurrentNesting');
        $context->destroy('InputTokens');
        $context->destroy('InputIndex');
        $context->destroy('OutputTokens');
        
        $context->destroy('Injector');
        $context->destroy('CurrentInjector');
        $context->destroy('InjectorSkip');
        
        return $result;
    }
    
    function processToken($token, $config, &$context) {
        if (is_array($token)) {
            // the original token was overloaded by an injector, time
            // to some fancy acrobatics
            
            $tokens              =& $context->get('InputTokens');
            $tokens_index        =& $context->get('InputIndex');
            // $tokens_index is decremented so that the entire set gets
            // re-processed
            array_splice($tokens, $tokens_index--, 1, $token);
            
            // adjust the injector skips based on the array substitution
            $injector_skip    =& $context->get('InjectorSkip');
            $current_injector =& $context->get('CurrentInjector');
            
            $offset = count($token) + 1;
            for ($i = 0; $i <= $current_injector; $i++) {
                $injector_skip[$i] += $offset;
            }
            
        } elseif ($token) {
            // regular case
            $result =& $context->get('OutputTokens');
            $current_nesting =& $context->get('CurrentNesting');
            $result[] = $token;
            if ($token->type == 'start') {
                $current_nesting[] = $token;
            } elseif ($token->type == 'end') {
                // theoretical: this code doesn't get run because performing
                // the calculations inline is more efficient, and
                // end tokens (currently) do not cause a handler invocation
                array_pop($current_nesting);
            }
        }
    }
    
}
I think this code works.

Posted: Sun Jun 24, 2007 11:53 am
by feyd
I think this is an appropriate time to move this to T&D. :)

Posted: Sun Jun 24, 2007 11:59 am
by Oren
I hope it does :P

Edit: In your code you have this:

Code: Select all

for ($tokens_index = 0; isset($tokens[$tokens_index]); $tokens_index++)
Now I have to say that I don't like this approach, since for an array of size N, you'll be doing N isset()'s.

Posted: Sun Jun 24, 2007 6:33 pm
by Jenk
I tend to find that if the code is a "blob" script, that is inadvertantly a big mess of procedural code, yes, I struggle to understand it - too much going on for my mind to track. However, if the code base is well structured objects, and is well refined - No, I don't have any problems :)

Re: Can't wrap my head around it!

Posted: Sun Jun 24, 2007 7:26 pm
by alex.barylski
Ambush Commander wrote:Have you ever written some code that, outwardly, works, but you can't figure out why it works? It just does?

I just finished working on some code that is like that. I couldn't wrap my head around the problem, so I decided to go off what I knew worked, extended it a little, twiddled around with the variables, and now it works. It shouldn't work, but it does. And I'm worried there's some deep logic bug that my testing hasn't poked at yet.

Argh.
No. Although I understand what you mean, we are of different personalities I guess. Like a ringing phone drives me absolute ballistic...not understanding how code works does the same thing. I couldn't continue until the problem is understood.

The way I see it, if you think there might be something deep within it's core that is likely going to break in the future....chances are you are correct so go back and figure it out thouroughly before you continue with anything else. Put your life on the line for the code you write like it's being used to navigate the shuttle, otherwise your being complacent. :)

Posted: Sun Jun 24, 2007 9:37 pm
by Ambush Commander
Now I have to say that I don't like this approach, since for an array of size N, you'll be doing N isset()'s.
I'm under the impression that isset is a cheap function to call (it's not really a function at all, it's a language construct). Otherwise, I would have to keep track of the expanding size of the $tokens array, since it is being edited while things are going on.
I tend to find that if the code is a "blob" script, that is inadvertantly a big mess of procedural code, yes, I struggle to understand it - too much going on for my mind to track. However, if the code base is well structured objects, and is well refined - No, I don't have any problems
The code is blobby because I want it to run fast. You'll notice how I cleverly avoid calling userland functions unless it is absolutely necessary. If I had it my way, processTokens() would be a macro inlined at parse-time, not a function.

I rationalize this problem by claiming that the output of the function is well-defined and easy to test, so as long as you have a comprehensive unit test battery everything is hunky-dory. :?
The way I see it, if you think there might be something deep within it's core that is likely going to break in the future....chances are you are correct so go back and figure it out thouroughly before you continue with anything else. Put your life on the line for the code you write like it's being used to navigate the shuttle, otherwise your being complacent.
So if I found a problem, it's all good. :D (the trouble with unit testing is that it can only show an absence of errors)...