Can't wrap my head around it!
Moderator: General Moderators
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Can't wrap my head around it!
Have you ever written some code that, outwardly, works, but you can't figure out why it works? It just does?
I just finished working on some code that is like that. I couldn't wrap my head around the problem, so I decided to go off what I knew worked, extended it a little, twiddled around with the variables, and now it works. It shouldn't work, but it does. And I'm worried there's some deep logic bug that my testing hasn't poked at yet.
Argh.
I just finished working on some code that is like that. I couldn't wrap my head around the problem, so I decided to go off what I knew worked, extended it a little, twiddled around with the variables, and now it works. It shouldn't work, but it does. And I'm worried there's some deep logic bug that my testing hasn't poked at yet.
Argh.
- superdezign
- DevNet Master
- Posts: 4135
- Joined: Sat Jan 20, 2007 11:06 pm
Haha, that happened to me with my own code. I was implementing a Registry pattern (which has simplified so much!) and ended up changing a lot of things, and it wasn't working. I deleted a line, got an error, put the line back, and suddenly it was working again. No idea what I did.
As for your case though, that happens to me whenever I read anything on tokenization. I just don't get it. I wrote my own though and it works. You may not have the option of rewriting it, but if you can, give it a shot.
I don't learn coding by example. Just can't do it. There's too much debugging going from scratch to a complete script for a code snippet to do it for me.
As for your case though, that happens to me whenever I read anything on tokenization. I just don't get it. I wrote my own though and it works. You may not have the option of rewriting it, but if you can, give it a shot.
I don't learn coding by example. Just can't do it. There's too much debugging going from scratch to a complete script for a code snippet to do it for me.
Last edited by superdezign on Sat Jun 23, 2007 10:13 pm, edited 1 time in total.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
It's just really specific. I'll try anyway though.
The system I'm dealing with is a parser for a stream of tokens. The tokens are HTML tags and text, and the parser's job is to make these tokens well-formed. The basic principle is easy enough: keep track of the current nesting and make sure all the closing tag tokens line up with the nesting.
My next job was to implement auto-paragraphing. The first implementation I did was extremely invasive, directly inspecting all of the tokens and, when it noticed a text token with a double newline in it, it inserted a processed version of that text with paragraph tags into the result array, and moved on. The trouble with this approach was that there was no way to implement any other functionality without re-processing the result array.
I tried to genericize the approach and make it work with multiple "Injector"s I called it. One of the primary things I had to do was make sure any tokens the injector returned where injected back into the source token stream, so that other formatters could have a whack at it. Here's where I got confused: how do I make sure that a formatter doesn't re-process a token it's already processed?
Here's the code:
I think this code works.
The system I'm dealing with is a parser for a stream of tokens. The tokens are HTML tags and text, and the parser's job is to make these tokens well-formed. The basic principle is easy enough: keep track of the current nesting and make sure all the closing tag tokens line up with the nesting.
My next job was to implement auto-paragraphing. The first implementation I did was extremely invasive, directly inspecting all of the tokens and, when it noticed a text token with a double newline in it, it inserted a processed version of that text with paragraph tags into the result array, and moved on. The trouble with this approach was that there was no way to implement any other functionality without re-processing the result array.
I tried to genericize the approach and make it work with multiple "Injector"s I called it. One of the primary things I had to do was make sure any tokens the injector returned where injected back into the source token stream, so that other formatters could have a whack at it. Here's where I got confused: how do I make sure that a formatter doesn't re-process a token it's already processed?
Here's the code:
Code: Select all
/**
* Takes tokens makes them well-formed (balance end tags, etc.)
*/
class HTMLPurifier_Strategy_MakeWellFormed extends HTMLPurifier_Strategy
{
function execute($tokens, $config, &$context) {
$definition = $config->getHTMLDefinition();
$generator = new HTMLPurifier_Generator();
$current_nesting = array();
$context->register('CurrentNesting', $current_nesting);
$tokens_index = null;
$context->register('InputIndex', $tokens_index);
$context->register('InputTokens', $tokens);
$result = array();
$context->register('OutputTokens', $result);
$escape_invalid_tags = $config->get('Core', 'EscapeInvalidTags');
// -- begin INJECTOR --
// factor this stuff out to its own class
$injector = array();
$injector_skip = array();
if ($config->get('Core', 'AutoParagraph')) {
$injector[] = new HTMLPurifier_Injector_AutoParagraph();
// decrement happens first, so set to one so we start at zero
$injector_skip[] = 1;
}
if ($config->get('Core', 'AutoLinkify')) {
$injector[] = new HTMLPurifier_Injector_Linkify();
$injector_skip[] = 1;
}
// array index of the injector that resulted in an array
// substitution. This enables processTokens() to know which
// injectors are affected by the added tokens and which are
// not (namely, the ones after the current injector are not
// affected)
$current_injector = false;
$context->register('Injector', $injector);
$context->register('CurrentInjector', $current_injector);
// number of tokens to skip + 1
// before processing, this gets decremented: if it equals zero,
// it means the injector is active and is processing tokens, if
// it is greater than zero, then it is inactive, presumably having
// been the source of the tokens
$context->register('InjectorSkip', $injector_skip);
// -- end INJECTOR --
for ($tokens_index = 0; isset($tokens[$tokens_index]); $tokens_index++) {
// if all goes well, this token will be passed through unharmed
$token = $tokens[$tokens_index];
foreach ($injector as $i => $x) {
if ($injector_skip[$i] > 0) $injector_skip[$i]--;
}
// quick-check: if it's not a tag, no need to process
if (empty( $token->is_tag )) {
// duplicated with handleStart
if ($token->type === 'text') {
foreach ($injector as $i => $x) {
if (!$injector_skip[$i]) {
$x->handleText($token, $config, $context);
}
if (is_array($token)) {
$current_injector = $i;
break;
}
}
}
$this->processToken($token, $config, $context);
continue;
}
$info = $definition->info[$token->name]->child;
// test if it claims to be a start tag but is empty
if ($info->type == 'empty' && $token->type == 'start') {
$result[] = new HTMLPurifier_Token_Empty($token->name, $token->attr);
continue;
}
// test if it claims to be empty but really is a start tag
if ($info->type != 'empty' && $token->type == 'empty' ) {
$result[] = new HTMLPurifier_Token_Start($token->name, $token->attr);
$result[] = new HTMLPurifier_Token_End($token->name);
continue;
}
// automatically insert empty tags
if ($token->type == 'empty') {
$result[] = $token;
continue;
}
// start tags have precedence, so they get passed through...
if ($token->type == 'start') {
// ...unless they also have to close their parent
if (!empty($current_nesting)) {
$parent = array_pop($current_nesting);
$parent_info = $definition->info[$parent->name];
// this can be replaced with a more general algorithm:
// if the token is not allowed by the parent, auto-close
// the parent
if (!isset($parent_info->child->elements[$token->name])) {
// close the parent, then append the token
$result[] = new HTMLPurifier_Token_End($parent->name);
$result[] = $token;
$current_nesting[] = $token;
continue;
}
$current_nesting[] = $parent; // undo the pop
}
// injectors
foreach ($injector as $i => $x) {
if (!$injector_skip[$i]) {
$x->handleStart($token, $config, $context);
}
if (is_array($token)) {
$current_injector = $i;
break;
}
}
$this->processToken($token, $config, $context);
continue;
}
// sanity check: we should be dealing with a closing tag
if ($token->type != 'end') continue;
// make sure that we have something open
if (empty($current_nesting)) {
if ($escape_invalid_tags) {
$result[] = new HTMLPurifier_Token_Text(
$generator->generateFromToken($token, $config, $context)
);
}
continue;
}
// first, check for the simplest case: everything closes neatly
$current_parent = array_pop($current_nesting);
if ($current_parent->name == $token->name) {
$result[] = $token;
continue;
}
// okay, so we're trying to close the wrong tag
// undo the pop previous pop
$current_nesting[] = $current_parent;
// scroll back the entire nest, trying to find our tag.
// (feature could be to specify how far you'd like to go)
$size = count($current_nesting);
// -2 because -1 is the last element, but we already checked that
$skipped_tags = false;
for ($i = $size - 2; $i >= 0; $i--) {
if ($current_nesting[$i]->name == $token->name) {
// current nesting is modified
$skipped_tags = array_splice($current_nesting, $i);
break;
}
}
// we still didn't find the tag, so remove
if ($skipped_tags === false) {
if ($escape_invalid_tags) {
$result[] = new HTMLPurifier_Token_Text(
$generator->generateFromToken($token, $config, $context)
);
}
continue;
}
// okay, we found it, close all the skipped tags
// note that skipped tags contains the element we need closed
$size = count($skipped_tags);
for ($i = $size - 1; $i >= 0; $i--) {
$result[] = new HTMLPurifier_Token_End($skipped_tags[$i]->name);
}
}
// we're at the end now, fix all still unclosed tags
// not using processToken() because at this point we don't
// care about current nesting
if (!empty($current_nesting)) {
$size = count($current_nesting);
for ($i = $size - 1; $i >= 0; $i--) {
$result[] =
new HTMLPurifier_Token_End($current_nesting[$i]->name);
}
}
$context->destroy('CurrentNesting');
$context->destroy('InputTokens');
$context->destroy('InputIndex');
$context->destroy('OutputTokens');
$context->destroy('Injector');
$context->destroy('CurrentInjector');
$context->destroy('InjectorSkip');
return $result;
}
function processToken($token, $config, &$context) {
if (is_array($token)) {
// the original token was overloaded by an injector, time
// to some fancy acrobatics
$tokens =& $context->get('InputTokens');
$tokens_index =& $context->get('InputIndex');
// $tokens_index is decremented so that the entire set gets
// re-processed
array_splice($tokens, $tokens_index--, 1, $token);
// adjust the injector skips based on the array substitution
$injector_skip =& $context->get('InjectorSkip');
$current_injector =& $context->get('CurrentInjector');
$offset = count($token) + 1;
for ($i = 0; $i <= $current_injector; $i++) {
$injector_skip[$i] += $offset;
}
} elseif ($token) {
// regular case
$result =& $context->get('OutputTokens');
$current_nesting =& $context->get('CurrentNesting');
$result[] = $token;
if ($token->type == 'start') {
$current_nesting[] = $token;
} elseif ($token->type == 'end') {
// theoretical: this code doesn't get run because performing
// the calculations inline is more efficient, and
// end tokens (currently) do not cause a handler invocation
array_pop($current_nesting);
}
}
}
}I hope it does 
Edit: In your code you have this:
Now I have to say that I don't like this approach, since for an array of size N, you'll be doing N isset()'s.
Edit: In your code you have this:
Code: Select all
for ($tokens_index = 0; isset($tokens[$tokens_index]); $tokens_index++)-
alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
Re: Can't wrap my head around it!
No. Although I understand what you mean, we are of different personalities I guess. Like a ringing phone drives me absolute ballistic...not understanding how code works does the same thing. I couldn't continue until the problem is understood.Ambush Commander wrote:Have you ever written some code that, outwardly, works, but you can't figure out why it works? It just does?
I just finished working on some code that is like that. I couldn't wrap my head around the problem, so I decided to go off what I knew worked, extended it a little, twiddled around with the variables, and now it works. It shouldn't work, but it does. And I'm worried there's some deep logic bug that my testing hasn't poked at yet.
Argh.
The way I see it, if you think there might be something deep within it's core that is likely going to break in the future....chances are you are correct so go back and figure it out thouroughly before you continue with anything else. Put your life on the line for the code you write like it's being used to navigate the shuttle, otherwise your being complacent.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
I'm under the impression that isset is a cheap function to call (it's not really a function at all, it's a language construct). Otherwise, I would have to keep track of the expanding size of the $tokens array, since it is being edited while things are going on.Now I have to say that I don't like this approach, since for an array of size N, you'll be doing N isset()'s.
The code is blobby because I want it to run fast. You'll notice how I cleverly avoid calling userland functions unless it is absolutely necessary. If I had it my way, processTokens() would be a macro inlined at parse-time, not a function.I tend to find that if the code is a "blob" script, that is inadvertantly a big mess of procedural code, yes, I struggle to understand it - too much going on for my mind to track. However, if the code base is well structured objects, and is well refined - No, I don't have any problems
I rationalize this problem by claiming that the output of the function is well-defined and easy to test, so as long as you have a comprehensive unit test battery everything is hunky-dory.
So if I found a problem, it's all good.The way I see it, if you think there might be something deep within it's core that is likely going to break in the future....chances are you are correct so go back and figure it out thouroughly before you continue with anything else. Put your life on the line for the code you write like it's being used to navigate the shuttle, otherwise your being complacent.