Intelligent caching of runtime data?

Posted: Sun Nov 26, 2006 6:50 am
by Chris Corbyn
Bear with me here. What I'm asking is not actually complicated, it's just tricky to explain.

I'm facing a situation where I need to come up with some really smart way of caching pre-computed variables on many levels.

To explain the situation I have a MIME document built up like this (I'm sure you'll know what I'm on about):

I'll demonstrate with XML so it's easier to follow, although it's not XML I actually use.

Code: Select all

<document>
    <part>
        <headers>
        </headers>
        <body>
            Some data
            <part>
                <headers>
                </headers>
                <body>
                    More data
                    <part>
                        <headers>
                        </headers>
                        <body>
                            Even more data
                        </body>
                    </part>
                </body>
            </part>
        </body>
    </part>
</document>
It's not complicated, it's just the same structure repeatedly nested one inside the other. In the above, there are basically 3 "sub-documents" represented by <part />. They all have exactly the same structure (headers and a body) but their data will vary, as will the contents of the headers and body.

Processing the output of each of these "parts" is handled separately, in a sort of recursive fashion, until the whole tree structure has been built. The way that works is that a "part" object can contain other "part" objects, and they all have a build() method.

Because processing may be fairly intensive in order to maintain compliance with various RFCs, I have decided to "cache" each part in a property once it's been computed the first time. If any subsequent changes are made at runtime, the cache is set back to null so the part has to be computed again; otherwise the cache is returned at the next request rather than processing all the data again.

I cache each "part" independently. The headers are also cached independently. This makes it easier to target any specific "parts" or headers which need to be processed again without having to process the entire document again.

My problem is optimizing the memory I'm using here. Currently I return the cached data by reference everywhere so it's not repeatedly copied into memory BUT I'm still not happy that I'm saving as much memory as I can (and this isn't premature optimization... these "parts" can be several MB).

The problem comes with the fact that each part is cached, then its parent's cache contains the same data with a few extra bits tagged on; and in turn the parent's parent's cache contains everything cached below it. So I'm duplicating what's cached.

Can I pick your brains on a clever way to optimize the cache? This needs to be done in memory... no dumping to files or a DB.

Essentially, the question is can I cache substrings along with the parent string whilst only using the amount of memory the parent string uses?

A simple problem following the same logic would be to have this:

Code: Select all

<?php

error_reporting(E_STRICT);

class StringMaker
{
    protected $myCache = null;
    protected $fullCache = null;
    protected $append = array();
    protected $label = null;

    public function __construct($label)
    {
        $this->setLabel($label);
    }

    public function setLabel($label)
    {
        $this->myCache = null; //Changes made so reset cache
        $this->label = (string) $label;
    }
    
    protected function getSomeComputedData()
    {
        return uniqid($this->label, true);
    }
    
    public function &fetch()
    {
        $add = "";
        foreach ($this->append as $string)
        {
            $substring = $string->fetch(); //Could be cached, but we don't worry, we let the part deal with that
            $add .= " " . $substring;
        }

        if ($this->myCache !== null)
        {
            $this->fullCache = $this->myCache . $add;
            return $this->fullCache;
        }
                
        $this->myCache = $this->getSomeComputedData();
        $this->fullCache = $this->myCache . $add;
        return $this->fullCache;
    }

    public function add(StringMaker $string)
    {
        $this->append[] = $string;
    }
}

$parent_str = new StringMaker("foo");
$child_str = new StringMaker("bar");
$parent_str->add($child_str);

echo "<br />First computation:<br />";
echo $parent_str->fetch();

echo "<br />No changes, second computation:<br />";
echo $parent_str->fetch();

$parent_str->setLabel("xxx");

echo "<br />Changed label in parent, third computation:<br />";
echo $parent_str->fetch();

$child_str->setLabel("yyy");

echo "<br />Changed label in child, but not in parent, fourth computation:<br />";
echo $parent_str->fetch();

Code: Select all

First computation:
foo45698d3e75b659.92261672 bar45698d3e75b538.94718683
No changes, second computation:
foo45698d3e75b659.92261672 bar45698d3e75b538.94718683
Changed label in parent, third computation:
xxx45698d3e75b7e6.26268698 bar45698d3e75b538.94718683
Changed label in child, but not in parent, fourth computation:
xxx45698d3e75b7e6.26268698 yyy45698d3e75b8a0.38707792
Notice how only the bits which have been changed are re-computed? But the number of places I'm storing the data is just silly.

I'm still thinking so if I come up with any ideas I'll post back :)

Posted: Sun Nov 26, 2006 7:28 am
by feyd
Is it expensive to traverse the tree?

Posted: Sun Nov 26, 2006 7:34 am
by Chris Corbyn
feyd wrote:Is it expensive to traverse the tree?
Not if each part doesn't need to be "built" all over again, no. Traversing the tree purely to grab what's already computed for that branch would be fine :)

Perhaps you're thinking along the lines of what I'm considering, although I'm not sure how it will work just yet. I was thinking of making the cache a separate object which contains a multi-dimensional array, each level of which represents a branch in the tree; when you glue it all together you have the full document. This way you can modify a branch independently, and because the cache is not contained in each object there's no duplication of data. Sounds a bit tricky but fun :)

Was that where you were heading?

EDIT | Ok, I make a static property in the "Part" class which increments each time it's instantiated. This can be an ID. The branches in the cache can then be represented by this ID as the key in the array. If the value for a key is null, the object with that ID must be re-built; otherwise the cached data is returned.
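To make the EDIT above concrete, here's a minimal sketch of the ID-keyed external cache. All names here (PartCache, Part::$nextId, the strtoupper() stand-in for real MIME processing) are illustrative assumptions, not the actual code:

```php
<?php

// Hypothetical external cache keyed by part ID, sketching the idea
// above. Not the real implementation.

class PartCache
{
    protected $branches = array();

    public function store($id, $data)
    {
        $this->branches[$id] = $data;
    }

    public function fetch($id)
    {
        return isset($this->branches[$id]) ? $this->branches[$id] : null;
    }

    public function invalidate($id)
    {
        unset($this->branches[$id]); //this branch must be re-built
    }
}

class Part
{
    protected static $nextId = 1;
    protected $id;
    protected $label;
    protected $cache;

    public function __construct($label, PartCache $cache)
    {
        $this->id = self::$nextId++; //unique ID per instance
        $this->label = $label;
        $this->cache = $cache;
    }

    public function setLabel($label)
    {
        $this->label = $label;
        $this->cache->invalidate($this->id); //force a rebuild next time
    }

    public function build()
    {
        $built = $this->cache->fetch($this->id);
        if ($built === null)
        {
            $built = strtoupper($this->label); //stand-in for real processing
            $this->cache->store($this->id, $built);
        }
        return $built;
    }
}

$cache = new PartCache();
$part = new Part("foo", $cache);
echo $part->build(), "\n"; //computed
echo $part->build(), "\n"; //served from the cache
$part->setLabel("bar");
echo $part->build(), "\n"; //recomputed after the change
```

Because the cached strings live only in the shared PartCache, no part's data is duplicated up through its ancestors.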

Posted: Sun Nov 26, 2006 7:50 am
by feyd
I was considering something slightly different, but your idea will likely work better. Separating them would allow a caching mechanism to be a plug-in to the tree; that could also lead to transformation objects being plug-ins, along with other things that manipulate the data.
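One rough way to picture that plug-in idea: the tree only knows about a small interface, so a cache, a transformer, or anything else can be slotted in. The interface and class names below are hypothetical, purely for illustration:

```php
<?php

// Sketch of the plug-in idea: anything implementing PartPlugin can be
// attached to the tree. Names are made up for this example.

interface PartPlugin
{
    //receives a part's built output, returns the result to use instead
    public function filter($partId, $data);
}

class UppercaseTransformer implements PartPlugin
{
    public function filter($partId, $data)
    {
        return strtoupper($data); //a trivial transformation
    }
}

class MemoryCache implements PartPlugin
{
    protected $store = array();

    public function filter($partId, $data)
    {
        $this->store[$partId] = $data; //remember what was built
        return $data;
    }

    public function fetch($partId)
    {
        return isset($this->store[$partId]) ? $this->store[$partId] : null;
    }
}

//the tree would run each part's output through its plug-in chain
$plugins = array(new UppercaseTransformer(), new MemoryCache());
$data = "some part body";
foreach ($plugins as $plugin)
{
    $data = $plugin->filter(1, $data);
}
echo $data, "\n";
```

The cache sits last in the chain here, so it stores the already-transformed output; reordering the chain changes what gets cached.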

Posted: Sun Nov 26, 2006 7:59 am
by Chris Corbyn
While I'm thinking about that separate cache, it's just occurred to me that maybe I've already got that sort of structure with the nested objects. If I drop the fullCache and just return the value, PHP's garbage collection will get rid of whatever the child used in computation. I think at most I'll be using twice the amount of memory of the full document (one copy for the return value of the fully built structure, and one for the total size of all the cached parts combined). That may not be too bad... I'll have to do some XDebug profiling to get a better idea. I'll see how I go here :)

With previous versions of what I'm doing I think I was using ~4x the amount of memory so this will still be a huge improvement :)

Posted: Sun Nov 26, 2006 10:47 am
by Chris Corbyn
Ermm, there's something whacky with my implementation in the actual code.

I removed the cache completely and the build got 4x faster. Without the cache it uses barely any less memory (about 5% less), yet it takes a quarter of the time to build. Hmm....

It's working out at roughly 0.01 seconds to build a document with a plain-text part, a set of headers, and a 250KB attachment, base64 encoded. Considering the time it will then take to pass that over a network, I'm not going to cry about it :) I guess this whole thread was a little pointless then :P ;)