Updated “Metal Introduction” Document.

I’ve added a section on showing the processes in Metal for implementing Constructive Solid Geometry on the fly using the algorithms outlined in the paper An improved z-buffer CSG rendering algorithm, with example code uploaded to GitHub.

My goal is to eventually turn this into a CSG library for Metal.

The updated paper (with the additional section) can be downloaded from here: Metal: An Introduction, and updates the paper from my prior post.

And when you put it all together you should get:

ScreenShot2

OCTools

OCTools is a suite of tools which serve as a plug-in replacement for yacc and lex. The goal
is to provide a (roughly) source compatible tool which can convert yacc and lex
grammars into Objective C output for building parsers that run on MacOS and iOS.

The source kit can be found on GitHub.

Both tools can be compiled on the Macintosh using Xcode, and generate a command-line tool which can be run from the terminal or included into an Xcode project. (Both tools are built in C, so they should be portable to other platforms; however, I haven’t done the port so I’m not sure how successful porting would be.)

The goal of this project was to create Yacc and Lex analogs which generate re-entrant Objective C classes which implement the parsers, and for the Lex analog to use an Objective-C protocol definition for the input file, for maximum flexibility.


More information can be found here: OCTools.

Constructing a cyclic polygon given the edge lengths.

So I have a problem: given the length of the edges L = {l0, l1, … lN} of an N-sided irregular polygon with N > 3, I need to construct the values for the radius R and the angles A = {a0, a1, … aN} such that the points P = { R, ai } form a closed polygon with the lengths given.

I searched through the internet and found all sorts of articles on the subject, but a day of searching and I was unable to find a way that I could construct the values R and A. I’m sure it’s out there, but I figured it’d be a good exercise to do it myself.

Based on my reading apparently there is no current solution to the problem–so I went ahead and built a simple algorithm to construct the values. The idea is outlined below.


Before we continue, we assume the longest length of our polygon is l0. We can rotate the edges of the polygon if this is not the case.

Observation 1: The vertices P = { p0, p1, … pN } of our cyclic polygon may be described by the values pi = ( R, ai ) in polar coordinates:

Circumpolar

By observation, the sum of the angles A is 360°.

Hypothesis 1: For any set of edge lengths L, there is only one set of values for R and A which create a cyclic polygon.

Sketch of proof: By construction.

Hypothesis 2: In constructing a polygon with lengths L, we have two cases for the centroid of the circumscribed circle that encompasses the polygon. Either the centroid is inside the polygon:

CenterInside

Or the centroid is outside the polygon.

CenterOutside

If the centroid of the circumscribed circle is outside the polygon, it is outside the edge with the largest length, and it is only outside one of the edges.

Sketch of proof: If the centroid of the circumscribed circle is outside of one of the edges, the edge has the property that the associated angle ai has an angle greater than 180°. If the centroid is outside two edges, this implies the sum of the angles exceeds 360°, violating Observation 1 above.

Hypothesis 3: The value of R must be greater than l0/2, where l0 is the length of the longest edge.

Sketch of proof: The lower bound of R is determined by an edge which passes directly through the center of the circle. If the edge does not pass through the center of the circle, then the diameter must be larger than the specified edge.

Observation 2: Given an edge in a circumscribed polygon with length li and a radius R, the angle associated with the edge ai can be calculated as:

NewImage

For an edge where the center of the circumscribed circle is outside the edge, the angle a is between 180° and 360°.

Sketch of proof: Law of cosines.


With these observations we could construct our polygon by searching for the proper value of R.

The essence of our algorithm is to search for a value R such that the following function is zero:

NewImage

One issue we need to deal with is calculating the angle on the longest line segment of our polygon. We can handle this case by starting with the angle a0; the idea is that we know a0 is in the range (0,360). From the angle a0 we can calculate r using the formula:

NewImage

Sketch of proof: Geometric observation.

We now can construct a function F(a) = Σai, by calculating the angles of all of the edges in our polygon for a proposed radius r, calculated using the value a as the angle a0. (If the angle is greater than 180°, this fits hypothesis 2 above.)

Hypothesis 5: The function F(a) is monotonic on the value a.

Sketch of proof: A geometric argument; as R increases, the angles decrease, while as R decreases, the angles increase. The increase or decrease in angles are monotonic, so the sum is monotonic. The function R relating to a is also monotonic.

Hypothesis 6: The function F(a) – 360 has one zero.

Sketch of proof: Trivial observation based on the above hypothesis.


We can now sketch the code which searches for the zero given a value a. We can use a number of computational methods to search for the zero of F(a) – 360; in our code below we use a simple binary subdivision search to search for a with 0 < a < 360°.


Preliminary functions

Given an object with the following declaration:

#define EPSILON		1e-6			/* A very small value. */

@interface KTCyclicPolygon ()
{
    NSInteger n;            // # edges
    NSInteger *edges;       // length of edge

    NSInteger ledge;        // index of maximum length edge
}

We can implement our algorithm by first, finding the edge that has the maximum length:

- (NSInteger)largestEdge
{
    NSInteger i = 0;
    NSInteger len = edges[0];

    for (NSInteger j = 1; j  len) {
            len = edges[j];
            i = j;
        }
    }
    return i;
}

We need a way to calculate the angle of an edge given a radius:

- (double)angleAtIndex:(NSInteger)index radius:(double)r
{
    // Calculate a from observation 2

    double l = edges[index];
    double c = 1 - (l*l)/(2*r*r);
    return acos(c);         // output from 0 to M_PI
}

Then we can calculate f(a) above:

- (double)f:(double)a
{
    /*
     *  Calculate r from a.
     */

    double l = edges[ledge];
    double m = l / 2;
    double r = m / sin(a/2);

    /*
     *  Calculate the angles and sum from 1 to N
     */

    double sum = a;

    /*
     *  Sum the angles from 1 to N
     */

    for (NSInteger i = 0; i < n; ++i) {
        if (i == ledge) continue;
        sum += [self angleAtIndex:i radius:r];
    }

    /*
     *  Now subtract from our zero.
     *
     *      If the angle is small (meaning we selected too small a start angle)
     *  then this value will be negative. If we picked too large a start
     *  angle (meaning a was too big), then the sum will be large and the
     *  value will be positive.
     */

    return sum - 2*M_PI;                // Return value is angle gap to close.
}

At this point we can easily calculate a value a between 0 and 360°:

- (BOOL)calculateCyclicPolygon
{
    if (n <= 3) return NO;      // Need a 4 sided object or larger

    /*
     *  Step 1: Find the largest edge
     */

    ledge = [self largestEdge];

    /*
     *  Step 2: given a min/max value of 0, 360, find a between those values
     *  which approach our error
     */

    double mina = 0;
    double maxa = M_PI*2;
    double mida;

    for (;;) {
        mida = (mina + maxa)/2;

        double f = [self f:mida];

        if (fabs(f) < EPSILON) break;       // a solves our solution
        if (f < 0) {
            // Angle too small; we need a larger angle
            mina = mida;
        } else {
            // Angle too large; we need a smaller angle
            maxa = mida;
        }
    }

    double a = mida;                        // to within epsilon.

At this point the angle a has been found to within the error specified in our constant EPSILON.

We can validate the correctness of our algorithm by calculating the radius, and the angles for the rest of the items. We can then calculate the polar coordinates of each of the edges and compare the lengths with the stored edge lengths:

    double l = edges[ledge];
    double m = l / 2;
    double r = m / sin(a/2);

    double x = r;
    double y = 0;
    double angle = 0;
    for (NSInteger i = 0; i < n; ++i) {
        if (i == ledge) {
            angle += a;
        } else {
            angle += [self angleAtIndex:i radius:r];
        }
        double xn = cos(angle) * r;
        double yn = sin(angle) * r;

        double dx = xn - x;
        double dy = yn - y;
        double d = sqrt(dx * dx + dy * dy);

        printf("Vertex: %g %gn",x,y);
        printf("  Edge %d: %d == %gn",(int)i,(int)edges[i],d);

        x = xn;
        y = yn;
    }
    printf("Vertex: %g %gn",x,y);
}

When run this will find the initial angle a which converges to our solution. Running with input values 1, 2, 3 and 4, we get the output:

Vertex: 2.0026 0
  Edge 0: 1 == 1
Vertex: 1.75293 0.96833
  Edge 1: 2 == 2
Vertex: 0.0408695 2.00219
  Edge 2: 3 == 3
Vertex: -1.9922 -0.203859
  Edge 3: 4 == 4
Vertex: 2.0026 -3.40378e-07

This shows our vertices define a closed cyclic polygon with edges of lengths 1, 2, 3 and 4–to within our error tolerance.

Finding the boundary of a one-bit-per-pixel monochrome blob

Recently I’ve had need to develop an algorithm which can find the boundary of a 1-bit per pixel monochrome blob. (In my case, each pixel had a color quality which could be converted to a single-bit ‘true’/’false’ test, rather than a literal monochrome image, but essentially the problem set is the same.)

In this post I intend to describe the algorithm I used.


Given a 2D monochrome image with pixels aligned in an X/Y grid:

Arbitrary 2D Blob

We’d like to find the “boundary” of this blob. For now I’ll ignore the problem of what the “boundary” is, but suffice to say that with the boundary we can discover which pixels are inside the blob which border the exterior of the blob, which pixels on the outside border the pixels inside the blob, or to find other attributes of the blob such as a rectangle which bounds the blob.

If we treat each pixel as a discrete 1×1 square rather than a dimensionless point, then I intend to define the boundary as the set of (x,y) coordinates which define the borders around each of those squares, with the upper-left corner of the square representing the coordinate of the point:

Pixel Definition

So, in this case, I intend to define the boundary as the set of (x,y) coordinates which define the border between the black and the white pixels:

outlined blob

We do this by walking the edge of the blob through a set of very simple rules which effectively walk the boundary of the blob in a clock-wise fashion. This ‘blob walking’ only requires local knowledge at the point we are currently examining.

Because we are walking the blob in a clock-wise fashion, it is easy to find the first point in a blob we are interested in walking: through iteratively searching all of the pixels from upper left to lower right:

(Algorithm 1: Find first point)

-- Given: maxx the maximum width of the pixel image
          maxy the maximum height of the pixel image

   Returns the found pixel for an eligable blob or 'PixelNotFound'.

for (int x = 0; x < maxx; ++x) {
    for (int y = 0; y < maxy; ++y) {
        if (IsPixelSet(x,y)) {
            return Pixel(x,y);
        }
    }
}
return PixelNotFound;

Once we have found our elegable pixel, we can start walking counter clockwise, tracking each of the coordinates we have found as we walk around the perimeter of the blob, traversing either horizontally or vertically.


Given the way we’ve encountered our first pixel in the algorithm above, the pixels around the immediate location at (x,y) looks like the following:

Point In Center

That’s because the way we iterated through, the first pixel we encountered at (x,y) implies that (x-1,y), (x,y-1) and (x-1,y-1) must be clear. Also, if we are to progress in a clock-wise fashion, clearly we should move our current location from (x,y) to (x+1,y):

Blob5


The design of our algorithm proceeds in a similar fashion: by examining each of the 16 possible pixel configurations we can find in the surrounding 4 pixels, and tracking the one of 4 possible incoming paths we take, we can construct a matrix of directions in which we should take to continue progressing in a clockwise fashion around our eligible blob. And in all but two configuration cases, there was only one possible incoming path we could have taken to get to the center point, since as we presume we are following the edge of the blob we could not have entered between two black pixels or between two white pixels. Some combinations are also illegal since we presume we are walking around the blob in a clock-wise fashion rather than in a counter-clockwise fashion. (This means that we should be, when standing at the point location, pixels should be on the right and never on the left. There is a proof for this which I will not sketch here.)

The 16 possible configurations and the outgoing paths we can take are illustrated below:

All directions

Along the top of this table shows the four possible incoming directions: from the left, from the top, from the right and from the bottom. Each of the 16 possible pixel combinations are shown from top to bottom, and blanks indicate where an incoming path was illegal–either because it comes between two blacks or two whites, or because the path would have placed the black pixel on the left of the incoming line rather than on the right.

Note that with only two exceptions each possible combination of pixels produces only one valid outgoing path. Those two exceptions we arbitrarily pick an outgoing of two possible paths which excludes the diagonal pixel; we could have easily gone the other way and included the diagonal, but this may have had the property of including blobs with holes. (If this is acceptable or not depends on how you are using the algorithm.)

This indicates that we could easily construct a switch statement, converting each possible row into an integer from 0 to 15:

Algorithm 2: converting to a pixel value.

-- Given: IsPixel(x,y) returns true if the pixel is set and false 
          if it is not set or if the pixel is out of the range from
          (0,maxx), (0,maxy)
   Return an integer value from 0 to 15 indicating the pixel combination

int GetPixelState(int x,y int y)
{
    int ret = 0;
    if IsPixel(x-1,y-1) ret |= 1;
    if IsPixel(x,y-1) ret |= 2;
    if IsPixel(x-1,y) ret |= 4;
    if IsPixel(x,y) ret |= 8;
    return ret;

}

We now can build our switch statement:

Algorithm 3: Getting the next pixel location

-- Given: the algorithm above to get the current pixel state,
          the current (x,y) location,
          the incoming direction dir, one of LEFT, UP, RIGHT, DOWN

   Returns the outgoing direction LEFT, UP, RIGHT, DOWN or ILLEGAL if the
   state was illegal.

Note: we don't test the incoming path when there was only one choice. We
could, by adding some complexity to this algorithm, for testing purposes.
The values below are obtained from examining the table above.

int state = GetPixelState(x,y);
switch (state) {
    case 0:    return ILLEGAL;
    case 1:    return LEFT;
    case 2:    return UP;
    case 3:    return LEFT;
    case 4:    return DOWN;
    case 5:    return DOWN;
    case 6:    {
                   if (dir == RIGHT) return DOWN;
                   if (dir == LEFT) return UP;
                   return ILLEGAL;
               }
    case 7:    return DOWN;
    case 8:    return RIGHT;
    case 9:    {
                   if (dir == DOWN) return LEFT;
                   if (dir == UP) return RIGHT;
                   return ILLEGAL;
               }
    case 10:   return UP;
    case 11:   return LEFT;
    case 12:   return RIGHT;
    case 13:   return RIGHT;
    case 14:   return UP;
    case 15:   return ILLEGAL;
}

From all of this we can now easily construct an algorithm which traces the outline of a blob. First, we use Algorithm 1 to find an eligible point. We then set ‘dir’ to UP, since the pixel state we discovered was pixel state 8, and the only legal direction had we been tracing around the blob was UP.

We store away the starting position (x,y), and then we iterate through algorithm 3, getting new directions and updating (x’,y’), traversing (x+1,y) for RIGHT, (x,y+1) for UP, (x-1,y) for LEFT and (x,y-1) for DOWN, until we arrive at our starting point.

As we traverse the parameter, we could either build an array of found (x,y) values, or we could do something else–such as maintaining a bounding rectangle, bumping the boundaries as we find (x,y) values that are outside the rectangle’s boundaries.

Something Funny Happened To Me On The Way To Release.

So I started playing with parsing Java class files, creating a cross compiler capable of converting Java class files into Objective C files. I even had a sufficient amount of Apache Harmony running so I could use a good part of the java.lang and java.util classes; roughly in parity with the GWT cross compiler that can compile Java class files into Javascript.

Then Apple dropped the “no cross compiling” bombshell.

Now, keep in mind that I’m just me, tinkering on my spare time during weekends. I don’t have the desire or the time to go up against Apple. I’d rather allow the XMLVM project (which has a well established ecosystem, or so it seems) to decide to go (or not go) against Apple’s wishes.

Then time went by, and I sort of lost interest in this thing.

So I’ve taken the liberty to post the source code here: the Java to Objective C Compiler sources, and the J2OC RTL, which contains a subset of the Apache Harmony project, and implementing the java.lang and java.util classes.

It’s been an interesting project, and hopefully in the next few weeks I’ll document how this all works–including the wierdnesses and pitfalls I came across with the Java VM to get Apache Harmony to work. (Nothing like working through a very large collection of class files to find all the fringe cases.) The output code was intended to be human readable–but it really isn’t for some expressions.

But I’ll describe that in the next few weeks.

And at some point I’ll post an example iPhone application which includes Java code.

Note that my approach was different than the XMLVM project. Instead of providing Java bindings of the iOS libraries, my intent was to only allow the compilation of a computational kernel, then have the user provide the UI elements separately for Android, the iPhone, the iPad, and whatever other target the code was to compile for.

So you won’t find a turn-key solution for recompiling Android code and have it run on the iPhone. You should really check out the XMLVM project instead.

All this code, by the way, is being published under a BSD style license: go ahead and use the code, but leave me out of it and don’t blame me if it goes haywire.

separator.png

While I don’t intend to get into the functioning of the compiler, I will give a taste of how the code works. The bulk of the .class file parser, which reads and loads the .class file data into memory, is contained in the class ClassFile in com.chaosinmotion.j2oc.vm. This class takes in its constructor an input stream opened to the first byte of a .class file, and loads the entire class file into memory.

Once read, the entire class file can be accessed using the getters associated with that class. The bulk of the code contained inside the .vm (and subpackages within .vm) are used to represent the contents of the class file. The .vm.data classes contain the various data types used to store the meta data within a class file (such as the method names, the attributes fields, and the like), and the .vm.code classes contain a code parser to convert the code within the .class files into an array of processed instructions.

Once the instructions are parsed (by the vm.code.Code class), the code in a method is represented as an array of code segments; a run of instructions that starts with an instruction first jumped into by another instruction, and terminates with either the end of the method or with a jump instruction. In other words, a CodeSeg (Code.CodeSeg class) is a section of instructions that always enters at the first instruction and executes sequentially to the last instruction in the segment. Additional information, such as the list of variables that are used when the segment is entered are noted; this is the current state of the Java operator stack as this segment is entered.

Ultimately the code parser and class file reader represents the code in a .class file in memory in an intermediate state that can then be used to write Objective C with the WriteOCMethod class (com.chaosinmotion.j2oc.oc). A class, CodeOptimize (.oc package) provides utilities that determine if code preambles must be written for memory management or for exception handling: memory management preamble does not need to be written if I never invoke another method. (This is the case for simple functions which return a field or does simple math.)

The theory is that in practice, it should be possible to replace the code writer method with a writer method capable of writing a different language, such as C++ or C.

separator.png

In the future, when I have more time, I’ll write more about the J2OC project. But for now, if there are any segments or parts you want to use or play with, be my guest.

Goodbye Far Clipping Plane.

I really wanted to write this up as a paper, perhaps for SigGraph. But I’ve never submitted a paper before, and I don’t know how worthy this would be of a SigGraph paper to begin with. So instead, I thought I’d write this up as a blog post–and we’ll see where this goes.

Introduction

This came from an observation that I remember making when I first learned about the perspective transformation matrix in computer graphics. See, the problem basically is this: the way the perspective transformation matrix works is to convert from model space to screen space, where the visible region of screen space goes from (-1,1) in X, Y and Z coordinates.

In order to map from model space to screen space, typically the following transformation matrix is used:

perspective.gif

(Where fovy is the cotangent of the field of view angle over 2, aspect is the aspect ration between the vertical and horizontal of the viewscreen, n is the distance to the near clipping plane, and f is the distance to the far clipping plane.)

As objects in the right handed coordinate space move farther away from the eye, the value of z increases to -∞, and after being transformed by this matrix, as our object approaches f, zs approaches 1.0.

Now one interesting aspect of the transformation is that the user must be careful to select the near and far clipping planes: the greater the ratio between far and near, the less effective the depth buffer will be.

If we examine how z is transformed into zs screen space:

derivematrix.gif

And if we were to plot values of negative z to see how they land in zs space, for values of n = 1 and f = 5 we get:

zpersgraph.png

That is, as a point moves closer to the far clipping plane, zs moves closer to 1, the screen space far clipping plane.

Notice the relationship as we move closer to the far clipping plane, the screen space depth acts as 1/z. This is significant when characterizing the accuracy of the representation of an object’s distance and the accuracy of the zs representation of that distance for drawing purposes.

If we wanted to eliminate the far clipping plane, we could, of course, derive the terms of the above matrix as f approaches ∞. In that case:

farclipinf1.gif

And we have the perspective matrix:

persmatrix2.gif

And the transformation from z to zs looks like:

zpersgraph2.png

IEEE-754

There are two ways we can represent a fractional numeric value. We can represent it as a fixed point value, or we can use a floating point value. I’m not interested here with a fixed point representation, only with a floating point representation of numbers in the system. Of course not all implementations of OpenGL support floating point mathematics for representing values in the system.

An IEEE 754 floating point representation of a number is done by representing the fractional significand of a number, along with an exponent.

ieee754.gif

Thus, the number 0.125 may be represented with the fraction 0 and the exponent -3:

ieee754ex.gif

What is important to remember is that the IEEE-754 representation of a floating point number is not accurate, but contains an error factor, since the fractional component contains a fixed number of bits. (23 bits for a 32-bit single-precision value, and 52 bits for a 64-bit double-precision value.)

For values approaching 1, the error in a floating point value is determined by the number of bits in the fraction. For a single-precision floating point value, the difference from 1 and the next adjacent floating point value is 1.1920929E-7, which means that as numbers approach 1, the error is of order 1.1920929E-7.

We can characterize the error in model space given the far clipping plane by reworking the formula to find the model space z based on zs:

zspacederive.png

We can then plot the error by the far clipping plane. If we assume n = 1 and zs = 1, then the error in model space zε for objects that are at the far clipping plane can be represented by:

zerrorderive.gif

Graphing for a single precision value, we get:

zerror.png

Obviously we are restricted on the size of the far clipping plane, since as we approach 109, the error in model space grows to the same size as the model itself for objects at the far clipping plane.

Clearly, of course, setting the far clipping plane to ∞ means almost no accuracy at all as objects move farther and farther out.

The reason for the error, of course, has to do with the representation of the number 1 in IEEE-754 mathematics. Effectively the exponent value for the IEEE-754 representation is fixed to 2-1 = 0.5, meaning as values approach 1, the fractional component approaches 2: the number is effectively a fixed-point representation with 24 bits of accuracy (for a single-precision value) from 0.5 to 1.0.

(At the near clipping plane the same can be said for values approaching -1.)

separator.png

All values in the representation range of IEEE-754 points have the same feature: as we approach the value, the representation is similar to if we had picked a fixed-point representation with 24 (or 53) bits. The only value in the IEEE-754 range which actually exhibits declining representational error as we approach that value is zero.

In other words, for values 1-ε, accuracy is fixed to the number of bits in the fractional component. However, for values of ε approaching 0, the exponent can decrease, allowing the full range of bits in the fractional component to maintain the accuracy of values as we approach zero.

With this observation we could in theory construct a transformation matrix which can set the far clipping plane to ∞. We can characterize the error for a hypothetical algorithm that approaches 1 (1-1/z) and one that approaches 0 (1/z):

zerrors.png

Overall, the error in model space of 1-1/z approaches the same size as the actual distance itself in model space as the distance grows larger: err/z approaches 1 as z grows larger. And the error grows quickly: the error is as large as the position in model space for single precision values as the distance approaches 107, and the error approaches 1 for double precision values as z approaches 1015.

For 1/z, however, the ratio of the error to the overall distance remains relatively constant at around 10-7 for single precision values, and around 10-16 for double-precision values. This suggests we could do away without a far clipping plane; we simply need to modify the transformation matrix to approach zero instead of 1 as an object goes to ∞.

Source code:

The source code for the above graph is:

public class Error
{
    public static void main(String[] args)
    {
        double z = 1;
        int i;
        
        for (i = 0; i < 60; ++i) {
            z = Math.pow(10, i/3.0d);
            
            for (;;) {
                double zs = 1/z;
                double zse = Double.longBitsToDouble(Double.doubleToLongBits(zs) - 1);
                double zn = 1/zse;
                double ze = zn - z;

                float zf = (float)z;
                float zfs = 1/zf;
                float zfse = Float.intBitsToFloat(Float.floatToIntBits(zfs) - 1);
                float zfn = 1/zfse;
                float zfe = zfn - zf;

                double zs2 = 1 - 1/z;
                double zse2 = Double.longBitsToDouble(Double.doubleToLongBits(zs2) - 1);
                double z2 = 1/(1-zse2);
                double ze2 = z - z2;

                float zf2 = (float)z;
                float zfs2 = 1 - 1/zf2;
                float zfse2 = Float.intBitsToFloat(Float.floatToIntBits(zfs2) - 1);
                float zf2n = 1/(1-zfse2);
                float zfe2 = zf2 - zf2n;
                
                if ((ze == 0) || (zfe == 0)) {
                    z *= 1.00012;   // some delta to make this fit
                    continue;
                }

                System.out.println((ze/z) + "t" + 
                        (zfe/zf) + "t" + 
                        (ze2/z) + "t" + 
                        (zfe2/zf));
                break;
            }
        }
        
        for (i = 1; i < 60; ++i) {
            System.out.print(""1e"+(i/3) + "",");
        }
    }
}

We use the expression Double.longBitsToDouble(Double.doubleToLongBits(x)-1) to move to the previous double precision value (and the same with Float for floating point values), repeating (with a minor adjustment) in the event that floating point error prevents us from propery calculating the error ratio at a particular value.

A New Perspective Matrix

We need to formulate an equation for zs that crosses -1 as z crosses n, and approaches 0 as z approaches -∞. We can easily do this by the observation from the graph above: instead of calculating

zoldformula.gif

We can simply omit the 1 constant and change the scale of the 2n/z term:

znewformula.gif

This has the correct property that we cross -1 at z = -n, and approach 0 as z approaches -∞.

znewgraph.png

From visual inspection, this suggests the appropriate matrix to use would be:

newpersmatrix.gif

Testing the new matrix

The real test, of course, would be to create a simple program that uses both matrices, and compares the difference. I have constructed a simple program which renders two very large, very distance spheres, and a small polygon in the foreground. The large background sphere is rendered with a diameter of 4×1012 units in radius, at a distance of 5×1012 units from the observer. The smaller sphere is only 1.3×1012 units in radius, embedded into the larger sphere to show proper z order and clipping. The full sphere (front and back) are drawn.

The foreground polygon, by contrast, is approximately 20 units from the observer.

I have constructed a z-buffer rendering engine which renders depth using 32-bit single-precision IEEE-754 floating point numbers to represent zs. Using the traditional perspective matrix, the depth values become indistinguishable from each other, as their values approach 1. This results in the following image:

rendertest_image_err.png

Notice the bottom half of the sphere is incorrectly rendered, as is large chunks of the smaller red sphere.

Using the new perspective matrix, and this error does not occur in the final rendered product:

rendertest_image_ok.png

The code to render each is precisely the same; the only difference is the perspective matrix:

public class Main
{
    /**
     * @param args
     */
    public static void main(String[] args)
    {
        Matrix m = Matrix.perspective1(0.8, 1, 1);
        renderTest(m,"image_err.png");
        
        m = Matrix.perspective2(0.8, 1, 1);
        renderTest(m,"image_ok.png");
    }

    private static void renderTest(Matrix m, String fname)
    {
        ImageBuffer buf = new ImageBuffer(450,450);
        m = m.multiply(Matrix.scale(225,225,1));
        m = m.multiply(Matrix.translate(225, 225, 0));
        
        Sphere sp = new Sphere(0,0,-5000000000000d,4000000000000d,0x0080FF);
        sp.render(m, buf);
        
        sp = new Sphere(700000000000d,100000000000d,-1300000000000d,300000000000d,0xFF0000);
        sp.render(m, buf);
        
        Polygon p = new Polygon();
        p.color = 0xFF00FF00;
        p.poly.add(new Vector(-10,-3,-20));
        p.poly.add(new Vector(-10,-1,-19));
        p.poly.add(new Vector(0,0.5,-22));
        p = p.transform(m);
        p.render(buf);
        
        try {
            buf.writeJPEGFile(fname);
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Notice in the call to main(), we first get the traditional perspective matrix with the far clipping plane set to infinity, then we get the alternate matrix.

The complete sources for the rendering test which produced the above images, including custom polygon renderer, can be found here.

With this technique it would be possible to render correctly large landscapes with very distant objects without having to render the scene twice: once for distant objects and once for near objects. To use this with OpenGL would require adjusting the OpenGL pipeline to allow the far clipping plane to be set to 0 instead of 1 in zs space. This could be done with the glClipPlane call.

Conclusion

For modern rendering engines which represent the depth buffer using IEEE-754 (or similar) floating point representations, using a perspective matrix which converges to 1 makes little sense: as values converge to 1, the magnitude of the error is similar to that of a fixed-point representation. However, because of the nature of the IEEE-754 floating point representation, convergence to 0 has much better error characteristics.

Because of this, a new perspective matrix than the one commonly used should have better rendering accuracy, especially if we change the far clipping plane to ∞.

By using this new perspective matrix we have demonstrated a rendering environment using 32-bit single-precision floating point values for a depth buffer which is capable of representing in the same scene two objects whose size differs by 11 orders of magnitude. We have further shown that the error in representation of the zs depth over the distance of an object should remain linear–allowing us to have even greater orders of magnitude difference in the size of objects. (Imagine rendering an ant in the foreground, a tree in the distance, and the moon in the background–all represented in the correct size in the rendering system, rather than using painter’s algorithm to draw the objects in order from back to front.)

Using this matrix in a system such as OpenGL, for rendering environments that support floating point depth buffers, would be a matter of creating your own matrix (rather than using the built in matrix in the GLU library), and setting a far clipping plane to zs = 0 instead of 1.

By doing this, we can effectively say goodbye to the far clipping plane.

Addendum:

I’m not sure but I haven’t seen this anywhere else in the literature before. If anyone thinks this sort of stuff is worthy of SigGraph and wants to give me any pointers on cleaning up and publishing, I’d be greatful.

Thanks.

On Memory and Memory Management

This will be a bit of an introductory article on memory, written in part for my wife and for anyone else who may find such an introduction useful.

In The Beginning…

In the beginning there was RAM, random access memory. A microprocessor which could execute instructions would be attached to that RAM, which could be accessed by special instructions which operated on the contents of that memory.

Some microprocessors (for example, the Z-80) used special instructions which would use two 8-bit integer registers (such as register D and E) to form a 16-bit address into RAM, or would use special registers (the IX and IY index registers) to access memory. Others (such as the 68000 CPU) would have a bank of dedicated address registers which are used to access memory.

Writing software in this era was simple: you would define the location of various objects that would be stored in memory, and you would use those locations to access the stored data.

In today’s modern parlance all objects are fixed and global. There is no allocating objects or releasing those; that came later. Instead, there is just picking the size of the records (or structures) stored in memory, and making sure there is enough space left in RAM for your stack.

Since early microprocessors only had a very small and limited amount of memory, this is fine. Embedded development with toolchains such as the Small Device C Compiler, which can compile code for the HC08 series of CPUs, don’t generally provide any way to allocate memory; instead, you declare all of your objects as global structures.

separator.png

As an example, a program I’ve been tinkering with on the Arduino platform (and I really believe every high school student who wants to get into programming needs one) emulates a calculator with no memory. The memory declaration looks something like:

/*  Calculator Storage in RAM */
Double GDisplay;
Double GInput;
Double GScratch;
boolean GIsInput;
boolean GIsClear;

When compiled it will compile into an assembler statement that may look something like:

.EQU GDisplay = 0x0100
.EQU GInput = 0x0108
.EQU GScratch = 0x0110
.EQU GIsInput = 0x0118
.EQU GIsClear = 0x0119

Fixed allocation of memory structures also makes sense with certain types of game development, where performance and reliability of the code path is essential. For example, a 3D game such as the early game Descent could store for each level the size of the rendering map and the maximum number of records needed to process and render a level regardless of your location in the maze. Then, when the level loads, the proper amount of memory can be obtained using a function call similar to sbrk, which tells the operating system that you need a fixed amount of RAM, then partition the contents up yourself.

Various compiled games also use this technique, where the game is specified by a specialized game specification language. By being able to walk through all potential states in a game, it would be possible to determine the maximum memory footprint for that game, then request the fixed amount of memory for storage. This technique is used by Zork, amongst other things.

Fixed allocation of memory can also be used for certain high-performance applications where accuracy and speed and stability are paramount. For example, software handling incoming click requests in a search advertising system may wind up handing millions (or even billions) of click requests per minute–given that each click request can be designed as a fixed-sized record (or a record with a maximum known size), click requests can be accumulated into a fixed size array with certain summary data accumulated in global variables during store. Then, when a maximum time (or a maximum number) is reached, the formatted buffer and summary data can be spilled into a single write request into upstream systems which may only need the summary data.

Then Came The Heap.

Not all programs work well with the fixed allocation of objects. Sometimes you need a linked list of objects, or a heap list, or other more complex data structures where you cannot know a priori the number of records you will be handling.

Heap processing more or less works the same regardless of the programming language: there is a mechanism by which you can request a chunk of memory at a given size, and another call to release that memory. Sometimes there is a call that allows you to resize that memory block; resizing a memory block can be handled by releasing the old block and allocating a new one, copying the contents from one location to another.

Heap allocation works by using a large chunk of RAM dedicated to that heap, and, on a request for a chunk of memory, reserves it in RAM. There are many ways this can be done; the easiest to explain is the greedy algorithm, which reserves the first chunk of memory that can be found.

Allocated memory requires some bookkeeping information so the memory allocation code can know how to handle this chunk of memory: is it allocated, how big is the chunk that is reserved. So when you allocate memory, generally additional space is reserved prior to the address pointer with bookkeeping data; this is then used to resize memory (if resizing is allowed), and to know how much memory to mark as free when the free routine is called.

allocation.png

A simple malloc() and free()

We can easily implement a simple malloc and free routine to illustrate how a typical heap may work. We’ll make two assumptions, which are common to today’s modern processors. First, all allocated memory will be aligned (for efficiency sake) on a 16-byte boundary. (Some processors address things on 16-byte boundaries far more efficiently.) Second, we will use a four byte bookkeeping object which gives the size of the memory allocated, along with a 1 bit flag indicating if the area of memory is free or still allocated. Because we know all memory is aligned on a 16 byte boundary we know the least significant bit of the 32-bit length will never be used, so we use that bit for the free flag; this way we only need to use 4 bytes for our bookkeeping data.

By sketching out the code for a simple malloc() and free() we can also illustrate some of the interesting bugs that can come up while using the heap, such as accidentally using memory that was freed previously.

separator.png

Footnote: We use ‘uint8_t’ to refer to 8-bit bytes, and ‘uint32_t’ to refer to 32-bit bytes. Because on most modern operating systems, memory can be addressed on a byte boundary, we can add or subtract from an address pointer by converting that pointer to a pointer to a byte object. For example:

a = (uint32_t *)(((uint8_t *)a) + 3);

The expression above will add 3 to the address register in ‘a’, pointing three bytes in from where it was pointing before.

Notice if we were doing this for a microprocessor with 16-bit addresses, we could use a uint16_t, or 2 byte integer, for the bookkeeping area instead.

separator.png

Before we can use the memory, we must initialize the memory heap area to note that the entire heap is free. This means we need to set up a bookkeeping header that indicates that all of memory (minus the area for bookkeeping) is free. We also need to make sure the free memory itself is aligned on a 16 byte boundary–which means we must waste the first 12 bytes of memory. More advanced memory allocation schemes may make use of that area for other bookkeeping information, but for now we simply waste the memory.

When initialized our RAM area will look like:

memfreeheap.png

Our initialization routine will look like:

/*	initialize_ram
 *
 *		Given the block of ram memory, we mark the entire 
 *	block as free. We do this by setting up free block 16 
 *	bytes in; we do this so that the first allocated block
 *	will be aligned on a 16 byte boundary. This wastes the
 *	bottom 12 bytes of memory which could, for a more
 *	sophisticated system, be used for other stuff.
 *
 *		We have GRam point to the first byte of RAM, and
 *	RAMSIZE is the size of the RAM area being managed.
 */

void initialize_ram()
{
	/* Find the location of the bookkeeping block for this.
	 * The address is 12 bytes in; 16 bytes for alignment 
	 * minus 4 bytes for the bookeeping area
	 */
	
	uint32_t *bookkeeping = (uint32_t *)(((uint8_t *)GRam) + 12);
	
	/*
	 *	Now mark the memory as free. The free size is 16
	 *	bytes smaller than the max ram size, and mark it
	 *	as free. We assume RAMSIZE is divisible by 16.
	 */
	
	*bookkeeping = (RAMSIZE - 16) | 1;
	
	/*
	 *	And finally mark the end bookkeeing block. Because of the way
	 *	we allocate memory, the top 4 bytes must be specially
	 *	marked as a reserved block. That's so we know we have
	 *	hit the top.
	 */
	
	bookkeeping = (uint32_t *)(((uint8_t *)GRam) + RAMSIZE - 4);
	*bookkeeping = 4;		// marked as reserved 4 bytes.
}

Free is simple. Because our convention is that a block of memory is considered free if the least significant bit of the 32-bit bookkeeping area is set, free simply sets that bit.

/*	free
 *
 *		Free memory. This works by finding the bookkeeping block
 *	that is 4 bytes before the current pointer, and marking the
 *	free bit.
 */

void free(void *ptr)
{
	/*
	 *	Subtract 4 bytes from the current pointer to get the
	 *	address of the bookkeeping memory
	 */
	
	uint32_t *bookkeeping = (uint32_t *)(((uint8_t *)ptr) - 4);
	
	/*
	 *	Mark the memory as free by setting the lowest bit
	 */
	
	*bookkeeping |= 1;
}

Most of the meat of our memory management system will be in the alloc call. This has to scan through all the free blocks, finding the first chunk of memory that can be reserved. We also, by convention, return NULL if there is no memory left that can be allocated.

The first thing our allocation routine does is find out how big a chunk of memory we need to reserve. Because we must guarantee everything is aligned on a 16-byte boundary, this means a request for 2 bytes must reserve 16 bytes; that way, the next allocation request will also be aligned correctly. (More sophisticated memory management systems may know enough to subdivide memory for smaller allocation requests; we don’t do that here.)

So we must calculate the size, including the extra 4 bytes needed for our bookkeeping block:

	/*
	 *	Step 1: Change the size of the allocation to align
	 *	to 16 bytes, and add in the bookkeeping chunk that
	 *	we allocate. Our pointer will return the first
	 *	byte past the bookkeeping memory.
	 */
	
	size = size + 4;		// Add bookkeeping
	if (0 != (size % 16)) size += 16 - (size % 16);	// align

Next, we need to scan through memory from the first block in memory, searching for a collection of free blocks which are big enough for us to reserve for our requested block.

	/*
	 *	Step 2: scan to find a space where this will fit.
	 *	Notice that we may have to piece together multiple
	 *	free blocks, since our free doesn't glue together
	 *	free blocks.
	 */
	
	ptr = (uint32_t *)(((uint8_t *)GRam) + 12);
	end = (uint32_t *)(((uint8_t *)GRam) + RAMSIZE);
	while (ptr < end) {
		if (0 == (1 & *ptr)) {
			/* Lowest bit is clear; this is allocated memory. */
			/* The bookkeeping area holds the total size of the */
			/* allocated area in bytes; thus, we can skip to */
			/* the next block by adding the bookkeeping area */
			/* to the current block. */
			ptr = (uint32_t *)(((uint8_t *)ptr) + *ptr);
		} else {
			/*
			 *	This area is free. Note that this is found, then
			 *	start scanning free areas until we piece together
			 *	something big enough to fit my request
			 */
			
			found = ptr;
			asize = *ptr & ~1UL;		// Get the size, clearing free bit
			ptr = (uint32_t *)(((uint8_t *)ptr) + (*ptr & ~1UL));	// next block
			while ((asize < size) && (ptr < end)) {
				if (0 == (1 & *ptr)) {
					/* We bumped against an allocated block of memory */
					/* Exit this loop and continue scanning */
					break;
				}
				
				/* Block is free. Add it up to this block and continue */
				asize += *ptr & ~1UL;
				ptr = (uint32_t *)(((uint8_t *)ptr) + (*ptr & ~1UL));	// next block
			}
			if (asize = end) return NULL;		// Could not find free space

This gets a bit complicated.

First, we set ptr and end to point to the start block and the end of memory, respectively. Next, we walk through while our pointer has not reached the end, looking for free memory.

Now our free() routine simply marks the memory as free; it doesn’t gather up blocks of free memory and glue them together. So our allocation routine has to do the job. When we first encounter a free block, we note how much memory the free block represents in asize, and we put the start of that free block in found. We then continue to scan forward until we either piece enough memory together to satisfy the size we need, or until we find a reserved block which prevents us from using this chunk of free memory.

If we run out of memory: that is, if the pointer ptr goes past end, then we can’t satisfy this request, so we return NULL.

Now that we’ve found a chunk of memory that satisfies our request, we piece together the free blocks, breaking the last free block in the list of free blocks if needed.

When this chunk of code is reached, found points to the start of our free memory, and ptr points to the last free block in the list of free blocks that may need to be split.

We write the correct bookkeeping data to mark the memory as allocated, and if that is shy of the end of the free blocks, we then write a free block bookkeeping mark:

	/*
	 *	Step 3: mark the block as allocated, and split the last free block
	 *	if needed
	 */
	
	*found = size;						// mark the size we've allocated.
	if (size < asize) {
		/* We have more than enough memory. Split the free block */
		/* First, find the pointer to where the free block should go */
		ptr = (uint32_t *)(((uint8_t *)found) + size);
		
		/* Next, mark this as a free block */
		ptr = (asize - size) | 1;
	}

This works correctly because asize is the total size of the range of free blocks we just glued together, so (uint8_t *)found + asize will point to the next block of memory past the free memory we just found.

Now that we’ve peeled off a chunk of memory, we need to return the memory itself. Up until now our pointers have been pointing at the first byte of the 4-byte bookkeeping record; the memory we’re allocating is just past that 4 byte record. So we need to return the first byte of our allocated memory itself:

	
	/*
	 *	The found pointer points to the bookkeeping block. We need to
	 *	return the free memory itself, which starts with the first byte
	 *	past the bookkeeping mark.
	 */
	
	return (void *)(((uint8_t *)found) + 4);

Putting our alloc routine together we get:

/*	alloc
 *
 *		Allocate the requested memory by scanning the heap
 *	of free blocks until we find something that will fit.
 *	We then split the free blocks and return the allocated
 *	memory.
 *
 *		If we are out of memory, we return NULL.
 */

void *alloc(uint32_t size)
{
	uint32_t *ptr;
	uint32_t *end;
	uint32_t *found;
	uint32_t asize;
	
	/*
	 *	Step 1: Change the size of the allocation to align
	 *	to 16 bytes, and add in the bookkeeping chunk that
	 *	we allocate. Our pointer will return the first
	 *	byte past the bookkeeping memory.
	 */
	
	size = size + 4;		// Add bookkeeping
	if (0 != (size % 16)) size += 16 - (size % 16);	// align
	
	/*
	 *	Step 2: scan to find a space where this will fit.
	 *	Notice that we may have to piece together multiple
	 *	free blocks, since our free doesn't glue together
	 *	free blocks.
	 */
	
	ptr = (uint32_t *)(((uint8_t *)GRam) + 12);
	end = (uint32_t *)(((uint8_t *)GRam) + RAMSIZE);
	while (ptr < end) {
		if (0 == (1 & *ptr)) {
			/* Lowest bit is clear; this is allocated memory. */
			/* The bookkeeping area holds the total size of the */
			/* allocated area in bytes; thus, we can skip to */
			/* the next block by adding the bookkeeping area */
			/* to the current block. */
			ptr = (uint32_t *)(((uint8_t *)ptr) + *ptr);
		} else {
			/*
			 *	This area is free. Note that this is found, then
			 *	start scanning free areas until we piece together
			 *	something big enough to fit my request
			 */
			
			found = ptr;
			asize = *ptr & ~1UL;		// Get the size, clearing free bit
			ptr = (uint32_t *)(((uint8_t *)ptr) + (*ptr & ~1UL));	// next block
			while ((asize < size) && (ptr < end)) {
				if (0 == (1 & *ptr)) {
					/* We bumped against an allocated block of memory */
					/* Exit this loop and continue scanning */
					break;
				}
				
				/* Block is free. Add it up to this block and continue */
				asize += *ptr & ~1UL;
				ptr = (uint32_t *)(((uint8_t *)ptr) + (*ptr & ~1UL));	// next block
			}
			if (asize = end) return NULL;		// Could not find free space
	
	/*
	 *	Step 3: mark the block as allocated, and split the last free block
	 *	if needed
	 */
	
	*found = size;						// mark the size we've allocated.
	if (size < asize) {
		/* We have more than enough memory. Split the free block */
		/* First, find the pointer to where the free block should go */
		ptr = (uint32_t *)(((uint8_t *)found) + size);
		
		/* Next, mark this as a free block */
		ptr = (asize - size) | 1;
	}
	
	/*
	 *	The found pointer points to the bookkeeping block. We need to
	 *	return the free memory itself, which starts with the first byte
	 *	past the bookkeeping mark.
	 */
	
	return (void *)(((uint8_t *)found) + 4);
}

We can now use this code to illustrate some interesting things about heap memory allocation.

First, notice how free() simply marks the memory as free, but without overwriting the memory. This is why the following code, while quite illegal, can still work:

int error1()
{
	/* Allocate some memory and initialize it */
	int *someWord = (int *)alloc(sizeof(int));
	*someWord = 5;
	
	/* Free the memory */
	free(someWord);
	
	/* Now access the freed memory */
	return *someWord;
}

This is not guaranteed to work. Far from it, it’s illegal to access memory that was released after it was released–you don’t know what is going to happen to that chunk of memory. But in most cases, it’s simply marked as no longer reserved–but the values are still there in memory.

And this becomes a problem if the memory is reallocated to some other pointer which does something else with the memory:

int error2()
{
	/* Allocate some memory and initialize it */
	int *someWord = (int *)alloc(sizeof(int));
	*someWord = 5;
	
	/* Free the memory */
	free(someWord);
	
	/* Allocate some other memory */
	int *someOtherWord = (int *)alloc(sizeof(int));
	*someOtherWord = 6;
	
	/* Now access the freed memory */
	return *someWord;
}

What makes bugs like this very frustrating to find is that, in general, the patterns of allocs and frees are not quite so uniform. It can be quite unpredictable; for example, while the above probably will cause 6 to be returned using our version of alloc and free, the following may or may return 5 or 6 or even some other value, depending on how memory has been fragmented in the past:

int error3()
{
	/* Allocate some memory and initialize it */
	int *someWord = (int *)alloc(sizeof(int));
	*someWord = 5;
	
	/* Free the memory */
	free(someWord);
	
	/* Allocate some other memory */
	int *someOtherWord = (int *)alloc(sizeof(28));
	*someOtherWord = 6;
	
	/* Now access the freed memory */
	return *someWord;
}

Because the size of the second allocation is different than the first, it may or may not use the previously allocated memory.

Second, over time memory can fragment. Fragmentation can cause things to slow down over time, and they can even put you in the situation where after doing a bunch of small allocations and frees, you may still have plenty of memory left–but no block is large enough to put your request.

Different memory management methods attempt to resolve this problem using various techniques, of course–and on most modern operating systems there is plenty of memory so fragmentation is unlikely.

separator.png

As an aside, sometimes it is useful to allocate a whole bunch of tiny little objects and release them all at once. For example, a 3D rendering program may dynamically allocate a whole bunch of objects during rendering–but free them only after the entire image is drawn on the screen.

To do this you can allocate large chunks of memory, then subdivide the memory as needed, keeping the large allocated chunks of memory in a linked list, so when it comes time to free all of memory, the free operation can be done in one call.

separator.png

Another interesting thing to point out is that memory has to be explicitly allocated or freed. We are just one step above address registers in the microprocessor; our heap is something we must manually maintain. If we set a pointer to a new address, and we don’t free the memory that our pointer used to be pointing to, the free() routine has no idea that our memory must be freed.

In C and C++ this is the status quo: you must explicitly allocate or free memory. C++ adds the concept of ‘new’ and ‘delete’ which call a class constructor after the memory is allocated and the class destructor before the memory is freed; however, memory must still be explicitly allocated or freed.

In a world where there is only global memory, auto (stack-based) memory and heap memory this works okay. But once we start talking about object-oriented programming it is natural for us to talk about two pointers pointing to the same object in memory. And knowing when that object should be freed() becomes far more complicated than in the traditional procedural-based allocation scheme where we guarantee by convention that only one pointer holds onto a chunk of memory at a time.

And there are two ways we can solve this problem: resource counting and garbage collection.

Garbage Collection

Garbage collection is a technique whereby the operating system will automatically find memory that is no longer being used. The advantage of garbage collection is that you don’t have to remember to call free(). The disadvantage is that garbage collection can be computationally expensive and hard to get right.

There are several ways to handle garbage collection. The technique I’ll outline here is the simple mark and sweep technique to find (and mark) all memory that is currently being used, then sweeping through and freeing memory that is not marked.

Essentially it works like this.

With each allocated chunk of memory, we also reserve a bit used to mark that memory. From our allocator above, we could use the second to least significant bit to represent marking. We need a mark routine to mark the memory as such:

/*	mark
 *
 *		This marks memory. This marks the pointer by flipping the mark
 *	bit
 */

void mark(void *ptr)
{
	if (ptr == NULL) return;
	
	/*
	 *	Subtract 4 bytes from the current pointer to get the
	 *	address of the bookkeeping memory
	 */
	
	uint32_t *bookkeeping = (uint32_t *)(((uint8_t *)ptr) - 4);
	
	/*
	 *	Mark the memory by setting the second lowest bit
	 */
	
	*bookkeeping |= 2;
}

The first step to garbage collection is to do the mark phase: to mark all of the memory that is currently referred to by other chunks of memory in our system. To do this we use a ptr variable which points at the current block; as we sweep forward across all the blocks in the system, if we find a block that is marked, we then try to find all the pointers in that block, and mark them. This repeated marking continues until we no longer have any new memory blocks that need to be marked.

After we’ve done this marking, we sweep, freeing all blocks of memory that is not marked.

The hard part of any memory collection mechanism is to know in global memory and on the stack which are the address pointers and which are not. Languages such as Java keep class information around to allow the garbage collector to know exactly which things in memory refer to address pointers. Other languages, such as C, do not maintain this information–and so the garbage collector effectively “guesses.”

We assume in our code we have three methods: one which will mark all pointers in global memory and on the stack, another which returns the number of pointers inside an allocated memory object, and a third which returns the pointers in an allocated memory object.

/*	allocGC
 *
 *		Allocate but with a garbage collector. We do the mark/sweep phase
 *	on memory. We rely upon two other calls: a call to mark all pointers in
 *	global memory and on the stack, and a call which can tell us within an
 *	allocated chunk of memory which are the pointers by marking them.
 *
 *		In both cases we assume the mark routine will call mark() above
 *	to mark the memory as in use
 */

void *allocGC(uint32_t allocLen)
{
	uint32_t *ptr;
	uint32_t *end;
	uint32_t *tmp;
	uint32_t *tmp2;
	uint16_t len,i;

	/*
	 *	Try to allocate
	 */
	
	ptr = alloc(allocLen);
	
	/*
	 *	Out of memory?
	 */
	
	if (ptr == NULL) {
		/*
		 *	Out of memory; do garbage collection. First, clear the mark
		 *	bits for allocated memory
		 */
		
		ptr = (uint32_t *)(((uint8_t *)GRam) + 12);
		end = (uint32_t *)(((uint8_t *)GRam) + RAMSIZE);
		while (ptr < end) {
			*ptr &= ~2UL;		// clear second least bit
								// move to next block in memory
			ptr = (uint32_t *)(((uint8_t *)ptr) + (*ptr & ~1UL));
		}
		
		/*
		 *	Now ask to mark memory on the stack and in global heap space
		 */
		
		markGlobalStack();
		
		/*
		 *	Run through and mark all references. We rely upon the 
		 *	routines numPointersInAllocBlock and pointerInAllocBlock
		 *	to return the pointers inside a block
		 */
		
		ptr = (uint32_t *)(((uint8_t *)GRam) + 12);
		while (ptr < end) {
			/*
			 *	If this pointer is marked, then find all the pointers that
			 *	are inside this pointer, and mark them, moving the pointer
			 *	backwards to the earliest unmarked object
			 */
			
			if (0 != (*ptr & 2)) {
				/*
				 *	Memory marked. Find all the pointers inside
				 */
				
				tmp2 = ptr;
				len = numPointersInAllocBlock(ptr);
				for (i = 0; i  tmp2) tmp2 = tmp;
						*tmp &= 2;
					}
				}
				
				/*
				 *	We may have moved tmp2 before ptr; this means we need
				 *	to pick up sweeping from tmp2, since we have a pointer
				 *	pointing backwards in memory
				 */
				
				ptr = tmp2;
			}
			
			/*
			 *	Move to next block
			 */
			
			ptr = (uint32_t *)(((uint8_t *)ptr) + (*ptr & ~1UL));
		}
		
		/*
		 *	Now that we've gotten here, we need to free all allocated memory
		 *	that is not marked
		 */
		
		ptr = (uint32_t *)(((uint8_t *)GRam) + 12);
		while (ptr < end) {
			if (0 == (0x3 & *ptr)) {		// allocated, unmarked?
				*ptr |= 1;					// mark as freed
			}
			/*
			 *	Move to next block
			 */
			
			ptr = (uint32_t *)(((uint8_t *)ptr) + (*ptr & ~1UL));
		}
		
		/*
		 *	Now try again
		 */
		
		ptr = alloc(allocLen);
	}
	return ptr;
}

Essentially our garbage collector starts by sweeping all memory and clearing the mark bit.

gc1.png

We start by marking all the objects (using markGlobalState) that are pointed to by objects on the stack, or by objects in global memory:

gc2.png

Now we start in the loop to run through the blocks. Our ptr routine searches for the next marked block of memory, and then marks all the blocks that object points to:

gc3.png

We continue to mark foward, moving the pointer backwards only if we encounter a pointer to an earlier block in memory that is currently unmarked. This rewinding of the pointer is necessary to handle backwards pointing references without having to rewalk all of the blocks from the start of the heap:

gc4.png

(Because this pointer refers to something before the previous pointer, we move our pointer backwards:)

gc5.png

Once we’re done–and this is guaranteed to complete because there is only a finite number of objects, and we only rewind the pointer when something is unmarked–we then sweep through all of memory, marking as free objects we were unable to reach. We know these objects are freed because we could not reach them:

gc6.png
separator.png

Reference Counting

Garbage collection is very difficult to do correctly. You need to have a method, no matter in what state you’re in, to know where the pointers are, and what they point to. And you have to have a way to look inside of every object and know what chunks of memory in the heap are pointers, in order to correctly mark things.

In other words, you need to provide markGlobalState, numPointersInAllocBlock, and pointerInAllocBlock or the equivalent.

An easier way to track the pointers pointing to an object is by tracking a reference count associated with each object. This requires some work on the part of the developer; in fact, it requires that you explicitly call routines similar to alloc and free to keep track of the reference count to a collection of objects. On the other hand, it doesn’t require a lot of work to get working correctly. And this technique has been adopted by Microsoft’s COM system and Apple’s Objective-C on the Macintosh or iOS operating systems.

(Yes, I know the latest versions of Objective C on the Macintosh provide garbage collection. However, you can still do reference counting, and you must do reference counting on iOS.)

separator.png

Reference counting is extremely easy to do. Essentially it involves having the objects you want managed via reference counting internally store a reference count. Newly allocated objects set the reference count to 1, and if the reference count reaches zero, the object frees itself.

In C++ we can easily declare a base object for reference counting:

class BaseObject
{
	public:
					BaseObject();
		
		void			Retain();
		void			Release();
		
	protected:
		virtual		~BaseObject();

	private:
		uint32_t		fRefCount;
};

Our base class stores a reference count, called fRefCount. When we allocate our object, we first set the reference count to 1:

BaseObject::BaseObject()
{
	fRefCount = 1;
}

We then need to mark something as retained: meaning there is a new pointer pointing to the same object. The retain method can be written:

void BaseObject::Retain()
{
	++fRefCount;
}

Releasing the object then decrements the count, and if the count reaches zero, frees the object:

void BaseObject::Release()
{
	if (0 == --fRefCount) {
		delete this;
	}
}

In Objective C (but not on Microsoft COM) we have an additional method, called “autorelease”, which adds the object to an NSAutoreleasePool, which automatically calls release when the pool is drained, either explicitly or implicitly at the end of each event loop. In C++ we can do something similar by extending our base class by adding an array of object pointers:

class BaseObject
{
	public:
					BaseObject();
		
		void			Retain();
		void			Release();
		void			AutoRelease();
		
		static void	Drain();
		
	protected:
		virtual		~BaseObject();

	private:
		uint32_t		fRefCount;
		static std::vector gPool;
};

We then add an AutoRelease and a Drain methods:

void BaseObject::AutoRelease()
{
	gPool.push_back(this);
}

void BaseObject::Drain()
{
	/* Run through our array of objects */
	int i,len = gPool.size();
	for (i = 0; i Release();
	}
	/* Now erase the array */
	gPool.clear();
}
separator.png

Notice that simple assignment of pointers doesn’t actually call Retain or Release anymore than it automatically called alloc() and free() described earlier. Instead, we must use a coding convention to know when we should Retain, when we should Release, and (if present) when we should AutoRelease.

The Microsoft COM rules are quite simple: if a function call returns a pointer, you must release that pointer in your routine when you are no longer using it. So, for example, suppose we have a routine GetThing() which returns a BaseObject:

BaseObject *GetThing()
{
	return new BaseObject();
}

Then a caller must in turn release the value:

void UseBaseThing()
{
	BaseObject *obj = GetThing();
	
	/* Do some stuff to this */
	
	obj->Release();
}

Now if the routine GetThing is returning a reference to an object that it is storing in memory, then when the object is returned, the routine’s return value “owns” the reference to that global object, but the ownership must continue to be held locally. So you would use Retain:

BaseObject *gPtr;

BaseObject *GetThing()
{
	gPtr->Retain();
	return gPtr;
}

And similarly, if the caller function wants to hold onto the pointer (say, by storing it in a global variable), rather than call retain we simply keep ownership of the pointer:

BaseObject *gPtr2;

void UseBaseThing()
{
	BaseObject *obj = GetThing();
	/* Store object; don't release it--we still own the reference. */
	gPtr2 = obj;
}

The rule is quite simple: if a pointer is returned, the pointer must be released. But it adds complexity: it means you must constantly be calling the ‘Release’ method all the time, and that can become quite cumbersome.

On the other hand, and the key point to all of this, is that the retain and release rules are simple–and they are local: you don’t need to know how any other module in the system works, you only need to know what to do locally.

separator.png

Apple gets around the problem of constantly having to release objects (and retain objects that are held locally) by introducing the -autorelease method.

On the Macintosh (and iOS), the rules are given on Apple’s web site. The two rules are:

(1) You gain ownership of an object only if you create it (using -alloc or any method that starts with ‘new’, or contains ‘copy’), or if you explicitly retain the object.

(2) You release or autorelease the object when you no longer need to hold ownership of the object.

Using these two rules, we wind up writing less code–but things can get a little more complicated. In our example above, our ‘GetThing’ routine, because it does not start with ‘new’, would simply return the object:

BaseObject *gPtr;
+ (BaseObject *)getThing
{
	return gPtr;
}

If we were allocating this object (as in our first example), we would, because of the naming convention either rename our method to ‘newThing’:

+ (BaseObject *)newThing
{
	return [[BaseObject alloc] init];
}

Or we would need to make sure that ownership is not passed up by marking the object as autorelease:

+ (BaseObject *)getThing
{
	return [[[BaseObject alloc] init] autorelease];
}

In fact, this pattern is so common you’ll find yourself typing “alloc] init] autorelease]” nearly on autopilot.

The call “UseBaseThing” is similarly changed, depending on how we’re using it. If we don’t hold onto a reference to the object, we would not need to call ‘release’ because we aren’t hanging onto the object:

+ (void)useBaseThing
{
	BaseObject *obj = [BaseObject getThing];
	
	/* Do some stuff to this */
	/* Notice we don't release this object */
}

And if we are hanging onto the object, we must retain the object:

+ (void)useBaseThing
{
	BaseObject *obj = [BaseObject getThing];
	gPtr2 = [obj retain];
}

Likewise, if we are calling newThing, we’d be getting ownership back from the call–so we would need to release it as appropriate. So, our examples would be:

+ (void)useBaseThing
{
	BaseObject *obj = [BaseObject newThing];
	
	/* Do some stuff to this */

	[obj release];
}

And, if holding the object, we already have ownership, so we don’t need to release it:

+ (void)useBaseThing
{
	BaseObject *obj = [BaseObject getThing];
	gPtr2 = obj;
}
separator.png

Notice in all of the above, simply assigning or copying pointers around in a pointer doesn’t actually do anything on its own. A pointer is simply like an integer, but with the integer referring to an address in memory. In all of this evolution from fixed locations in RAM to garbage collection and object reference counting, we have never changed the immutable fact that simply copying or adding values to an address doesn’t affect how heap memory is handled. Garbage collection takes place after an object is no longer pointed to–and sometimes objects that are no longer referenced by a pointer can live for a very long time.

There are other subtleties that can take place here: different versions of memory management tools may do different things when a chunk of memory is allocated or freed. Some test tools may even store the location in a program where an object is allocated, so if there is an unbalanced alloc/free cycle, the line of code in error can be discovered easily. And there are many flavors of garbage collection out there which each behave differently, though ultimately result in the same end.

Further, with reference counting, cycles of objects can easily be created which cause the objects to linger long after those objects are no longer actually in use. It’s why it is important to think about cycles of pointers, especially when developing UIView objects which may hold references to other UIViews to perform certain operations.

In fact, a very common bug is to create one UIView that refers to another:

@class BView;
@interface AView : UIView
{
	BView *view;
}

@property (retain) BView *view;
@end

@interface BView : UIView
{
	AView *view;
}

@property (retain) AView *view;
@end

The problem here is that if AView and BView are part of the same view hierarchy, and are assigned to each other, then when the view hierarchy is released, AView and BView retain each other–preventing the memory from being reclaimed even though it is no longer being used.

In cases like this, when a third object holds two others which refer to each other–in this case, implicitly, by the UIWindow holding all of the views in the view hierarchy–it is better if only an unretained reference is held to these objects instead:

@class BView;
@interface AView : UIView
{
	BView *view;
}

@property (assign) BView *view;
@end

@interface BView : UIView
{
	AView *view;
}

@property (assign) AView *view;
@end

Assignment is not ownership, and when the window is released, AView and BView will both be released successfully.

separator.png

Hopefully this brief overview of memory management, from setting fixed locations in RAM to heaps to garbage collection and reference counting, will give you a good idea of how memory management actually works–and what the pitfalls are.

The key takeaways should be:

(1) Assigning pointers is not ownership. Simply writing Pointer *a = b; doesn’t actually grant ownership to ‘a’ when it points to ‘b’. You must, unless in a garbage collection environment, explicitly ask for and release ownership.

(2) If not using reference counting, the assumption is that only one pointer actually “owns” an object. For object-oriented programming this is too strict a restriction, and thus we must either introduce garbage collection or establish reference counting and conventions on how references are counted.

(3) For reference counting systems, there are (to my knowledge) two conventions for reference counting: the Microsoft COM convention, and the Apple Objective-C convention. You must also be aware of cyclical ownership references: if you accidentally grant ownership that turns into a cycle or ring of ownership, objects may leak because they mutually refer to each other, without actually being used anywhere else.