We’re finally at the last stretch for 3D projection! In this post, we want to transform what is known as the view frustum[1] into a cube. Recall the perspective view frustum from the previous post:

Let’s first consider how to map the screen coordinates. These would be the x- and y-coordinates. I have conveniently avoided discussing the near and far clipping planes in previous posts, but their presence is worth discussing. In reality, we don’t have clipping planes in our view. That’s why you can see Mars with your naked eye at night, but most games seem to have some sort of fog or terrain in the distance that can never be reached[2]. Some of the reasons for this limitation are:

  1. Depth information is stored as a finite-precision floating-point value. You could increase the precision, but you could never match reality (infinite viewing distance) without tanking performance or running into z-fighting issues[3].
  2. The next limitation is likely your graphics card. You would not be able to render the number of objects that could fit in an infinitely long view frustum without running out of memory. Even if you could, it would definitely not be in real time[4].

Derivation

Starting with the $y$-coordinate, let’s imagine the $x$-axis is pointing out of the screen from the origin (that is, we are looking down the $x$-axis).

This makes it clear that the screen y-coordinate can be determined purely from values on the y- and z-axes. We’re using a right-handed view space, in which the camera “looks down” its negative z-axis, but we have to map to the canonical view volume, which is left-handed. So this is the frustum after the camera has placed all objects in the world relative to its own origin in a right-handed coordinate system. This took me a long time to digest, so feel free to draw it out or do the maths with me to help your understanding[5].

If you want to, you can depart from my derivations and try to work in a purely left-handed system (from world coordinates all the way to screen projection). You will run into fewer issues and probably tear out less hair. Since I apparently have a penchant for pain, we’ll keep moving forward with a mixed system.

So, from the above, we can determine the value of $y_s$ ($y$ projected onto the screen) using similar triangles[6], noting that the point (vertex) in space that we are projecting is at $(x, y, z)$.

$$
\begin{align}
\frac{y_s}{-n} &= \frac{y}{z} \\
\therefore y_s &= \frac{ny}{-z}
\end{align}
$$

The near and far clipping planes are simply specified as positive values, so I’ve explicitly negated them so that their signs match the $z$-coordinate’s sign[7]. Similarly, we can imagine looking down the $y$-axis to determine the screen $x$-coordinate, $x_s$:

$$
\begin{align}
\frac{x_s}{-n} &= \frac{x}{z} \\
\therefore x_s &= \frac{nx}{-z}
\end{align}
$$
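To make the similar-triangles result concrete, here is a minimal Python sketch. The function name `project_onto_near_plane` and the sample point are my own illustrations, and the sketch assumes a right-handed view space in which visible points have a negative z-coordinate.

```python
def project_onto_near_plane(x, y, z, n):
    """Project a view-space point onto the near plane via similar triangles.

    Assumes the camera looks down its negative z-axis (so z < 0 for
    visible points) and that n > 0 is the distance to the near plane.
    """
    if z >= 0:
        raise ValueError("point must be in front of the camera (z < 0)")
    x_s = n * x / -z
    y_s = n * y / -z
    return x_s, y_s


# A point twice as far away as the near plane lands at half its offset.
print(project_onto_near_plane(1.0, 2.0, -2.0, n=1.0))  # (0.5, 1.0)
```

Note the division by `-z` rather than `z`: it keeps the projected point on the same side of the axis as the original point.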

I’ve moved the sign next to the $z$-coordinate because we are going to take advantage of homogeneous coordinates again. We can construct the desired transformation matrix as follows:

$$
\begin{bmatrix} n & 0 & 0 & 0 \\ 0 & n & 0 & 0 \\ 0 & 0 & m_1 & m_2 \\ 0 & 0 & \color{red}{-1} & 0 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
=
\begin{bmatrix} nx \\ ny \\ z_s = z^2 \\ -z \end{bmatrix}
$$

The right-hand side holds the values before the division by the fourth component, $-z$; dividing $nx$ and $ny$ by it gives $x_s$ and $y_s$.

To arrive at the above, I drew an empty 4x4 transformation matrix (a partial perspective projection). I then filled in the coordinates for a vertex $(x, y, z, 1)$ and the resultant answer. I knew that I wanted to divide the $x$- and $y$-coordinates by $-z$, so I reserved that in the final row of the resultant vector through the use of the $\color{red}{-1}$.

Now, this forces us to divide the $z_s$ coordinate by $-z$ when going from clip space to NDC. We want to preserve the initial $z$ as-is, but with the negative sign removed, so that it can be passed to the left-handed orthographic projection derived in the previous post. Therefore we must have:

$$
\begin{align}
\frac{z_s}{-z} &= -z \\
z_s &= z^2
\end{align}
$$

Finally, we have the third row of the transformation matrix. We can intuitively assume that $x$ and $y$ do not contribute to remapping the $z$-coordinate back to its original scale (with the sign flipped to positive), so the first two elements of the row are zero. That leaves the last two elements of the row, $m_1$ and $m_2$, as unknowns:

$$
m_1z + m_2 = z^2
$$

A linear function of $z$ cannot equal $z^2$ everywhere, so we only require this equation to be satisfied at the near and far clipping planes, $z = -n$ and $z = -f$:

$$
\begin{align*}
-m_1f + m_2 &= f^2 \tag{1} \\
-m_1n + m_2 &= n^2 \tag{2} \\
m_2 &= n^2 + m_1n \tag{3}
\end{align*}
$$

Substituting (3) into (1) we have:

$$
\begin{align*}
-m_1f + n^2 + m_1n &= f^2 \\
m_1(n-f) + n^2 &= f^2 \\
m_1 &= \cfrac{f^2 - n^2}{n-f} \\
&= \cfrac{(f-n)(f+n)}{n-f} \\
&= -\cfrac{(n-f)(f+n)}{n-f} \\
&= -(f+n) \\
\therefore m_1 &= -f-n \tag{4}
\end{align*}
$$

Finally, substituting (4) into (3) we have:

$$
\begin{align*}
m_2 &= n^2 + (-f-n)n \\
&= n^2 - fn - n^2 \\
\therefore m_2 &= -fn
\end{align*}
$$
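As a sanity check on the algebra, the following Python sketch solves equations (1) and (2) numerically and compares the result with the closed forms $-(f+n)$ and $-fn$. The function name `solve_depth_row` and the sample distances are illustrative assumptions, not part of the derivation.

```python
def solve_depth_row(n, f):
    """Solve -m1*f + m2 = f^2 and -m1*n + m2 = n^2 for (m1, m2)."""
    # Subtracting (2) from (1) eliminates m2: m1 * (n - f) = f^2 - n^2.
    m1 = (f * f - n * n) / (n - f)
    # Back-substitute into (3): m2 = n^2 + m1 * n.
    m2 = n * n + m1 * n
    return m1, m2


n, f = 1.0, 10.0  # arbitrary sample clipping distances
m1, m2 = solve_depth_row(n, f)
print(m1, m2)  # -11.0 -10.0, i.e. -(f + n) and -f * n

# The row reproduces z^2 exactly at the two clipping planes...
print(m1 * -n + m2, m1 * -f + m2)  # 1.0 100.0
# ...but not in between, where the linear row and z^2 differ.
print(m1 * -5.0 + m2, (-5.0) ** 2)  # 45.0 25.0
```

The last line shows that the mapping only agrees with $z^2$ at the planes themselves; depth values between the planes are remapped non-linearly.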

Substituting in, we get the perspective-to-orthographic transformation matrix $P_O$:

$$
\begin{align*}
P_O = \begin{bmatrix} n & 0 & 0 & 0 \\ 0 & n & 0 & 0 \\ 0 & 0 & -f-n & -fn \\ 0 & 0 & -1 & 0 \end{bmatrix}
\end{align*}
$$
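Here is a small Python sketch of what this matrix does to a point once the homogeneous divide is applied. The helper names (`mat_vec`, `apply_with_divide`) and the sample values are my own, and the matrix is stored as a row-major nested list purely for readability.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (row-major nested lists) by a 4-vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]


def perspective_to_orthographic(n, f):
    """The perspective-to-orthographic matrix for clipping distances n and f."""
    return [
        [n, 0.0, 0.0, 0.0],
        [0.0, n, 0.0, 0.0],
        [0.0, 0.0, -f - n, -f * n],
        [0.0, 0.0, -1.0, 0.0],
    ]


def apply_with_divide(m, point):
    """Transform (x, y, z, 1) and divide by the fourth component."""
    x, y, z = point
    cx, cy, cz, cw = mat_vec(m, [x, y, z, 1.0])
    return cx / cw, cy / cw, cz / cw


n, f = 1.0, 10.0  # arbitrary sample clipping distances
p_o = perspective_to_orthographic(n, f)
on_near = apply_with_divide(p_o, (1.0, 2.0, -n))
on_far = apply_with_divide(p_o, (1.0, 2.0, -f))
print(on_near)  # x and y unchanged, z becomes +n
print(on_far)   # x and y shrink by n / f, z becomes +f
```

Points on the near plane keep their x- and y-coordinates, points on the far plane are squeezed by a factor of $n/f$, and in both cases the z-coordinate comes out positive.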

Assuming the camera is centered on the z-axis, the orthographic projection has $r=-l$ and $t=-b$. Therefore, we can infer the following:

$$
\begin{align*}
r+l &= 0 \\
r-l &= 2r \\
b+t &= 0 \\
b-t &= 2b
\end{align*}
$$

Now, combining the orthographic and perspective-to-orthographic matrices, we get the perspective projection matrix, $P$.

$$
\begin{align*}
P = O P_O &=
\begin{bmatrix}
\frac{2}{r-l} & 0 & 0 & -\frac{r+l}{r-l} \\
0 & \frac{2}{t-b} & 0 & -\frac{t+b}{t-b} \\
0 & 0 & \frac{1}{f-n} & -\frac{n}{f-n} \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
n & 0 & 0 & 0 \\
0 & n & 0 & 0 \\
0 & 0 & -f-n & -fn \\
0 & 0 & -1 & 0
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{n}{r} & 0 & 0 & 0 \\
0 & -\frac{n}{b} & 0 & 0 \\
0 & 0 & -\frac{f}{f-n} & -\frac{fn}{f-n} \\
0 & 0 & -1 & 0
\end{bmatrix}
\end{align*}
$$
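The matrix product is easy to get wrong by hand, so here is a hedged Python sketch that multiplies the two matrices numerically and compares the result against the simplified form. All function names and the sample frustum are hypothetical, and the sketch assumes the symmetric case $r=-l$, $t=-b$ used above.

```python
def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def orthographic(l, r, b, t, n, f):
    """The left-handed orthographic matrix O from the product above."""
    return [
        [2 / (r - l), 0.0, 0.0, -(r + l) / (r - l)],
        [0.0, 2 / (t - b), 0.0, -(t + b) / (t - b)],
        [0.0, 0.0, 1 / (f - n), -n / (f - n)],
        [0.0, 0.0, 0.0, 1.0],
    ]


def perspective_to_orthographic(n, f):
    """The perspective-to-orthographic matrix P_O."""
    return [
        [n, 0.0, 0.0, 0.0],
        [0.0, n, 0.0, 0.0],
        [0.0, 0.0, -f - n, -f * n],
        [0.0, 0.0, -1.0, 0.0],
    ]


def simplified(r, b, n, f):
    """The simplified product, valid only when r = -l and t = -b."""
    return [
        [n / r, 0.0, 0.0, 0.0],
        [0.0, -n / b, 0.0, 0.0],
        [0.0, 0.0, -f / (f - n), -f * n / (f - n)],
        [0.0, 0.0, -1.0, 0.0],
    ]


# Hypothetical symmetric frustum: right = 2, top = 1, n = 1, f = 10.
r, t, n, f = 2.0, 1.0, 1.0, 10.0
product = mat_mul(orthographic(-r, r, -t, t, n, f),
                  perspective_to_orthographic(n, f))
expected = simplified(r, -t, n, f)
max_error = max(abs(product[i][j] - expected[i][j])
                for i in range(4) for j in range(4))
print(max_error)  # should be zero up to floating-point rounding
```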

We’re nearly there, but we can reduce this matrix further. Let’s consider the view frustum again with some additional annotations:

Using the definition of $\tan$, where $\theta$ is the vertical field-of-view angle, and acknowledging that $-b = \frac{h}{2}$ ($b$ is negative), we can construct the following:

$$
\begin{align*}
-b &= n\tan{\frac{\theta}{2}}
\end{align*}
$$

We can specify the aspect ratio as the ratio between the screen width and screen height $\left(a = \frac{w}{h}\right)$. So, we can derive the value for $r$ as:

$$
\begin{align*}
a &= \frac{w}{h} \\
&= \frac{w/2}{h/2} \\
&= \frac{r}{-b} \\
r &= a(-b) \\
\therefore r &= an\tan{\frac{\theta}{2}}
\end{align*}
$$
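In code, these two relations give the near-plane extents directly from the field of view and aspect ratio. The sketch below is only an illustration: `near_plane_extents` is a name I made up, and it assumes the angle is the vertical field of view in radians and that the frustum is symmetric.

```python
import math


def near_plane_extents(fov_y, aspect, n):
    """Return (l, r, b, t) on the near plane for a symmetric frustum.

    fov_y is assumed to be the vertical field of view in radians,
    aspect is width / height, and n > 0 is the near-plane distance.
    """
    top = n * math.tan(fov_y / 2)  # -b = n * tan(theta / 2)
    right = aspect * top           # r = a * n * tan(theta / 2)
    return -right, right, -top, top


# Sample values: 90 degree vertical field of view, 2:1 aspect ratio.
l, r, b, t = near_plane_extents(math.radians(90), 2.0, 1.0)
print(l, r, b, t)  # approximately -2, 2, -1, 1
```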

Therefore, the final perspective projection matrix can be given as:

$$
\begin{align*}
P &=
\begin{bmatrix}
\frac{1}{a\tan{\frac{\theta}{2}}} & 0 & 0 & 0 \\
0 & \frac{1}{\tan{\frac{\theta}{2}}} & 0 & 0 \\
0 & 0 & -\frac{f}{f-n} & -\frac{fn}{f-n} \\
0 & 0 & -1 & 0
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{a\tan{\frac{\theta}{2}}} & 0 & 0 & 0 \\
0 & \frac{1}{\tan{\frac{\theta}{2}}} & 0 & 0 \\
0 & 0 & \frac{f}{n-f} & \frac{fn}{n-f} \\
0 & 0 & -1 & 0
\end{bmatrix}
\end{align*}
$$
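To see the final matrix in action, here is a minimal Python sketch that builds it and pushes a few view-space points through to NDC. The names `perspective` and `to_ndc`, the row-major nested-list layout, and the sample parameters are all my own assumptions for illustration; a real graphics API may expect a different memory layout.

```python
import math


def perspective(fov_y, aspect, n, f):
    """Build the final matrix P; fov_y is the vertical field of view in radians."""
    inv_tan = 1.0 / math.tan(fov_y / 2)
    return [
        [inv_tan / aspect, 0.0, 0.0, 0.0],
        [0.0, inv_tan, 0.0, 0.0],
        [0.0, 0.0, f / (n - f), f * n / (n - f)],
        [0.0, 0.0, -1.0, 0.0],
    ]


def to_ndc(m, point):
    """Transform a view-space point to NDC, including the divide by w."""
    v = (point[0], point[1], point[2], 1.0)
    clip = [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]
    return tuple(c / clip[3] for c in clip[:3])


fov_y, aspect, n, f = math.radians(90), 2.0, 1.0, 10.0  # sample values
p = perspective(fov_y, aspect, n, f)

top = n * math.tan(fov_y / 2)
right = aspect * top
near_corner = to_ndc(p, (right, top, -n))        # top-right of the near plane
far_center = to_ndc(p, (0.0, 0.0, -f))           # center of the far plane
halfway = to_ndc(p, (0.0, 0.0, -(n + f) / 2))    # midway between the planes

print(near_corner)  # approximately (1, 1, 0)
print(far_center)   # approximately (0, 0, 1)
print(halfway)      # depth is roughly 0.91, not 0.5
```

The last point is worth a second look: a vertex halfway between the clipping planes already uses about 91% of the depth range with these sample values, which is one reason distant surfaces are prone to z-fighting.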

And that’s it! With all of these combined, you can create a mapping from 3D space to a perspective-projected space represented in normalized device coordinates. Unfortunately, we still need to do some work in the WebGPU series before we get to see the application, but it is very close!

Footnotes

  1. Frustum is Latin for “morsel” or “piece cut off”. 

  2. Commonly referred to as a skybox

  3. https://en.wikipedia.org/wiki/Z-fighting 

  4. The 60+ FPS dream is dead. 

  5. I had a sign flipped in the view matrix for the longest time when originally figuring these out for myself – it was maddening. 

  6. A trick from geometry that I find keeps popping up all over the place: https://en.wikipedia.org/wiki/Similarity_(geometry)

  7. Shakes left fist at right-handed coordinate system.