allthingsreversed.io

Articles for the allthingsreversed.io blog

View on GitHub

A 29-Year-Old Bug in Quake II

I’ve been working on potatOS - a hobby 64-bit OS kernel that boots on QEMU. One of the recent milestones was getting Quake II running on it, using id Software’s original GPL source code compiled for x86_64 with the software renderer (ref_soft). The game worked surprisingly well - until I walked into a specific area and the whole thing crashed.

The crash

The crash was consistent and reproducible. Every time I entered a particular area of the map, the OS killed the Q2 task with a page fault:

[PAGE FAULT] cr2=0x7 errcode=0x7 RIP=0x480eca RSP=0x3fff8750
[PAGE FAULT] rbp=0x4 rbx=0x5
[PAGE FAULT] rdi=0x511 rsi=0x2ba2bd rdx=0x40000000 rcx=0x2ba2bd
[PAGE FAULT] r8=0xffffe6e9 r9=0x8c r12=0x3fff9598
[PAGE FAULT] User-mode fault — killing task
[SCHEDULER] Task 2 exiting with code 139

CR2=0x7 - that’s a near-null pointer dereference. The error code 0x7 means present + write + user mode. Something is writing to address 0x7 from user space. Let’s find out what.

Disassembly at the crash site

RIP=0x480eca. Let’s look at what’s there:

$ objdump -d dist/userspace/quake2.elf --start-address=0x480da0 --stop-address=0x480ef0

The crash lands inside R_PolygonScanLeftEdge, a function in Q2’s software renderer that fills in horizontal span data for polygon edges:

R_PolygonScanLeftEdge:
  ...
  480dcd: mov    r12, [rip+0x9501bc]     ; dd0f90 <s_polygon_spans>
  ...
  480e60: cvttss2si r9d, xmm0           ; r9d = itop (starting row)
  480e71: cvttss2si edi, xmm1           ; edi = ibottom (ending row)
  ...
  480ebc: mov    rdx, r12               ; rdx = s_polygon_spans (base pointer)
  480ebf: mov    eax, r9d               ; eax = itop
  ; --- inner loop: fill one span per scan line ---
  480ec8: mov    esi, ecx
  480eca: mov    [rdx+4], eax           ; <-- CRASH: write to rdx+4
  480ecd: add    eax, 1
  480ed0: add    ecx, r8d
  480ed3: sar    esi, 0x10
  480ed6: add    rdx, 0x18              ; rdx += sizeof(espan_t)
  480eda: mov    [rdx-0x18], esi
  480edd: cmp    edi, eax
  480edf: jne    480ec8

The loop iterates from itop (r9) to ibottom (edi), writing span entries. Each espan_t is 0x18 = 24 bytes on 64-bit (3 ints + a pointer). Now look at the register values from the crash dump:

The screen resolution is 320x240. A scan line of 1297 makes no sense on a 240-pixel-tall screen. These are projected screen-space vertex coordinates that should have been clipped to the viewport. Something went very wrong during vertex projection.

Tracing the source

The function R_PolygonScanLeftEdge is called from R_DrawPoly (in ref_soft/r_poly.c), which is in turn called from R_ClipAndDrawPoly. The latter clips polygon vertices to the view frustum in world space, then projects them to screen coordinates:

// ref_soft/r_poly.c, R_ClipAndDrawPoly()

// clip to the frustum in worldspace
nump = r_polydesc.nump;
for (i=0 ; i<4 ; i++)
{
    nump = R_ClipPolyFace (nump, &view_clipplanes[i]);
    if (nump < 3) return;
}

// transform vertices into viewspace and project
pv = &r_clip_verts[clip_current][0][0];

for (i=0 ; i<nump ; i++)
{
    VectorSubtract (pv, r_origin, local);
    TransformVector (local, transformed);

    if (transformed[2] < NEAR_CLIP)
        transformed[2] = NEAR_CLIP;

    pout = &outverts[i];
    pout->zi = 1.0 / transformed[2];

    pout->s = pv[3];
    pout->t = pv[4];

    scale = xscale * pout->zi;
    pout->u = (xcenter + scale * transformed[0]);

    scale = yscale * pout->zi;
    pout->v = (ycenter - scale * transformed[1]);

    pv += sizeof (vec5_t) / sizeof (pv);   // <--- !!!!
}

There it is. Line 991. Do you see it?

pv += sizeof (vec5_t) / sizeof (pv);

pv is a float *. vec5_t is an array of 5 floats (20 bytes). The intent is to advance pv by 5 floats to the next vertex. Let’s evaluate:

Platform sizeof(vec5_t) sizeof(pv) Result
32-bit 20 4 5 (correct)
64-bit 20 8 2 (wrong!)

On 32-bit, sizeof(float *) happens to equal sizeof(float) - both are 4 bytes. The division gives 20/4 = 5, and the pointer advances correctly to the next vertex. On 64-bit, sizeof(float *) is 8, so 20/8 = 2 (integer division). The pointer advances by only 2 floats instead of 5, reading overlapping vertex data on each subsequent iteration.

The programmer meant to write sizeof(*pv) (the size of what the pointer points to - a float, 4 bytes) but wrote sizeof(pv) (the size of the pointer itself). This is a textbook 64-bit porting bug.

Why only that area?

R_ClipAndDrawPoly is the rendering path for translucent surfaces - water, glass, and alpha-blended textures. Most of the map renders through a different code path (R_DrawSolidClippedSubmodelPolygons / R_RenderFace). The buggy projection only executes when the engine encounters a transparent surface.

With the pointer advancing by 2 instead of 5, vertex data gets scrambled. In most cases, the garbage coordinates either produce a degenerate polygon that gets rejected by the ymin >= ymax early-out in R_DrawPoly, or the projected coordinates happen to stay within bounds. But in that specific area - likely containing a water surface or glass with the right vertex layout - the corrupted projection produced v-coordinates of 140 to 1297 on a 240px screen. The span writer then iterated 1157 times, writing past the stack-allocated espan_t spans[MAXHEIGHT+1] array (1201 entries x 24 bytes) and past the mapped stack boundary at 0x40000000.

The fix

One character:

-    pv += sizeof (vec5_t) / sizeof (pv);
+    pv += sizeof (vec5_t) / sizeof (*pv);

Every other similar line in the same file already uses the correct form:

for (i=0 ; i<nump ; i++, instep += sizeof (vec5_t) / sizeof (float))

This one instance was simply inconsistent. On 32-bit x86, where Q2 was developed and shipped in 1997, the bug was invisible. sizeof(float *) == sizeof(float) == 4. It took compiling the original, unmodified ref_soft renderer for a 64-bit target to surface it - 29 years later.

Checking if it’s known

Searching online, I couldn’t find this specific bug documented anywhere. Fabien Sanglard’s excellent Quake II code review doesn’t mention it - understandably, since it analyzes the 32-bit codebase where the bug doesn’t manifest. The Yamagi Quake II project - the main 64-bit clean Q2 source port - appears to have rewritten significant portions of the software renderer rather than patching individual issues like this. Their issue #5 (“quake2 crashes going underwater”) describes what is very likely the same root cause, though the issue thread doesn’t identify the specific line.

Modern Q2 source ports almost universally use OpenGL or Vulkan renderers, so nobody runs ref_soft on 64-bit anymore. The original software renderer has been effectively abandoned, and this bug has been silently waiting in r_poly.c line 991 since December 1997.

Takeaway

sizeof(pointer) vs sizeof(*pointer) is a classic 64-bit porting pitfall. On ILP32 platforms (where int, long, and pointers are all 32-bit), many sizeof expressions involving pointers accidentally produce correct results. Move to LP64 (64-bit pointers, 32-bit int) and they silently break.

If you’re ever porting C code from 32-bit to 64-bit, grep for sizeof applied to pointer variables - especially in pointer arithmetic expressions. The compiler won’t warn you. The code will compile cleanly. And it might even work most of the time, crashing only when you walk into the wrong room.