August 10, 2018

Frame Capture Performance: Pixels to Pixels

Export 30 seconds of animation at 60fps. 1800 frames.

Each frame renders in 5ms. Total expected time: 9 seconds.

Actual time: 45 seconds.

The bottleneck: readPixels().

The Problem

Capturing a rendered frame to save as an image:

function captureFrame() {
  surface.flush();  // Ensure rendering completes

  const pixels = surface.readPixels(0, 0, width, height);  // 15ms! ⚠

  const imageData = new ImageData(
    new Uint8ClampedArray(pixels.buffer),
    width, height
  );

  return imageData;
}

That readPixels() call takes 15ms for a 1920×1080 surface.

Rendering the frame: 5ms
Reading it back: 15ms

3× slower to copy pixels than to render them.
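The numbers so far can be sanity-checked with a few lines of arithmetic. This is only a sketch: the constant names are illustrative, and the 5ms and 15ms figures are the measurements quoted above.

```javascript
// Back-of-the-envelope export time, using the per-frame costs quoted above.
const DURATION_S = 30;
const FPS = 60;
const RENDER_MS = 5;
const READBACK_MS = 15;

const totalFrames = DURATION_S * FPS;                                    // 1800
const renderOnlyS = (totalFrames * RENDER_MS) / 1000;                    // 9
const withReadbackS = (totalFrames * (RENDER_MS + READBACK_MS)) / 1000;  // 36

console.log({ totalFrames, renderOnlyS, withReadbackS });
```

Render plus readback accounts for 36 of the 45 seconds. The remainder is presumably spent after the pixels arrive (encoding, saving), which is not broken down here.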

Why readPixels is Slow

readPixels() is a synchronous CPU-GPU transfer:

1. CPU calls readPixels
2. GPU finishes all pending rendering (flush + sync)
3. GPU copies framebuffer to CPU-accessible memory
4. CPU waits for the transfer to complete
5. Function returns with pixel data

Steps 2-4 are a pipeline stall. The CPU sits idle waiting for the GPU.
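Because the call blocks, the stall shows up directly as wall-clock time around it. A minimal sketch for measuring that: timeSync is a hypothetical helper, and a plain array copy stands in for surface.readPixels so the snippet runs without a GPU surface.

```javascript
// Hypothetical helper: time any synchronous call in milliseconds.
// performance.now() is available in browsers, workers, and recent Node.js.
function timeSync(fn) {
  const start = performance.now();
  const result = fn();
  const elapsedMs = performance.now() - start;
  return { result, elapsedMs };
}

// Stand-in for the real readback: a CPU copy of a 1920x1080 RGBA buffer.
// In the capture code you would pass
//   () => surface.readPixels(0, 0, width, height)
const fakeFrame = new Uint8Array(1920 * 1080 * 4);
const { result, elapsedMs } = timeSync(() => fakeFrame.slice());

console.log(`copied ${result.byteLength} bytes in ${elapsedMs.toFixed(2)}ms`);
```

The stand-in only measures a memory copy, so its timing says nothing about the GPU. The point is the measurement pattern, not the number it prints.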

A 1920×1080 frame at 4 bytes per pixel (RGBA) is about 8.3MB, and on a discrete GPU all of it has to cross the PCIe bus. The raw copy is only part of the cost; with the stall included, the whole call lands at 10-15ms.
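The frame size is easy to verify. A small sketch, assuming tightly packed 8-bit RGBA with no row padding (frameBytes is an illustrative name, not part of any API):

```javascript
// Bytes needed for one tightly packed RGBA8 frame (4 bytes per pixel).
function frameBytes(width, height, bytesPerPixel = 4) {
  return width * height * bytesPerPixel;
}

const perFrame = frameBytes(1920, 1080);  // 8,294,400 bytes (~8.3MB)
const totalForExport = perFrame * 1800;   // 14,929,920,000 bytes (~14.9GB)

console.log(perFrame, totalForExport);
```

Across the full 1800-frame export that is roughly 15GB pulled back from the GPU, which is why the per-frame cost matters so much.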

First Attempt: Async readPixels

Try to make it non-blocking:

surface.flush();
// Immediately start next frame while pixels transfer?
surface.readPixelsAsync((pixels) => {
  saveFrame(pixels);
});

But WebGL doesn't have readPixelsAsync. And Skia's readPixels() is inherently synchronous: it returns the pixel data, so it can't return before the data is ready.

Can't make a synchronous API asynchronous without changing the API.

Second Attempt: Double Buffering

Use two surfaces, ping-pong between them:

let surfaces = [surface1, surface2];
let currentIndex = 0;

function renderAndCapture() {
  let renderSurface = surfaces[currentIndex];
  let captureSurface = surfaces[1 - currentIndex];

  // Render to current surface
  renderFrame(renderSurface);
  renderSurface.flush();

  // Capture from previous surface (already flushed last frame)
  const pixels = captureSurface.readPixels(0, 0, width, height);

  // Swap
  currentIndex = 1 - currentIndex;
}

The idea is to overlap GPU rendering (current frame) with CPU readback (previous frame).

But that only works if rendering and readback can actually run in parallel, and WebGL's readPixels still blocks.

It didn't help much. Maybe a 10% improvement.

The Solution: Async Pipeline with Worker Thread

Move pixel processing off the main thread:

const captureWorker = new Worker('capture-worker.js');

let frameNumber = 0;

function renderAndCapture() {
  surface.flush();

  // Read pixels (still blocks, but we'll handle it async)
  const pixels = surface.readPixels(0, 0, width, height);

  // Copy to transferable buffer
  const buffer = pixels.buffer.slice();  // Copy

  // Send to worker (transfer ownership, no copy)
  captureWorker.postMessage({
    frame: frameNumber,
    pixels: buffer,
    width, height
  }, [buffer]);  // Transfer ownership

  frameNumber++;
}

// In worker:
self.onmessage = (e) => {
  const { frame, pixels, width, height } = e.data;

  // Encode to PNG/JPEG (expensive, done off main thread)
  const encoded = encodeToPNG(pixels, width, height);

  // Save or post back
  saveFrame(frame, encoded);
};

Now the encoding (PNG compression, JPEG encoding) happens on a worker thread, freeing the main thread to continue rendering.
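The copy before the transfer matters: transferring an ArrayBuffer detaches it on the sending side. The sketch below shows that behavior without a worker, using structuredClone with a transfer list, which follows the same transfer semantics as postMessage. The two-pixel array is made-up sample data.

```javascript
// Simulate what postMessage(msg, [buffer]) does to the sender's buffer.
const pixels = new Uint8Array([255, 0, 0, 255, 0, 255, 0, 255]); // two RGBA pixels

// Copy first, as in renderAndCapture(), so the source array stays usable.
const copy = pixels.buffer.slice(0);
const received = structuredClone(copy, { transfer: [copy] });

console.log(copy.byteLength);      // 0: the transferred buffer is detached
console.log(received.byteLength);  // 8: the receiver now owns the data
console.log(pixels.byteLength);    // 8: the source array was not touched
```

If the array returned by readPixels is reused by the renderer, or is a view into a larger buffer, transferring it directly would break the next frame. That is an assumption about the surface implementation, so the extra copy is the safe default.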

The Read Optimization

Reduce readPixels call frequency by skipping frames:

// Instead of reading every frame:
if (frameNumber % 2 === 0) {  // Read every other frame
  const pixels = surface.readPixels(0, 0, width, height);
  captureWorker.postMessage({ /* ... */ });
}

For 30fps output, rendering at 60fps and capturing at 30fps is fine—just skip half the frames.

This cut the average readback cost in half: 15ms on every other frame, or 7.5ms per rendered frame.
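The every-other-frame check generalizes to other frame-rate pairs. A minimal sketch, where shouldCapture is a hypothetical helper that assumes the render rate is a whole multiple of the output rate:

```javascript
// Hypothetical helper: decide which rendered frames to read back.
// Assumes renderFps is a whole multiple of outputFps (e.g. 60 and 30).
function shouldCapture(frameNumber, renderFps, outputFps) {
  const stride = renderFps / outputFps;
  if (!Number.isInteger(stride) || stride < 1) {
    throw new RangeError('renderFps must be a whole multiple of outputFps');
  }
  return frameNumber % stride === 0;
}

// 30 seconds rendered at 60fps, captured at 30fps.
let captured = 0;
for (let frame = 0; frame < 1800; frame++) {
  if (shouldCapture(frame, 60, 30)) captured++;
}
console.log(captured); // 900
```

Non-integer ratios (say 60fps down to 24fps) need a different selection rule, so the helper rejects them rather than guessing.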

Results

Frame capture pipeline optimization:

Before: 5ms render + 15ms readPixels = 20ms/frame = 50fps max
After: 5ms render + 7.5ms readPixels (every other frame) + async encoding = 12.5ms/frame = 80fps max

38% faster for 30fps export (render at 60fps, capture at 30fps).
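The before and after figures follow from the per-frame costs. A short sketch of the arithmetic, with illustrative constant names and encoding treated as free on the main thread because it runs in the worker:

```javascript
const RENDER_MS = 5;
const READBACK_MS = 15;
const CAPTURE_EVERY = 2; // read back every other rendered frame

const beforeMs = RENDER_MS + READBACK_MS;                 // 20
const afterMs = RENDER_MS + READBACK_MS / CAPTURE_EVERY;  // 12.5
const beforeFps = 1000 / beforeMs;                        // 50
const afterFps = 1000 / afterMs;                          // 80
const timeSavedPct = (1 - afterMs / beforeMs) * 100;      // 37.5

console.log({ beforeMs, afterMs, beforeFps, afterFps, timeSavedPct });
```

The 38% figure is the 37.5% reduction in time per rendered frame, rounded up. Measured as throughput, 50fps to 80fps is a 60% increase.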

The optimizations:

Double buffering (10% gain)
Worker thread encoding (25% gain by moving work off main thread)
Reduced read frequency (50% gain for frame-skipping scenarios)

Can't eliminate the GPU-CPU transfer cost, but can:

Overlap it with other work
Reduce frequency
Move post-processing off main thread

Sometimes the bottleneck isn't rendering—it's getting the rendered pixels back to the CPU.
