Frame Capture Performance: Pixels to Pixels
Export 30 seconds of animation at 60fps. 1800 frames.
Each frame renders in 5ms. Total expected time: 9 seconds.
Actual time: 45 seconds.
The bottleneck: readPixels().
The Problem
Capturing a rendered frame to save as an image:
function captureFrame() {
  surface.flush(); // Ensure rendering completes
  const pixels = surface.readPixels(0, 0, width, height); // 15ms! ⚠
  const imageData = new ImageData(
    new Uint8ClampedArray(pixels.buffer),
    width, height
  );
  return imageData;
}
That readPixels() call takes 15ms for a 1920×1080 surface.
Rendering the frame: 5ms
Reading it back: 15ms
3× slower to copy pixels than to render them.
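To see how much of the gap readback explains on its own, the per-frame costs quoted above can be multiplied out. This is a minimal sketch; `estimateExportSeconds` is a hypothetical helper, and it assumes every frame pays the same fixed costs.

```javascript
// Hypothetical helper: total export time if every frame pays the same fixed costs.
function estimateExportSeconds(frameCount, perFrameCostsMs) {
  const perFrameMs = perFrameCostsMs.reduce((sum, ms) => sum + ms, 0);
  return (frameCount * perFrameMs) / 1000;
}

const frameCount = 30 * 60; // 30 seconds at 60fps = 1800 frames
const renderOnly = estimateExportSeconds(frameCount, [5]);         // 9
const renderPlusRead = estimateExportSeconds(frameCount, [5, 15]); // 36
console.log(renderOnly, renderPlusRead);
```

Render plus readback accounts for 36 of the 45 seconds measured. The remaining 9 seconds are presumably encoding and saving, which this simple model leaves out.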
Why readPixels is Slow
readPixels() is a synchronous CPU-GPU transfer:
1. CPU calls readPixels
2. GPU finishes all pending rendering (flush + sync)
3. GPU copies framebuffer to CPU-accessible memory
4. CPU waits for the transfer to complete
5. Function returns with pixel data
Steps 2-4 are a pipeline stall. The CPU sits idle waiting for the GPU.
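A 1920×1080 RGBA frame is 1920 × 1080 × 4 bytes ≈ 8.3MB, all of which has to cross from GPU memory to CPU memory (over the PCIe bus on a discrete GPU). Including the synchronization stall, that comes to 10-15ms per frame.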
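The frame size is easy to check, and dividing it by the measured readback time gives the effective end-to-end rate. A small sketch; both helper names are hypothetical, and 1 MB is taken as 10^6 bytes.

```javascript
// Size of one frame in bytes (RGBA8 = 4 bytes per pixel by default).
function frameBytes(width, height, bytesPerPixel = 4) {
  return width * height * bytesPerPixel;
}

// Effective throughput implied by a measured readback time, in MB/s (1 MB = 1e6 bytes).
function impliedThroughputMBps(bytes, elapsedMs) {
  return bytes / 1e6 / (elapsedMs / 1000);
}

const bytesPerFrame = frameBytes(1920, 1080);               // 8294400 bytes, about 8.3MB
const throughput = impliedThroughputMBps(bytesPerFrame, 15); // about 553 MB/s
console.log(bytesPerFrame, throughput.toFixed(0));
```

The implied figure is an end-to-end rate: it includes the time spent waiting for the GPU to finish its pending work, not just the copy itself.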
First Attempt: Async readPixels
Try to make it non-blocking:
surface.flush();
// Immediately start next frame while pixels transfer?
surface.readPixelsAsync((pixels) => { // Wishful thinking: no such method exists
  saveFrame(pixels);
});
But WebGL doesn't have readPixelsAsync. And Skia's readPixels() is inherently synchronous: it returns the pixel data directly, so it can't return before the data is ready.
Can't make a synchronous API asynchronous without changing the API.
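A quick way to convince yourself: wrapping a blocking call in a Promise changes how the result is delivered, not when the work happens. The sketch below uses a hypothetical busy-wait function, `blockingRead`, as a stand-in for readPixels(); it is not Skia or WebGL code.

```javascript
// Hypothetical stand-in for a blocking readback: busy-waits, then returns pixel data.
function blockingRead(durationMs) {
  const end = Date.now() + durationMs;
  while (Date.now() < end) { /* simulate the stall */ }
  return new Uint8Array(4); // placeholder "pixels"
}

// Wrapping the call in a Promise does not move the work anywhere:
// the executor function runs synchronously, inside the Promise constructor.
function readWrapped(durationMs) {
  return new Promise((resolve) => resolve(blockingRead(durationMs)));
}

const startedAt = Date.now();
const pending = readWrapped(20);          // the caller is stuck here for the full stall
const blockedMs = Date.now() - startedAt; // time spent before the Promise came back
pending.then((pixels) => {
  console.log(`blocked ${blockedMs}ms for ${pixels.length} bytes`);
});
```

The caller gets a Promise back, but only after the whole stall has already been paid on the calling thread.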
Second Attempt: Double Buffering
Use two surfaces, ping-pong between them:
const surfaces = [surface1, surface2];
let currentIndex = 0;
function renderAndCapture() {
  const renderSurface = surfaces[currentIndex];
  const captureSurface = surfaces[1 - currentIndex];
  // Render to current surface
  renderFrame(renderSurface);
  renderSurface.flush();
  // Capture from previous surface (already flushed last frame;
  // on the very first call it holds no rendered frame yet)
  const pixels = captureSurface.readPixels(0, 0, width, height);
  // Swap
  currentIndex = 1 - currentIndex;
  return pixels;
}
The idea is to overlap GPU rendering (current frame) with CPU readback (previous frame).
But that only helps if rendering and readback actually run in parallel, and WebGL's readPixels still blocks the calling thread.
Didn't help much. Maybe a 10% improvement.
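A rough model shows why the ceiling was low to begin with. This sketch compares back-to-back stages against an idealized two-stage pipeline, using the timings from this post; the function names are hypothetical and the model ignores all synchronization overhead.

```javascript
// Blocking readback: render and readback run back to back on one thread.
function serialFrameMs(renderMs, readbackMs) {
  return renderMs + readbackMs;
}

// Idealized two-stage pipeline: in steady state each frame costs as much as the slower stage.
function pipelinedFrameMs(renderMs, readbackMs) {
  return Math.max(renderMs, readbackMs);
}

const serialMs = serialFrameMs(5, 15);       // 20
const pipelinedMs = pipelinedFrameMs(5, 15); // 15
const bestCaseSavingPct = ((serialMs - pipelinedMs) / serialMs) * 100; // 25
console.log(serialMs, pipelinedMs, bestCaseSavingPct);
```

Even with perfect overlap, each frame would still cost the 15ms readback, so 25% is the most this approach could save under the model. The roughly 10% observed sits inside that bound.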
The Solution: Async Pipeline with Worker Thread
Move pixel processing off the main thread:
const captureWorker = new Worker('capture-worker.js');
let frameNumber = 0;
function renderAndCapture() {
  surface.flush();
  // Read pixels (still blocks, but we'll handle it async)
  const pixels = surface.readPixels(0, 0, width, height);
  // Copy into a standalone buffer that can be handed off
  const buffer = pixels.buffer.slice(0); // Copy
  // Send to worker (transfer ownership, no second copy)
  captureWorker.postMessage({
    frame: frameNumber,
    pixels: buffer,
    width, height
  }, [buffer]); // Transfer ownership
  frameNumber++;
}
// In worker (capture-worker.js):
self.onmessage = (e) => {
  const { frame, pixels, width, height } = e.data;
  // Encode to PNG/JPEG (expensive, done off main thread)
  const encoded = encodeToPNG(pixels, width, height);
  // Save or post back
  saveFrame(frame, encoded);
};
Now the encoding (PNG compression, JPEG encoding) happens on a worker thread, freeing the main thread to continue rendering.
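The copy-then-transfer step is worth seeing in isolation. In this sketch `structuredClone` with a transfer list stands in for `postMessage`, since both move transferable objects instead of copying them. It assumes a runtime that provides `structuredClone` (modern browsers, Node 17+), and the tiny 2×2 buffer is made up for illustration.

```javascript
// Stand-in for the result of readPixels(): 2x2 RGBA pixels.
const sourcePixels = new Uint8Array(2 * 2 * 4).fill(255);

// slice(0) makes an independent copy, so the original pixel memory stays usable.
const copy = sourcePixels.buffer.slice(0);

// Moving the copy (as the transfer list in postMessage would) detaches it on this side.
const received = structuredClone(copy, { transfer: [copy] });

console.log(sourcePixels.byteLength); // 16: source untouched
console.log(copy.byteLength);         // 0: detached after the transfer
console.log(received.byteLength);     // 16: now owned by the receiver
```

So each frame costs one copy on the main thread (the `slice`), and the hand-off itself moves the buffer without duplicating it.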
The Read Optimization
Reduce how often readPixels is called by skipping frames:
// Instead of reading every frame:
if (frameNumber % 2 === 0) { // Read every other frame
  const pixels = surface.readPixels(0, 0, width, height);
  captureWorker.postMessage({...});
}
For 30fps output, rendering at 60fps and capturing at 30fps is fine: just skip half the frames.
This cut total readback time in half, since readPixels now runs on only every other frame.
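The every-other-frame check generalizes to any render rate that is a whole multiple of the output rate. A minimal sketch; `captureInterval` and `shouldCapture` are hypothetical helpers, not part of any capture API.

```javascript
// How many rendered frames pass per captured frame.
function captureInterval(renderFps, outputFps) {
  if (renderFps % outputFps !== 0) {
    throw new RangeError("renderFps must be a whole multiple of outputFps");
  }
  return renderFps / outputFps;
}

// True for the frames that should be read back and sent to the worker.
function shouldCapture(frameNumber, renderFps, outputFps) {
  return frameNumber % captureInterval(renderFps, outputFps) === 0;
}

const captured = [0, 1, 2, 3, 4, 5].filter((n) => shouldCapture(n, 60, 30));
console.log(captured); // [0, 2, 4]
```

With 60fps rendering and 30fps output the interval is 2, which matches the `frameNumber % 2 === 0` check above.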
Results
Frame capture pipeline optimization:
Before: 5ms render + 15ms readPixels = 20ms/frame = 50fps max
After: 5ms render + 7.5ms readPixels (15ms on every other frame, averaged) + async encoding = 12.5ms/frame = 80fps max
About 38% less time per rendered frame for 30fps export (render at 60fps, capture at 30fps).
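The before/after figures can be reproduced from the per-frame costs. A small sketch with hypothetical helper names, using only the numbers quoted above:

```javascript
// Maximum frames per second for a given per-frame cost.
function maxFps(frameMs) {
  return 1000 / frameMs;
}

// Percentage of per-frame time saved going from beforeMs to afterMs.
function percentLessTime(beforeMs, afterMs) {
  return ((beforeMs - afterMs) / beforeMs) * 100;
}

const beforeMs = 5 + 15;    // render + readPixels on every frame = 20ms
const afterMs = 5 + 15 / 2; // render + readPixels on every other frame = 12.5ms

console.log(maxFps(beforeMs));                   // 50
console.log(maxFps(afterMs));                    // 80
console.log(percentLessTime(beforeMs, afterMs)); // 37.5
```

The same change can be stated two ways: 37.5% less time per frame, or 60% more frames per second (80 versus 50).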
The optimizations:
- Double buffering (10% gain)
- Worker thread encoding (25% gain by moving work off main thread)
- Reduced read frequency (50% gain for frame-skipping scenarios)
Can't eliminate the GPU-CPU transfer cost, but can:
- Overlap it with other work
- Reduce frequency
- Move post-processing off main thread
Sometimes the bottleneck isn't rendering—it's getting the rendered pixels back to the CPU.