Particle Life on GPU CuPy port for long-form renders
Particle Life on GPU CuPy port for long-form renders
render_seed_480_panoramic_gpu.py (337 lines, single
file)Engine: CuPy + one custom CUDA splat kernel
Host:
ai.foxhop.net (RTX 4090, sm_89, 24
GB)Use: render seed-480 particle-life to MP4, MPEG-2 (DVD-Video), or HEVC for BD-R
What this script renders
Particle Life is a 2-D agent toy: N particles, K species, an asymmetric K×K force matrix, periodic torus boundary on both axes. One seed value fully determines the matrix, initial positions, & species assignments, so a given seed gives the same evolving universe every time on every GPU bit-for-bit. We picked seed 480 because its matrix produces a visually active universe that never stalls into a fixed point or limit cycle for hours of simulation time.
Single CuPy file drives four output targets via one MODE
switch. Each preset is a sealed dict at the top of our script — flip the
constant, run, get our target file. No CLI args needed.
Output modes
| mode | resolution | N particles | duration | target file |
|---|---|---|---|---|
| panoramic | 7680 × 2160 | 115 000 | 200 s | h264 .mp4 (~2 GB) |
| dvd_mp4 | 720 × 480 | 2 500 | 7.5 h | h264 .mp4 on data DVD-R |
| dvd_video | 720 × 480 | 2 500 | 2 h | MPEG-2 .mpg → real DVD-Video |
| bd_4k_6h | 3840 × 2160 | 115 000 | 6 h | HEVC .mp4 on BD-R |
dvd_video is our most-tested target: ffmpeg
-target ntsc-dvd produces a spec-compliant MPEG-2 Program
Stream with a silent AC-3 audio track that dvdauthor turns
into a VIDEO_TS/ folder. That folder gets
mkisofs -dvd-video packed into an ISO &
growisofs -dvd-compat -Z burnt to a Verbatim DVD-R blank.
Bootable on cheap DVD players & Xbox 360, verified on both
2026-06-29.
The four optimisations that mattered
Naïve CuPy port of our CPU renderer hit OOM on our 4090 at N=22 000 (peak 19.4 GB of 24 GB). Four changes brought peak down to 2.8 GB at the same N & raised throughput an order of magnitude:
1. Custom CUDA splat kernel — one launch per frame.
The 5×5 soft-particle splat was 25 np.add.at ops in our
CPU code, which became 25 sequential cp.add.at kernel
launches per frame. We replaced our entire splat with one
cp.RawKernel that takes
(pos, buf, kern, N, H, W) & atomically adds the 25
contributions for every particle inside one launch. The whole 720×480
frame splat now costs one kernel dispatch instead of 25.
2. ``cp.fuse`` on the force formula.
The piecewise force f(rn) — repulsive below β,
attractive between β & 1, zero beyond — was three array passes in
our pure-CuPy version. Wrapping it in @cp.fuse() lets CuPy
generate one fused elementwise CUDA kernel that does the whole branch on
one read of rn, Aij_pairs and one write to the
output.
3. Raw bytes straight into ffmpeg's stdin.
Our CPU script wrote PNGs to disk & let ffmpeg read them back. At
540 000 frames × 720×480 grayscale that would have been 186 GB of raw
data on /tmp — wouldn't fit. We open ffmpeg as a subprocess
with stdin=PIPE, send each frame as
frame.tobytes(), & ffmpeg encodes in parallel with our
simulation. Disk usage stays bounded at our final output size (~4 GB for
the 2-hour DVD-Video).
4. CUDA async memory pool + periodic ``free_all_blocks``.
cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool().malloc)
lets the runtime overlap allocation with kernels.
mem_pool.free_all_blocks() fires every 32 frames so peak
allocation doesn't climb monotonically across our 216 000-frame render.
The two together kept us under 3 GB peak across our entire 2-hour
render.
Benchmarks
Wall-clock fps on RTX 4090, seed 480, our four presets:
| mode | resolution | N | wall fps |
|---|---|---|---|
| CPU baseline (panoramic) | 2880 × 1080 | 22 000 |
|
| GPU panoramic | 7680 × 2160 | 115 000 |
|
| GPU dvd_video | 720 × 480 | 2 500 | 326 fps |
The 2-hour DVD-Video (216 000 frames) renders in about 11 minutes wall time on our 4090. The CPU-version equivalent at the same resolution & N would have taken roughly 9 hours.
Simulation kernel (short recap)
Per particle, per frame:
- Spatial hash — bin all N particles by
floor(pos / rmax)into a grid; sort by cell index;cp.searchsortedgives acell_start[cell_id]array in one pass. - Candidate pairs — for each particle, look up its 3×3 neighbour cells; concatenate every particle in those cells as a candidate for force interaction. O(N) instead of O(N²).
- Distance + torus wrap — apply periodic boundary in
both x & y by
d -= S * round(d / S). - Force — fused kernel returns the piecewise force
value; multiply by direction & species-pair coefficient
A[typ_i, typ_j]; scatter back toaccviacp.add.at. - Integrate —
vel = vel * friction + acc * dtthenpos = (pos + vel * dt) % S. - Render — decay buffer, splat 5×5 gaussian per particle, log normalise, emit one uint8 grayscale frame to ffmpeg stdin.
friction = 0.85, dt = 0.6,
fs = 0.8, rmax = 88, beta = 0.3,
K = 5. Buffer decays at 0.80 between frames which gives the
visible motion trails.
Burn pipeline (DVD-Video target)
Once MODE = 'dvd_video' produces
animation.mpg (~3.9 GB), four shell steps cut a bootable
DVD:
# 1. author VIDEO_TS/ from the .mpg
cd $OUT/480_dvd_video
mkdir -p dvd_root
VIDEO_FORMAT=NTSC dvdauthor -o dvd_root -T
VIDEO_FORMAT=NTSC dvdauthor -o dvd_root \
-f animation.mpg -t
# 2. stage extras alongside VIDEO_TS/
mkdir -p dvd_root/EXTRAS
cp -r path/to/source dvd_root/EXTRAS/
# 3. pack ISO
mkisofs -dvd-video -V "PARTICLE_LIFE" \
-r -J -o disc.iso dvd_root/
# 4. burn single-session, finalised
growisofs -dvd-compat -Z /dev/sr0=disc.iso
-dvd-compat finalises our disc on first write — single
session, no multi-session firmware quirks. Tested cheap-DVD-player &
Xbox 360 playback 2026-06-29 on Verbatim MCC 03RG20 (Mitsubishi AZO)
blanks.
Why one file
Whole pipeline lives in 337 lines of one Python file. No build
system, no config files, no separate tokenizer/encoder/renderer split.
cupy & ffmpeg are our only runtime
requirements; dvdauthor / mkisofs /
growisofs only enter our pipeline at burn time, not render
time. Every render mode is one literal dict near the top — to add a 4K
Blu-ray or a vertical-phone aspect, copy & edit one preset.
The renderer & our authoring shell steps together fit on one
printed page. That printed page is in our v2 disc's
EXTRAS/py/ directory.
Full source
Self-contained — only stdlib + numpy +
cupy. No project-local imports. Runtime deps: a
cupy-cudaXX wheel matching our CUDA toolkit & an
ffmpeg binary on $PATH. dvdauthor
/ mkisofs / growisofs only enter for the DVD
burn path, not for render.
#!/usr/bin/env python3
"""GPU port of render_seed_480_panoramic.py using CuPy — optimised.
Streams raw uint8 frames directly into ffmpeg's stdin (no intermediate
raw file on disk) so very long renders are bounded by GPU memory, not
storage. At 540,000 frames × 720×480 the un-streamed raw bytes would
have been 186 GB — would not fit on /tmp.
Two top-level parameter sets at the bottom: one for the
panoramic-pan-piece (4K-wide aspect, short duration) and one for the
disc-fill target (TV native aspect, multi-hour duration). Flip the
`MODE` constant to select.
"""
import os
import subprocess
import sys
from pathlib import Path
import numpy as np
import cupy as cp
try:
cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool().malloc)
except Exception:
pass
# --- which render to do --------------------------------------------------
# 'panoramic' : 7680×2160 × 4000 frames (200 sec, pan-piece, ~2 GB)
# 'dvd_mp4' : 720×480 × 540,000 frames (7.5 hours, data DVD-R 4.7 GB)
# 'dvd_video' : 720×480 × 216,000 frames (2 hours, real DVD-Video, any player)
# 'bd_4k_6h' : 3840×2160 × 432,000 frames (6 hours, fills BD-R 25 GB)
MODE = 'dvd_video'
_PRESETS = {
'panoramic': dict(width=7680, height=2160, N=115000, iters=4000,
crf=23, aspect=None),
# DVD-R MP4 fill (data disc, NOT DVD-Video). ~1.4 Mbps avg h264.
'dvd_mp4': dict(width=720, height=480, N=2500, iters=540_000,
bitrate='1400k', maxrate='2500k',
bufsize='5000k', aspect='16:9'),
# Real DVD-Video — bootable in any DVD player + Xbox 360.
# 720×480 NTSC, ~30 fps, MPEG-2 + silent AC-3 audio, DVD-PS muxer.
# Uses ffmpeg's `-target ntsc-dvd` which fixes codec/bitrate/GOP/mux
# to spec. Output is a .mpg (MPEG-2 Program Stream); dvdauthor
# turns that into the VIDEO_TS/ folder you burn to disc.
# DVD-R single layer = 4.7 GB marketed = 4.38 GiB binary. Target
# final file at ~4.2 GiB so dvdauthor's IFO/BUP overhead still fits.
# `-target ntsc-dvd` defaults to 6 Mbps video + 448 kbps audio
# (5.8 GB total over 2 hr — too big). Audio is SILENT so 192 kbps
# is plenty; the savings go to the video budget.
'dvd_video': dict(width=720, height=480, N=2500,
iters=216_000, fps=30,
dvd_target='ntsc-dvd',
video_bitrate='4400k',
audio_bitrate='192k',
aspect='16:9',
out_ext='.mpg', add_silent_audio=True),
# BD-R fill (~25 GB) at 3840×2160, 6 hr → ~9.3 Mbps avg, HEVC.
'bd_4k_6h': dict(width=3840, height=2160, N=115000, iters=432_000,
bitrate='9000k', maxrate='15000k',
bufsize='30000k', aspect=None, codec='libx265'),
}
PARAMS = _PRESETS[MODE]
# --- splat kernel (one CUDA launch per frame) ----------------------------
_SPLAT_KERNEL = cp.RawKernel(r"""
extern "C" __global__
void splat(const float* __restrict__ pos,
float* __restrict__ buf,
const float* __restrict__ kern,
const int N, const int H, const int W) {
int p = blockIdx.x * blockDim.x + threadIdx.x;
if (p >= N) return;
float px = pos[p * 2 + 1];
float py = pos[p * 2];
int ix = ((int)px % W + W) % W;
int iy = ((int)py % H + H) % H;
#pragma unroll
for (int dy = -2; dy <= 2; ++dy) {
int ky = ((iy + dy) % H + H) % H;
#pragma unroll
for (int dx = -2; dx <= 2; ++dx) {
int kx = ((ix + dx) % W + W) % W;
float w = kern[(dy + 2) * 5 + (dx + 2)];
atomicAdd(&buf[ky * W + kx], w);
}
}
}
""", 'splat')
_KERN_HOST = np.array([
[0.05, 0.20, 0.30, 0.20, 0.05],
[0.20, 0.60, 0.85, 0.60, 0.20],
[0.30, 0.85, 1.00, 0.85, 0.30],
[0.20, 0.60, 0.85, 0.60, 0.20],
[0.05, 0.20, 0.30, 0.20, 0.05],
], dtype=np.float32)
KERN_GPU = cp.asarray(_KERN_HOST.ravel())
_DY_OFFSETS = cp.asarray(np.array([-1, -1, -1, 0, 0, 0, 1, 1, 1],
dtype=np.int32))
_DX_OFFSETS = cp.asarray(np.array([-1, 0, 1, -1, 0, 1, -1, 0, 1],
dtype=np.int32))
def render_points(buf, pos, S):
buf *= 0.80
H, W = int(S[0]), int(S[1])
threads = 256
blocks = (pos.shape[0] + threads - 1) // threads
_SPLAT_KERNEL((blocks,), (threads,),
(pos.astype(cp.float32), buf, KERN_GPU,
pos.shape[0], H, W))
buf_max = float(buf.max())
bright = cp.clip(cp.log1p(buf) / cp.log1p(buf_max + 1e-9) * 255,
0, 255).astype(cp.uint8)
return cp.asnumpy(bright)
@cp.fuse()
def _force_value(rn, Aij_pairs, beta):
rep = rn / beta - 1.0
att = Aij_pairs * (1.0 - cp.abs(2.0 * rn - 1.0 - beta) / (1.0 - beta))
return cp.where(rn < beta, rep, cp.where(rn < 1.0, att, 0.0))
def particle_life_stream(width, height, iters, ffmpeg_stdin,
N=2000, K=5, seed=480):
"""Run the simulation; write each uint8 grayscale frame directly
to the supplied open file/pipe (so encoding can happen in parallel
with simulation and no large raw file is materialised)."""
rng = np.random.default_rng(seed)
S_host = np.array([height, width], dtype=np.float32)
S = cp.asarray(S_host)
pos = cp.asarray(rng.random((N, 2)).astype(np.float32)) * S
vel = cp.zeros((N, 2), dtype=cp.float32)
typ = cp.asarray(rng.integers(0, K, N).astype(np.int32))
A = cp.asarray(rng.uniform(-1, 1, (K, K)).astype(np.float32))
rmax = cp.float32(88.0)
rmax_val = 88.0
beta = cp.float32(0.3)
friction = cp.float32(0.85)
fs = cp.float32(0.8)
dt = cp.float32(0.6)
buf = cp.zeros((height, width), dtype=cp.float32)
cell_size = rmax_val
grid_h = int(np.ceil(height / cell_size))
grid_w = int(np.ceil(width / cell_size))
n_cells = grid_h * grid_w
print(f"Rendering {iters} frames at {width}×{height}, N={N} on GPU...",
flush=True)
mem_pool = cp.get_default_memory_pool()
import time
t_start = time.time()
last_report = t_start
for t in range(iters):
cell_y = (pos[:, 0] / cell_size).astype(cp.int32) % grid_h
cell_x = (pos[:, 1] / cell_size).astype(cp.int32) % grid_w
cell_idx = cell_y * grid_w + cell_x
sort_order = cp.argsort(cell_idx)
cell_idx_sorted = cell_idx[sort_order]
all_cells = cp.arange(n_cells, dtype=cp.int32)
cell_start_all = cp.searchsorted(cell_idx_sorted, all_cells,
side='left')
cell_end_all = cp.searchsorted(cell_idx_sorted, all_cells,
side='right')
cell_start = cp.where(cell_end_all > cell_start_all,
cell_start_all, -1)
cell_end = cell_end_all
ny = (cell_y[:, None] + _DY_OFFSETS[None, :]) % grid_h
nx = (cell_x[:, None] + _DX_OFFSETS[None, :]) % grid_w
neighbour_ids = (ny * grid_w + nx).astype(cp.int32)
flat_ne = neighbour_ids.reshape(-1)
starts = cell_start[flat_ne]
ends = cell_end[flat_ne]
counts = cp.where(starts >= 0, ends - starts, 0).astype(cp.int32)
total = int(counts.sum())
per_particle = counts.reshape(N, 9).sum(axis=1)
idx_i = cp.repeat(cp.arange(N, dtype=cp.int32), per_particle)
cum_counts = cp.concatenate(
[cp.zeros(1, dtype=cp.int64),
cp.cumsum(counts.astype(cp.int64))])
positions = cp.arange(total, dtype=cp.int64)
pair_idx = cp.searchsorted(cum_counts[1:], positions,
side='right').astype(cp.int32)
local_offset = (positions - cum_counts[pair_idx]).astype(cp.int32)
sort_idx = starts[pair_idx] + local_offset
idx_j = sort_order[sort_idx].astype(cp.int32)
keep = idx_i != idx_j
idx_i = idx_i[keep]
idx_j = idx_j[keep]
d = pos[idx_j] - pos[idx_i]
cp.subtract(d[:, 0], S[0] * cp.round(d[:, 0] / S[0]), out=d[:, 0])
cp.subtract(d[:, 1], S[1] * cp.round(d[:, 1] / S[1]), out=d[:, 1])
dist = cp.sqrt((d * d).sum(axis=1))
within = dist <= rmax
idx_i = idx_i[within]
idx_j = idx_j[within]
d = d[within]
dist = dist[within]
rn = dist / rmax
dirv = d / (dist[:, None] + cp.float32(1e-9))
Aij_pairs = A[typ[idx_i], typ[idx_j]]
F_val = _force_value(rn, Aij_pairs, beta)
force_contrib = fs * F_val[:, None] * dirv
acc = cp.zeros((N, 2), dtype=cp.float32)
cp.add.at(acc, idx_i, force_contrib)
vel = vel * friction + acc * dt
pos = (pos + vel * dt) % S
frame_host = render_points(buf, pos, S)
ffmpeg_stdin.write(frame_host.tobytes())
if (t & 0x1F) == 0:
mem_pool.free_all_blocks()
# Progress report every ~10 sec wall clock
now = time.time()
if now - last_report > 10:
elapsed = now - t_start
fps = (t + 1) / elapsed
eta = (iters - t - 1) / fps
print(f" frame {t + 1}/{iters} "
f"({(t + 1) / iters * 100:5.1f}%) "
f"fps={fps:5.1f} "
f"eta={eta / 60:5.1f} min", flush=True)
last_report = now
print(f"Completed {iters} frames in {time.time() - t_start:.0f} sec")
if __name__ == "__main__":
seed = 480
P = PARAMS
width, height = P['width'], P['height']
iters, N = P['iters'], P['N']
outdir = (f"/home/fox/Downloads/py/output/even_more/particle_life/"
f"{seed}_{MODE}")
Path(outdir).mkdir(parents=True, exist_ok=True)
fps = P.get('fps', 20)
out_ext = P.get('out_ext', '.mp4')
mp4_path = Path(outdir) / f"animation{out_ext}"
print(f"=== Particle Life Rendering: MODE={MODE} ===")
print(f"Seed: {seed} Resolution: {width}×{height} "
f"Duration: {iters / fps:.1f} sec ({iters} frames @ {fps} fps)")
print(f"N: {N} GPU: "
f"{cp.cuda.runtime.getDeviceProperties(0)['name'].decode()}")
print()
# ffmpeg command construction. Two distinct flavours: real DVD-Video
# (-target ntsc-dvd, MPEG-2 PS, silent AC-3 audio track) vs the
# h264/x265 .mp4 path. Both stream video frames from stdin so no
# giant raw file is materialised.
if P.get('dvd_target'):
# DVD-Video pipeline — outputs a .mpg you feed to dvdauthor.
cmd = [
'ffmpeg', '-y',
'-f', 'rawvideo', '-pix_fmt', 'gray',
'-s', f'{width}x{height}',
'-framerate', str(fps),
'-i', '-',
]
if P.get('add_silent_audio'):
cmd += ['-f', 'lavfi',
'-i', 'anullsrc=channel_layout=stereo:sample_rate=48000']
cmd += ['-target', P['dvd_target']]
# Override video bitrate that -target ntsc-dvd would default to.
# The DVD-Video max combined bitrate is 10.08 Mbps; we deliberately
# stay well under so we hit a specific final file size.
if P.get('video_bitrate'):
cmd += ['-b:v', P['video_bitrate']]
if P.get('audio_bitrate'):
cmd += ['-b:a', P['audio_bitrate']]
if P.get('aspect'):
cmd += ['-aspect', P['aspect']]
cmd += ['-shortest', str(mp4_path)]
else:
cmd = [
'ffmpeg', '-y',
'-f', 'rawvideo', '-pix_fmt', 'gray',
'-s', f'{width}x{height}',
'-framerate', str(fps),
'-i', '-',
'-c:v', P.get('codec', 'libx264'),
'-pix_fmt', 'yuv420p',
'-preset', 'medium',
]
if P.get('bitrate'):
cmd += ['-b:v', P['bitrate']]
if P.get('maxrate'):
cmd += ['-maxrate', P['maxrate'], '-bufsize', P['bufsize']]
elif P.get('crf') is not None:
cmd += ['-crf', str(P['crf'])]
if P.get('aspect'):
cmd += ['-aspect', P['aspect']]
cmd += [str(mp4_path)]
print(f"ffmpeg cmd: {' '.join(cmd)}")
print()
ff = subprocess.Popen(cmd, stdin=subprocess.PIPE,
stderr=subprocess.DEVNULL)
try:
particle_life_stream(width, height, iters, ff.stdin,
N=N, seed=seed)
finally:
ff.stdin.close()
ff.wait()
sz = mp4_path.stat().st_size if mp4_path.exists() else 0
print(f"\nMP4 saved: {mp4_path} ({sz / 1024**3:.2f} GB, "
f"{sz / 1024**2:.1f} MB)")
print(f"\n=== Complete ===")Related pages
- 3-GPU mesh — hardware context for the 4090 this script targets.
fox@neoblanka:~/git/uncloseai-cli$
'-preset', 'medium', ] if P.get('bitrate'): cmd += ['-b:v', P['bitrate']] if P.get('maxrate'): cmd += ['-maxrate', P['maxrate'], '-bufsize', P['bufsize']] elif P.get('crf') is not None: cmd += ['-crf', str(P['crf'])] if P.get('aspect'): cmd += ['-aspect', P['aspect']] cmd += [str(mp4_path)]
print(f"ffmpeg cmd: {' '.join(cmd)}") print() ff = subprocess.Popen(cmd, stdin=subprocess.PIPE, stderr=subprocess.DEVNULL) try: particle_life_stream(width, height, iters, ff.stdin, N=N, seed=seed) finally: ff.stdin.close() ff.wait()
sz = mp4_path.stat().st_size if mp4_path.exists() else 0 print(f"nMP4 saved: {mp4_path} ({sz / 1024*3:.2f} GB, " f"{sz / 1024*2:.1f} MB)") print(f"n=== Complete ===")
Related pages
- 3-GPU mesh — hardware context for the 4090 this script targets.
Remarkbox
Comments