Particle Life on GPU CuPy port for long-form renders Source: https://foxhop.net/0b6d520b-7414-11f1-9565-040140774501/particle-life-on-gpu-cupy-port-for-long-form-renders Snapshot: 2026-06-30T04:04:07Z Generator: Remarkbox dba4024 This is a thread snapshot. The living document lives at the source URI above — it may have been edited, extended, or replied-to since. [Scan for living source] Particle Life on GPU CuPy port for long-form renders Code: render_seed_480_panoramic_gpu.py (337 lines, single file) Engine: CuPy + one custom CUDA splat kernel Host: ai.foxhop.net (RTX 4090, sm_89, 24 GB) Use: render seed-480 particle-life to MP4, MPEG-2 (DVD-Video), or HEVC for BD-R ------------------------------------------------------------------------ ------------------------------------------------------------------------ What this script renders Particle Life is a 2-D agent toy: N particles, K species, an asymmetric K×K force matrix, periodic torus boundary on both axes. One seed value fully determines the matrix, initial positions, & species assignments, so a given seed gives the same evolving universe every time on every GPU bit-for-bit. We picked seed 480 because its matrix produces a visually active universe that never stalls into a fixed point or limit cycle for hours of simulation time. Single CuPy file drives four output targets via one MODE switch. Each preset is a sealed dict at the top of our script — flip the constant, run, get our target file. No CLI args needed. ------------------------------------------------------------------------ Output modes mode resolution N particles duration target file ----------- ------------- ------------- ---------- ------------------------------ panoramic 7680 × 2160 115 000 200 s h264 .mp4 (~2 GB) dvd_mp4 720 × 480 2 500 7.5 h h264 .mp4 on data DVD-R dvd_video 720 × 480 2 500 2 h MPEG-2 .mpg → real DVD-Video bd_4k_6h 3840 × 2160 115 000 6 h HEVC .mp4 on BD-R dvd_video is our most-tested target: ffmpeg -target ntsc-dvd produces a spec-compliant MPEG-2 Program Stream with a silent AC-3 audio track that dvdauthor turns into a VIDEO_TS/ folder. That folder gets mkisofs -dvd-video packed into an ISO & growisofs -dvd-compat -Z burnt to a Verbatim DVD-R blank. Bootable on cheap DVD players & Xbox 360, verified on both 2026-06-29. ------------------------------------------------------------------------ The four optimisations that mattered Naïve CuPy port of our CPU renderer hit OOM on our 4090 at N=22 000 (peak 19.4 GB of 24 GB). Four changes brought peak down to 2.8 GB at the same N & raised throughput an order of magnitude: 1. Custom CUDA splat kernel — one launch per frame. The 5×5 soft-particle splat was 25 np.add.at ops in our CPU code, which became 25 sequential cp.add.at kernel launches per frame. We replaced our entire splat with one cp.RawKernel that takes (pos, buf, kern, N, H, W) & atomically adds the 25 contributions for every particle inside one launch. The whole 720×480 frame splat now costs one kernel dispatch instead of 25. 2. ``cp.fuse`` on the force formula. The piecewise force f(rn) — repulsive below β, attractive between β & 1, zero beyond — was three array passes in our pure-CuPy version. Wrapping it in @cp.fuse() lets CuPy generate one fused elementwise CUDA kernel that does the whole branch on one read of rn, Aij_pairs and one write to the output. 3. Raw bytes straight into ffmpeg's stdin. Our CPU script wrote PNGs to disk & let ffmpeg read them back. At 540 000 frames × 720×480 grayscale that would have been 186 GB of raw data on /tmp — wouldn't fit. We open ffmpeg as a subprocess with stdin=PIPE, send each frame as frame.tobytes(), & ffmpeg encodes in parallel with our simulation. Disk usage stays bounded at our final output size (~4 GB for the 2-hour DVD-Video). 4. CUDA async memory pool + periodic ``free_all_blocks``. cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool().malloc) lets the runtime overlap allocation with kernels. mem_pool.free_all_blocks() fires every 32 frames so peak allocation doesn't climb monotonically across our 216 000-frame render. The two together kept us under 3 GB peak across our entire 2-hour render. ------------------------------------------------------------------------ Benchmarks Wall-clock fps on RTX 4090, seed 480, our four presets: +----------------+----------------+----------------+----------------+ | mode | resolution | N | wall fps | +================+================+================+================+ | CPU baseline | 2880 × 1080 | 22 000 | 6 – 7 fps | | (panoramic) | | | | +----------------+----------------+----------------+----------------+ | GPU panoramic | 7680 × 2160 | 115 000 | 33 fps | +----------------+----------------+----------------+----------------+ | GPU dvd_video | 720 × 480 | 2 500 | 326 fps | +----------------+----------------+----------------+----------------+ The 2-hour DVD-Video (216 000 frames) renders in about 11 minutes wall time on our 4090. The CPU-version equivalent at the same resolution & N would have taken roughly 9 hours. ------------------------------------------------------------------------ Simulation kernel (short recap) Per particle, per frame: 1. Spatial hash — bin all N particles by floor(pos / rmax) into a grid; sort by cell index; cp.searchsorted gives a cell_start[cell_id] array in one pass. 2. Candidate pairs — for each particle, look up its 3×3 neighbour cells; concatenate every particle in those cells as a candidate for force interaction. O(N) instead of O(N²). 3. Distance + torus wrap — apply periodic boundary in both x & y by d -= S * round(d / S). 4. Force — fused kernel returns the piecewise force value; multiply by direction & species-pair coefficient A[typ_i, typ_j]; scatter back to acc via cp.add.at. 5. Integrate — vel = vel * friction + acc * dt then pos = (pos + vel * dt) % S. 6. Render — decay buffer, splat 5×5 gaussian per particle, log normalise, emit one uint8 grayscale frame to ffmpeg stdin. friction = 0.85, dt = 0.6, fs = 0.8, rmax = 88, beta = 0.3, K = 5. Buffer decays at 0.80 between frames which gives the visible motion trails. ------------------------------------------------------------------------ Burn pipeline (DVD-Video target) Once MODE = 'dvd_video' produces animation.mpg (~3.9 GB), four shell steps cut a bootable DVD: # 1. author VIDEO_TS/ from the .mpg cd $OUT/480_dvd_video mkdir -p dvd_root VIDEO_FORMAT=NTSC dvdauthor -o dvd_root -T VIDEO_FORMAT=NTSC dvdauthor -o dvd_root \ -f animation.mpg -t # 2. stage extras alongside VIDEO_TS/ mkdir -p dvd_root/EXTRAS cp -r path/to/source dvd_root/EXTRAS/ # 3. pack ISO mkisofs -dvd-video -V "PARTICLE_LIFE" \ -r -J -o disc.iso dvd_root/ # 4. burn single-session, finalised growisofs -dvd-compat -Z /dev/sr0=disc.iso -dvd-compat finalises our disc on first write — single session, no multi-session firmware quirks. Tested cheap-DVD-player & Xbox 360 playback 2026-06-29 on Verbatim MCC 03RG20 (Mitsubishi AZO) blanks. ------------------------------------------------------------------------ Why one file Whole pipeline lives in 337 lines of one Python file. No build system, no config files, no separate tokenizer/encoder/renderer split. cupy & ffmpeg are our only runtime requirements; dvdauthor / mkisofs / growisofs only enter our pipeline at burn time, not render time. Every render mode is one literal dict near the top — to add a 4K Blu-ray or a vertical-phone aspect, copy & edit one preset. The renderer & our authoring shell steps together fit on one printed page. That printed page is in our v2 disc's EXTRAS/py/ directory. ------------------------------------------------------------------------ Full source Self-contained — only stdlib + numpy + cupy. No project-local imports. Runtime deps: a cupy-cudaXX wheel matching our CUDA toolkit & an ffmpeg binary on $PATH. dvdauthor / mkisofs / growisofs only enter for the DVD burn path, not for render. #!/usr/bin/env python3 """GPU port of render_seed_480_panoramic.py using CuPy — optimised. Streams raw uint8 frames directly into ffmpeg's stdin (no intermediate raw file on disk) so very long renders are bounded by GPU memory, not storage. At 540,000 frames × 720×480 the un-streamed raw bytes would have been 186 GB — would not fit on /tmp. Two top-level parameter sets at the bottom: one for the panoramic-pan-piece (4K-wide aspect, short duration) and one for the disc-fill target (TV native aspect, multi-hour duration). Flip the `MODE` constant to select. """ import os import subprocess import sys from pathlib import Path import numpy as np import cupy as cp try: cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool().malloc) except Exception: pass # --- which render to do -------------------------------------------------- # 'panoramic' : 7680×2160 × 4000 frames (200 sec, pan-piece, ~2 GB) # 'dvd_mp4' : 720×480 × 540,000 frames (7.5 hours, data DVD-R 4.7 GB) # 'dvd_video' : 720×480 × 216,000 frames (2 hours, real DVD-Video, any player) # 'bd_4k_6h' : 3840×2160 × 432,000 frames (6 hours, fills BD-R 25 GB) MODE = 'dvd_video' _PRESETS = { 'panoramic': dict(width=7680, height=2160, N=115000, iters=4000, crf=23, aspect=None), # DVD-R MP4 fill (data disc, NOT DVD-Video). ~1.4 Mbps avg h264. 'dvd_mp4': dict(width=720, height=480, N=2500, iters=540_000, bitrate='1400k', maxrate='2500k', bufsize='5000k', aspect='16:9'), # Real DVD-Video — bootable in any DVD player + Xbox 360. # 720×480 NTSC, ~30 fps, MPEG-2 + silent AC-3 audio, DVD-PS muxer. # Uses ffmpeg's `-target ntsc-dvd` which fixes codec/bitrate/GOP/mux # to spec. Output is a .mpg (MPEG-2 Program Stream); dvdauthor # turns that into the VIDEO_TS/ folder you burn to disc. # DVD-R single layer = 4.7 GB marketed = 4.38 GiB binary. Target # final file at ~4.2 GiB so dvdauthor's IFO/BUP overhead still fits. # `-target ntsc-dvd` defaults to 6 Mbps video + 448 kbps audio # (5.8 GB total over 2 hr — too big). Audio is SILENT so 192 kbps # is plenty; the savings go to the video budget. 'dvd_video': dict(width=720, height=480, N=2500, iters=216_000, fps=30, dvd_target='ntsc-dvd', video_bitrate='4400k', audio_bitrate='192k', aspect='16:9', out_ext='.mpg', add_silent_audio=True), # BD-R fill (~25 GB) at 3840×2160, 6 hr → ~9.3 Mbps avg, HEVC. 'bd_4k_6h': dict(width=3840, height=2160, N=115000, iters=432_000, bitrate='9000k', maxrate='15000k', bufsize='30000k', aspect=None, codec='libx265'), } PARAMS = _PRESETS[MODE] # --- splat kernel (one CUDA launch per frame) ---------------------------- _SPLAT_KERNEL = cp.RawKernel(r""" extern "C" __global__ void splat(const float* __restrict__ pos, float* __restrict__ buf, const float* __restrict__ kern, const int N, const int H, const int W) { int p = blockIdx.x * blockDim.x + threadIdx.x; if (p >= N) return; float px = pos[p * 2 + 1]; float py = pos[p * 2]; int ix = ((int)px % W + W) % W; int iy = ((int)py % H + H) % H; #pragma unroll for (int dy = -2; dy <= 2; ++dy) { int ky = ((iy + dy) % H + H) % H; #pragma unroll for (int dx = -2; dx <= 2; ++dx) { int kx = ((ix + dx) % W + W) % W; float w = kern[(dy + 2) * 5 + (dx + 2)]; atomicAdd(&buf[ky * W + kx], w); } } } """, 'splat') _KERN_HOST = np.array([ [0.05, 0.20, 0.30, 0.20, 0.05], [0.20, 0.60, 0.85, 0.60, 0.20], [0.30, 0.85, 1.00, 0.85, 0.30], [0.20, 0.60, 0.85, 0.60, 0.20], [0.05, 0.20, 0.30, 0.20, 0.05], ], dtype=np.float32) KERN_GPU = cp.asarray(_KERN_HOST.ravel()) _DY_OFFSETS = cp.asarray(np.array([-1, -1, -1, 0, 0, 0, 1, 1, 1], dtype=np.int32)) _DX_OFFSETS = cp.asarray(np.array([-1, 0, 1, -1, 0, 1, -1, 0, 1], dtype=np.int32)) def render_points(buf, pos, S): buf *= 0.80 H, W = int(S[0]), int(S[1]) threads = 256 blocks = (pos.shape[0] + threads - 1) // threads _SPLAT_KERNEL((blocks,), (threads,), (pos.astype(cp.float32), buf, KERN_GPU, pos.shape[0], H, W)) buf_max = float(buf.max()) bright = cp.clip(cp.log1p(buf) / cp.log1p(buf_max + 1e-9) * 255, 0, 255).astype(cp.uint8) return cp.asnumpy(bright) @cp.fuse() def _force_value(rn, Aij_pairs, beta): rep = rn / beta - 1.0 att = Aij_pairs * (1.0 - cp.abs(2.0 * rn - 1.0 - beta) / (1.0 - beta)) return cp.where(rn < beta, rep, cp.where(rn < 1.0, att, 0.0)) def particle_life_stream(width, height, iters, ffmpeg_stdin, N=2000, K=5, seed=480): """Run the simulation; write each uint8 grayscale frame directly to the supplied open file/pipe (so encoding can happen in parallel with simulation and no large raw file is materialised).""" rng = np.random.default_rng(seed) S_host = np.array([height, width], dtype=np.float32) S = cp.asarray(S_host) pos = cp.asarray(rng.random((N, 2)).astype(np.float32)) * S vel = cp.zeros((N, 2), dtype=cp.float32) typ = cp.asarray(rng.integers(0, K, N).astype(np.int32)) A = cp.asarray(rng.uniform(-1, 1, (K, K)).astype(np.float32)) rmax = cp.float32(88.0) rmax_val = 88.0 beta = cp.float32(0.3) friction = cp.float32(0.85) fs = cp.float32(0.8) dt = cp.float32(0.6) buf = cp.zeros((height, width), dtype=cp.float32) cell_size = rmax_val grid_h = int(np.ceil(height / cell_size)) grid_w = int(np.ceil(width / cell_size)) n_cells = grid_h * grid_w print(f"Rendering {iters} frames at {width}×{height}, N={N} on GPU...", flush=True) mem_pool = cp.get_default_memory_pool() import time t_start = time.time() last_report = t_start for t in range(iters): cell_y = (pos[:, 0] / cell_size).astype(cp.int32) % grid_h cell_x = (pos[:, 1] / cell_size).astype(cp.int32) % grid_w cell_idx = cell_y * grid_w + cell_x sort_order = cp.argsort(cell_idx) cell_idx_sorted = cell_idx[sort_order] all_cells = cp.arange(n_cells, dtype=cp.int32) cell_start_all = cp.searchsorted(cell_idx_sorted, all_cells, side='left') cell_end_all = cp.searchsorted(cell_idx_sorted, all_cells, side='right') cell_start = cp.where(cell_end_all > cell_start_all, cell_start_all, -1) cell_end = cell_end_all ny = (cell_y[:, None] + _DY_OFFSETS[None, :]) % grid_h nx = (cell_x[:, None] + _DX_OFFSETS[None, :]) % grid_w neighbour_ids = (ny * grid_w + nx).astype(cp.int32) flat_ne = neighbour_ids.reshape(-1) starts = cell_start[flat_ne] ends = cell_end[flat_ne] counts = cp.where(starts >= 0, ends - starts, 0).astype(cp.int32) total = int(counts.sum()) per_particle = counts.reshape(N, 9).sum(axis=1) idx_i = cp.repeat(cp.arange(N, dtype=cp.int32), per_particle) cum_counts = cp.concatenate( [cp.zeros(1, dtype=cp.int64), cp.cumsum(counts.astype(cp.int64))]) positions = cp.arange(total, dtype=cp.int64) pair_idx = cp.searchsorted(cum_counts[1:], positions, side='right').astype(cp.int32) local_offset = (positions - cum_counts[pair_idx]).astype(cp.int32) sort_idx = starts[pair_idx] + local_offset idx_j = sort_order[sort_idx].astype(cp.int32) keep = idx_i != idx_j idx_i = idx_i[keep] idx_j = idx_j[keep] d = pos[idx_j] - pos[idx_i] cp.subtract(d[:, 0], S[0] * cp.round(d[:, 0] / S[0]), out=d[:, 0]) cp.subtract(d[:, 1], S[1] * cp.round(d[:, 1] / S[1]), out=d[:, 1]) dist = cp.sqrt((d * d).sum(axis=1)) within = dist <= rmax idx_i = idx_i[within] idx_j = idx_j[within] d = d[within] dist = dist[within] rn = dist / rmax dirv = d / (dist[:, None] + cp.float32(1e-9)) Aij_pairs = A[typ[idx_i], typ[idx_j]] F_val = _force_value(rn, Aij_pairs, beta) force_contrib = fs * F_val[:, None] * dirv acc = cp.zeros((N, 2), dtype=cp.float32) cp.add.at(acc, idx_i, force_contrib) vel = vel * friction + acc * dt pos = (pos + vel * dt) % S frame_host = render_points(buf, pos, S) ffmpeg_stdin.write(frame_host.tobytes()) if (t & 0x1F) == 0: mem_pool.free_all_blocks() # Progress report every ~10 sec wall clock now = time.time() if now - last_report > 10: elapsed = now - t_start fps = (t + 1) / elapsed eta = (iters - t - 1) / fps print(f" frame {t + 1}/{iters} " f"({(t + 1) / iters * 100:5.1f}%) " f"fps={fps:5.1f} " f"eta={eta / 60:5.1f} min", flush=True) last_report = now print(f"Completed {iters} frames in {time.time() - t_start:.0f} sec") if __name__ == "__main__": seed = 480 P = PARAMS width, height = P['width'], P['height'] iters, N = P['iters'], P['N'] outdir = (f"/home/fox/Downloads/py/output/even_more/particle_life/" f"{seed}_{MODE}") Path(outdir).mkdir(parents=True, exist_ok=True) fps = P.get('fps', 20) out_ext = P.get('out_ext', '.mp4') mp4_path = Path(outdir) / f"animation{out_ext}" print(f"=== Particle Life Rendering: MODE={MODE} ===") print(f"Seed: {seed} Resolution: {width}×{height} " f"Duration: {iters / fps:.1f} sec ({iters} frames @ {fps} fps)") print(f"N: {N} GPU: " f"{cp.cuda.runtime.getDeviceProperties(0)['name'].decode()}") print() # ffmpeg command construction. Two distinct flavours: real DVD-Video # (-target ntsc-dvd, MPEG-2 PS, silent AC-3 audio track) vs the # h264/x265 .mp4 path. Both stream video frames from stdin so no # giant raw file is materialised. if P.get('dvd_target'): # DVD-Video pipeline — outputs a .mpg you feed to dvdauthor. cmd = [ 'ffmpeg', '-y', '-f', 'rawvideo', '-pix_fmt', 'gray', '-s', f'{width}x{height}', '-framerate', str(fps), '-i', '-', ] if P.get('add_silent_audio'): cmd += ['-f', 'lavfi', '-i', 'anullsrc=channel_layout=stereo:sample_rate=48000'] cmd += ['-target', P['dvd_target']] # Override video bitrate that -target ntsc-dvd would default to. # The DVD-Video max combined bitrate is 10.08 Mbps; we deliberately # stay well under so we hit a specific final file size. if P.get('video_bitrate'): cmd += ['-b:v', P['video_bitrate']] if P.get('audio_bitrate'): cmd += ['-b:a', P['audio_bitrate']] if P.get('aspect'): cmd += ['-aspect', P['aspect']] cmd += ['-shortest', str(mp4_path)] else: cmd = [ 'ffmpeg', '-y', '-f', 'rawvideo', '-pix_fmt', 'gray', '-s', f'{width}x{height}', '-framerate', str(fps), '-i', '-', '-c:v', P.get('codec', 'libx264'), '-pix_fmt', 'yuv420p', '-preset', 'medium', ] if P.get('bitrate'): cmd += ['-b:v', P['bitrate']] if P.get('maxrate'): cmd += ['-maxrate', P['maxrate'], '-bufsize', P['bufsize']] elif P.get('crf') is not None: cmd += ['-crf', str(P['crf'])] if P.get('aspect'): cmd += ['-aspect', P['aspect']] cmd += [str(mp4_path)] print(f"ffmpeg cmd: {' '.join(cmd)}") print() ff = subprocess.Popen(cmd, stdin=subprocess.PIPE, stderr=subprocess.DEVNULL) try: particle_life_stream(width, height, iters, ff.stdin, N=N, seed=seed) finally: ff.stdin.close() ff.wait() sz = mp4_path.stat().st_size if mp4_path.exists() else 0 print(f"\nMP4 saved: {mp4_path} ({sz / 1024**3:.2f} GB, " f"{sz / 1024**2:.1f} MB)") print(f"\n=== Complete ===") ------------------------------------------------------------------------ Related pages - 3-GPU mesh — hardware context for the 4090 this script targets. ------------------------------------------------------------------------ Updated 2026-06-29. fox@neoblanka:~/git/uncloseai-cli$ '-preset', 'medium', ] if P.get('bitrate'): cmd += ['-b:v', P['bitrate']] if P.get('maxrate'): cmd += ['-maxrate', P['maxrate'], '-bufsize', P['bufsize']] elif P.get('crf') is not None: cmd += ['-crf', str(P['crf'])] if P.get('aspect'): cmd += ['-aspect', P['aspect']] cmd += [str(mp4_path)] print(f"ffmpeg cmd: {' '.join(cmd)}") print() ff = subprocess.Popen(cmd, stdin=subprocess.PIPE, stderr=subprocess.DEVNULL) try: particle_life_stream(width, height, iters, ff.stdin, N=N, seed=seed) finally: ff.stdin.close() ff.wait() sz = mp4_path.stat().st_size if mp4_path.exists() else 0 print(f"nMP4 saved: {mp4_path} ({sz / 1024*3:.2f} GB, " f"{sz / 1024*2:.1f} MB)") print(f"n=== Complete ===") ------------------------------------------------------------------------ Related pages - 3-GPU mesh — hardware context for the 4090 this script targets. ------------------------------------------------------------------------ Source: https://foxhop.net/0b6d520b-7414-11f1-9565-040140774501/particle-life-on-gpu-cupy-port-for-long-form-renders Snapshot: 2026-06-30T04:04:07Z Generator: Remarkbox dba4024