Particle Life on GPU CuPy port for long-form renders


  Source:
  https://foxhop.net/0b6d520b-7414-11f1-9565-040140774501/particle-life-on-gpu-cupy-port-for-long-form-renders
  Snapshot: 2026-06-30T04:04:07Z
  Generator: Remarkbox dba4024

  This is a thread snapshot. The living document lives at the source URI
  above — it may have been edited, extended, or replied-to since.

[Scan for living source]

Particle Life on GPU CuPy port for long-form renders

Code: render_seed_480_panoramic_gpu.py (337 lines, single file)
Engine: CuPy + one custom CUDA splat kernel
Host: ai.foxhop.net (RTX 4090, sm_89, 24 GB)
Use: render seed-480 particle-life to MP4, MPEG-2 (DVD-Video), or HEVC
for BD-R

------------------------------------------------------------------------

------------------------------------------------------------------------

What this script renders

Particle Life is a 2-D agent toy: N particles, K species, an asymmetric
K×K force matrix, periodic torus boundary on both axes. One seed value
fully determines the matrix, initial positions, & species assignments,
so a given seed gives the same evolving universe every time on every GPU
bit-for-bit. We picked seed 480 because its matrix produces a visually
active universe that never stalls into a fixed point or limit cycle for
hours of simulation time.

Single CuPy file drives four output targets via one MODE switch. Each
preset is a sealed dict at the top of our script — flip the constant,
run, get our target file. No CLI args needed.

------------------------------------------------------------------------

Output modes

  mode        resolution    N particles   duration   target file
  ----------- ------------- ------------- ---------- ------------------------------
  panoramic   7680 × 2160   115 000       200 s      h264 .mp4 (~2 GB)
  dvd_mp4     720 × 480     2 500         7.5 h      h264 .mp4 on data DVD-R
  dvd_video   720 × 480     2 500         2 h        MPEG-2 .mpg → real DVD-Video
  bd_4k_6h    3840 × 2160   115 000       6 h        HEVC .mp4 on BD-R

dvd_video is our most-tested target: ffmpeg -target ntsc-dvd produces a
spec-compliant MPEG-2 Program Stream with a silent AC-3 audio track that
dvdauthor turns into a VIDEO_TS/ folder. That folder gets
mkisofs -dvd-video packed into an ISO & growisofs -dvd-compat -Z burnt
to a Verbatim DVD-R blank. Bootable on cheap DVD players & Xbox 360,
verified on both 2026-06-29.

------------------------------------------------------------------------

The four optimisations that mattered

Naïve CuPy port of our CPU renderer hit OOM on our 4090 at N=22 000
(peak 19.4 GB of 24 GB). Four changes brought peak down to 2.8 GB at the
same N & raised throughput an order of magnitude:

1. Custom CUDA splat kernel — one launch per frame.

The 5×5 soft-particle splat was 25 np.add.at ops in our CPU code, which
became 25 sequential cp.add.at kernel launches per frame. We replaced
our entire splat with one cp.RawKernel that takes
(pos, buf, kern, N, H, W) & atomically adds the 25 contributions for
every particle inside one launch. The whole 720×480 frame splat now
costs one kernel dispatch instead of 25.

2. ``cp.fuse`` on the force formula.

The piecewise force f(rn) — repulsive below β, attractive between β & 1,
zero beyond — was three array passes in our pure-CuPy version. Wrapping
it in @cp.fuse() lets CuPy generate one fused elementwise CUDA kernel
that does the whole branch on one read of rn, Aij_pairs and one write to
the output.

3. Raw bytes straight into ffmpeg's stdin.

Our CPU script wrote PNGs to disk & let ffmpeg read them back. At 540
000 frames × 720×480 grayscale that would have been 186 GB of raw data
on /tmp — wouldn't fit. We open ffmpeg as a subprocess with stdin=PIPE,
send each frame as frame.tobytes(), & ffmpeg encodes in parallel with
our simulation. Disk usage stays bounded at our final output size (~4 GB
for the 2-hour DVD-Video).

4. CUDA async memory pool + periodic ``free_all_blocks``.

cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool().malloc) lets the runtime
overlap allocation with kernels. mem_pool.free_all_blocks() fires every
32 frames so peak allocation doesn't climb monotonically across our 216
000-frame render. The two together kept us under 3 GB peak across our
entire 2-hour render.

------------------------------------------------------------------------

Benchmarks

Wall-clock fps on RTX 4090, seed 480, our four presets:

+----------------+----------------+----------------+----------------+
| mode           | resolution     | N              | wall fps       |
+================+================+================+================+
| CPU baseline   | 2880 × 1080    | 22 000         |   6 – 7 fps    |
| (panoramic)    |                |                |                |
+----------------+----------------+----------------+----------------+
| GPU panoramic  | 7680 × 2160    | 115 000        |   33 fps       |
+----------------+----------------+----------------+----------------+
| GPU dvd_video  | 720 × 480      | 2 500          | 326 fps        |
+----------------+----------------+----------------+----------------+

The 2-hour DVD-Video (216 000 frames) renders in about 11 minutes wall
time on our 4090. The CPU-version equivalent at the same resolution & N
would have taken roughly 9 hours.

------------------------------------------------------------------------

Simulation kernel (short recap)

Per particle, per frame:

1.  Spatial hash — bin all N particles by floor(pos / rmax) into a grid;
    sort by cell index; cp.searchsorted gives a cell_start[cell_id]
    array in one pass.
2.  Candidate pairs — for each particle, look up its 3×3 neighbour
    cells; concatenate every particle in those cells as a candidate for
    force interaction. O(N) instead of O(N²).
3.  Distance + torus wrap — apply periodic boundary in both x & y by
    d -= S * round(d / S).
4.  Force — fused kernel returns the piecewise force value; multiply by
    direction & species-pair coefficient A[typ_i, typ_j]; scatter back
    to acc via cp.add.at.
5.  Integrate — vel = vel * friction + acc * dt then
    pos = (pos + vel * dt) % S.
6.  Render — decay buffer, splat 5×5 gaussian per particle, log
    normalise, emit one uint8 grayscale frame to ffmpeg stdin.

friction = 0.85, dt = 0.6, fs = 0.8, rmax = 88, beta = 0.3, K = 5.
Buffer decays at 0.80 between frames which gives the visible motion
trails.

------------------------------------------------------------------------

Burn pipeline (DVD-Video target)

Once MODE = 'dvd_video' produces animation.mpg (~3.9 GB), four shell
steps cut a bootable DVD:

    # 1. author VIDEO_TS/ from the .mpg
    cd $OUT/480_dvd_video
    mkdir -p dvd_root
    VIDEO_FORMAT=NTSC dvdauthor -o dvd_root -T
    VIDEO_FORMAT=NTSC dvdauthor -o dvd_root \
        -f animation.mpg -t

    # 2. stage extras alongside VIDEO_TS/
    mkdir -p dvd_root/EXTRAS
    cp -r path/to/source dvd_root/EXTRAS/

    # 3. pack ISO
    mkisofs -dvd-video -V "PARTICLE_LIFE" \
        -r -J -o disc.iso dvd_root/

    # 4. burn single-session, finalised
    growisofs -dvd-compat -Z /dev/sr0=disc.iso

-dvd-compat finalises our disc on first write — single session, no
multi-session firmware quirks. Tested cheap-DVD-player & Xbox 360
playback 2026-06-29 on Verbatim MCC 03RG20 (Mitsubishi AZO) blanks.

------------------------------------------------------------------------

Why one file

Whole pipeline lives in 337 lines of one Python file. No build system,
no config files, no separate tokenizer/encoder/renderer split. cupy &
ffmpeg are our only runtime requirements; dvdauthor / mkisofs /
growisofs only enter our pipeline at burn time, not render time. Every
render mode is one literal dict near the top — to add a 4K Blu-ray or a
vertical-phone aspect, copy & edit one preset.

The renderer & our authoring shell steps together fit on one printed
page. That printed page is in our v2 disc's EXTRAS/py/ directory.

------------------------------------------------------------------------

Full source

Self-contained — only stdlib + numpy + cupy. No project-local imports.
Runtime deps: a cupy-cudaXX wheel matching our CUDA toolkit & an ffmpeg
binary on $PATH. dvdauthor / mkisofs / growisofs only enter for the DVD
burn path, not for render.

    #!/usr/bin/env python3
    """GPU port of render_seed_480_panoramic.py using CuPy — optimised.

    Streams raw uint8 frames directly into ffmpeg's stdin (no intermediate
    raw file on disk) so very long renders are bounded by GPU memory, not
    storage. At 540,000 frames × 720×480 the un-streamed raw bytes would
    have been 186 GB — would not fit on /tmp.

    Two top-level parameter sets at the bottom: one for the
    panoramic-pan-piece (4K-wide aspect, short duration) and one for the
    disc-fill target (TV native aspect, multi-hour duration).  Flip the
    `MODE` constant to select.
    """
    import os
    import subprocess
    import sys
    from pathlib import Path
    import numpy as np
    import cupy as cp


    try:
        cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool().malloc)
    except Exception:
        pass


    # --- which render to do --------------------------------------------------
    # 'panoramic'  : 7680×2160 × 4000 frames  (200 sec, pan-piece, ~2 GB)
    # 'dvd_mp4'    : 720×480 × 540,000 frames (7.5 hours, data DVD-R 4.7 GB)
    # 'dvd_video'  : 720×480 × 216,000 frames (2 hours, real DVD-Video, any player)
    # 'bd_4k_6h'   : 3840×2160 × 432,000 frames (6 hours, fills BD-R 25 GB)
    MODE = 'dvd_video'


    _PRESETS = {
        'panoramic': dict(width=7680, height=2160, N=115000, iters=4000,
                          crf=23, aspect=None),
        # DVD-R MP4 fill (data disc, NOT DVD-Video).  ~1.4 Mbps avg h264.
        'dvd_mp4':   dict(width=720,  height=480,  N=2500,   iters=540_000,
                          bitrate='1400k', maxrate='2500k',
                          bufsize='5000k', aspect='16:9'),
        # Real DVD-Video — bootable in any DVD player + Xbox 360.
        # 720×480 NTSC, ~30 fps, MPEG-2 + silent AC-3 audio, DVD-PS muxer.
        # Uses ffmpeg's `-target ntsc-dvd` which fixes codec/bitrate/GOP/mux
        # to spec.  Output is a .mpg (MPEG-2 Program Stream); dvdauthor
        # turns that into the VIDEO_TS/ folder you burn to disc.
        # DVD-R single layer = 4.7 GB marketed = 4.38 GiB binary.  Target
        # final file at ~4.2 GiB so dvdauthor's IFO/BUP overhead still fits.
        # `-target ntsc-dvd` defaults to 6 Mbps video + 448 kbps audio
        # (5.8 GB total over 2 hr — too big).  Audio is SILENT so 192 kbps
        # is plenty; the savings go to the video budget.
        'dvd_video': dict(width=720, height=480, N=2500,
                          iters=216_000, fps=30,
                          dvd_target='ntsc-dvd',
                          video_bitrate='4400k',
                          audio_bitrate='192k',
                          aspect='16:9',
                          out_ext='.mpg', add_silent_audio=True),
        # BD-R fill (~25 GB) at 3840×2160, 6 hr → ~9.3 Mbps avg, HEVC.
        'bd_4k_6h':  dict(width=3840, height=2160, N=115000, iters=432_000,
                          bitrate='9000k', maxrate='15000k',
                          bufsize='30000k', aspect=None, codec='libx265'),
    }

    PARAMS = _PRESETS[MODE]


    # --- splat kernel (one CUDA launch per frame) ----------------------------
    _SPLAT_KERNEL = cp.RawKernel(r"""
    extern "C" __global__
    void splat(const float* __restrict__ pos,
               float* __restrict__ buf,
               const float* __restrict__ kern,
               const int N, const int H, const int W) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= N) return;
        float px = pos[p * 2 + 1];
        float py = pos[p * 2];
        int ix = ((int)px % W + W) % W;
        int iy = ((int)py % H + H) % H;
        #pragma unroll
        for (int dy = -2; dy <= 2; ++dy) {
            int ky = ((iy + dy) % H + H) % H;
            #pragma unroll
            for (int dx = -2; dx <= 2; ++dx) {
                int kx = ((ix + dx) % W + W) % W;
                float w = kern[(dy + 2) * 5 + (dx + 2)];
                atomicAdd(&buf[ky * W + kx], w);
            }
        }
    }
    """, 'splat')


    _KERN_HOST = np.array([
        [0.05, 0.20, 0.30, 0.20, 0.05],
        [0.20, 0.60, 0.85, 0.60, 0.20],
        [0.30, 0.85, 1.00, 0.85, 0.30],
        [0.20, 0.60, 0.85, 0.60, 0.20],
        [0.05, 0.20, 0.30, 0.20, 0.05],
    ], dtype=np.float32)
    KERN_GPU = cp.asarray(_KERN_HOST.ravel())

    _DY_OFFSETS = cp.asarray(np.array([-1, -1, -1, 0, 0, 0, 1, 1, 1],
                                      dtype=np.int32))
    _DX_OFFSETS = cp.asarray(np.array([-1, 0, 1, -1, 0, 1, -1, 0, 1],
                                      dtype=np.int32))


    def render_points(buf, pos, S):
        buf *= 0.80
        H, W = int(S[0]), int(S[1])
        threads = 256
        blocks = (pos.shape[0] + threads - 1) // threads
        _SPLAT_KERNEL((blocks,), (threads,),
                      (pos.astype(cp.float32), buf, KERN_GPU,
                       pos.shape[0], H, W))
        buf_max = float(buf.max())
        bright = cp.clip(cp.log1p(buf) / cp.log1p(buf_max + 1e-9) * 255,
                         0, 255).astype(cp.uint8)
        return cp.asnumpy(bright)


    @cp.fuse()
    def _force_value(rn, Aij_pairs, beta):
        rep = rn / beta - 1.0
        att = Aij_pairs * (1.0 - cp.abs(2.0 * rn - 1.0 - beta) / (1.0 - beta))
        return cp.where(rn < beta, rep, cp.where(rn < 1.0, att, 0.0))


    def particle_life_stream(width, height, iters, ffmpeg_stdin,
                              N=2000, K=5, seed=480):
        """Run the simulation; write each uint8 grayscale frame directly
        to the supplied open file/pipe (so encoding can happen in parallel
        with simulation and no large raw file is materialised)."""
        rng = np.random.default_rng(seed)
        S_host = np.array([height, width], dtype=np.float32)
        S = cp.asarray(S_host)

        pos = cp.asarray(rng.random((N, 2)).astype(np.float32)) * S
        vel = cp.zeros((N, 2), dtype=cp.float32)
        typ = cp.asarray(rng.integers(0, K, N).astype(np.int32))
        A = cp.asarray(rng.uniform(-1, 1, (K, K)).astype(np.float32))

        rmax = cp.float32(88.0)
        rmax_val = 88.0
        beta = cp.float32(0.3)
        friction = cp.float32(0.85)
        fs = cp.float32(0.8)
        dt = cp.float32(0.6)

        buf = cp.zeros((height, width), dtype=cp.float32)
        cell_size = rmax_val
        grid_h = int(np.ceil(height / cell_size))
        grid_w = int(np.ceil(width / cell_size))
        n_cells = grid_h * grid_w

        print(f"Rendering {iters} frames at {width}×{height}, N={N} on GPU...",
              flush=True)
        mem_pool = cp.get_default_memory_pool()

        import time
        t_start = time.time()
        last_report = t_start

        for t in range(iters):
            cell_y = (pos[:, 0] / cell_size).astype(cp.int32) % grid_h
            cell_x = (pos[:, 1] / cell_size).astype(cp.int32) % grid_w
            cell_idx = cell_y * grid_w + cell_x

            sort_order = cp.argsort(cell_idx)
            cell_idx_sorted = cell_idx[sort_order]

            all_cells = cp.arange(n_cells, dtype=cp.int32)
            cell_start_all = cp.searchsorted(cell_idx_sorted, all_cells,
                                              side='left')
            cell_end_all = cp.searchsorted(cell_idx_sorted, all_cells,
                                            side='right')
            cell_start = cp.where(cell_end_all > cell_start_all,
                                  cell_start_all, -1)
            cell_end = cell_end_all

            ny = (cell_y[:, None] + _DY_OFFSETS[None, :]) % grid_h
            nx = (cell_x[:, None] + _DX_OFFSETS[None, :]) % grid_w
            neighbour_ids = (ny * grid_w + nx).astype(cp.int32)
            flat_ne = neighbour_ids.reshape(-1)
            starts = cell_start[flat_ne]
            ends = cell_end[flat_ne]
            counts = cp.where(starts >= 0, ends - starts, 0).astype(cp.int32)

            total = int(counts.sum())

            per_particle = counts.reshape(N, 9).sum(axis=1)
            idx_i = cp.repeat(cp.arange(N, dtype=cp.int32), per_particle)

            cum_counts = cp.concatenate(
                [cp.zeros(1, dtype=cp.int64),
                 cp.cumsum(counts.astype(cp.int64))])
            positions = cp.arange(total, dtype=cp.int64)
            pair_idx = cp.searchsorted(cum_counts[1:], positions,
                                        side='right').astype(cp.int32)
            local_offset = (positions - cum_counts[pair_idx]).astype(cp.int32)
            sort_idx = starts[pair_idx] + local_offset
            idx_j = sort_order[sort_idx].astype(cp.int32)

            keep = idx_i != idx_j
            idx_i = idx_i[keep]
            idx_j = idx_j[keep]

            d = pos[idx_j] - pos[idx_i]
            cp.subtract(d[:, 0], S[0] * cp.round(d[:, 0] / S[0]), out=d[:, 0])
            cp.subtract(d[:, 1], S[1] * cp.round(d[:, 1] / S[1]), out=d[:, 1])
            dist = cp.sqrt((d * d).sum(axis=1))

            within = dist <= rmax
            idx_i = idx_i[within]
            idx_j = idx_j[within]
            d = d[within]
            dist = dist[within]

            rn = dist / rmax
            dirv = d / (dist[:, None] + cp.float32(1e-9))

            Aij_pairs = A[typ[idx_i], typ[idx_j]]
            F_val = _force_value(rn, Aij_pairs, beta)
            force_contrib = fs * F_val[:, None] * dirv

            acc = cp.zeros((N, 2), dtype=cp.float32)
            cp.add.at(acc, idx_i, force_contrib)

            vel = vel * friction + acc * dt
            pos = (pos + vel * dt) % S

            frame_host = render_points(buf, pos, S)
            ffmpeg_stdin.write(frame_host.tobytes())

            if (t & 0x1F) == 0:
                mem_pool.free_all_blocks()

            # Progress report every ~10 sec wall clock
            now = time.time()
            if now - last_report > 10:
                elapsed = now - t_start
                fps = (t + 1) / elapsed
                eta = (iters - t - 1) / fps
                print(f"  frame {t + 1}/{iters}  "
                      f"({(t + 1) / iters * 100:5.1f}%)  "
                      f"fps={fps:5.1f}  "
                      f"eta={eta / 60:5.1f} min", flush=True)
                last_report = now

        print(f"Completed {iters} frames in {time.time() - t_start:.0f} sec")


    if __name__ == "__main__":
        seed = 480
        P = PARAMS
        width, height = P['width'], P['height']
        iters, N = P['iters'], P['N']
        outdir = (f"/home/fox/Downloads/py/output/even_more/particle_life/"
                  f"{seed}_{MODE}")
        Path(outdir).mkdir(parents=True, exist_ok=True)
        fps = P.get('fps', 20)
        out_ext = P.get('out_ext', '.mp4')
        mp4_path = Path(outdir) / f"animation{out_ext}"

        print(f"=== Particle Life Rendering: MODE={MODE} ===")
        print(f"Seed: {seed}   Resolution: {width}×{height}   "
              f"Duration: {iters / fps:.1f} sec ({iters} frames @ {fps} fps)")
        print(f"N: {N}   GPU: "
              f"{cp.cuda.runtime.getDeviceProperties(0)['name'].decode()}")
        print()

        # ffmpeg command construction.  Two distinct flavours: real DVD-Video
        # (-target ntsc-dvd, MPEG-2 PS, silent AC-3 audio track) vs the
        # h264/x265 .mp4 path.  Both stream video frames from stdin so no
        # giant raw file is materialised.
        if P.get('dvd_target'):
            # DVD-Video pipeline — outputs a .mpg you feed to dvdauthor.
            cmd = [
                'ffmpeg', '-y',
                '-f', 'rawvideo', '-pix_fmt', 'gray',
                '-s', f'{width}x{height}',
                '-framerate', str(fps),
                '-i', '-',
            ]
            if P.get('add_silent_audio'):
                cmd += ['-f', 'lavfi',
                        '-i', 'anullsrc=channel_layout=stereo:sample_rate=48000']
            cmd += ['-target', P['dvd_target']]
            # Override video bitrate that -target ntsc-dvd would default to.
            # The DVD-Video max combined bitrate is 10.08 Mbps; we deliberately
            # stay well under so we hit a specific final file size.
            if P.get('video_bitrate'):
                cmd += ['-b:v', P['video_bitrate']]
            if P.get('audio_bitrate'):
                cmd += ['-b:a', P['audio_bitrate']]
            if P.get('aspect'):
                cmd += ['-aspect', P['aspect']]
            cmd += ['-shortest', str(mp4_path)]
        else:
            cmd = [
                'ffmpeg', '-y',
                '-f', 'rawvideo', '-pix_fmt', 'gray',
                '-s', f'{width}x{height}',
                '-framerate', str(fps),
                '-i', '-',
                '-c:v', P.get('codec', 'libx264'),
                '-pix_fmt', 'yuv420p',
                '-preset', 'medium',
            ]
            if P.get('bitrate'):
                cmd += ['-b:v', P['bitrate']]
                if P.get('maxrate'):
                    cmd += ['-maxrate', P['maxrate'], '-bufsize', P['bufsize']]
            elif P.get('crf') is not None:
                cmd += ['-crf', str(P['crf'])]
            if P.get('aspect'):
                cmd += ['-aspect', P['aspect']]
            cmd += [str(mp4_path)]

        print(f"ffmpeg cmd: {' '.join(cmd)}")
        print()
        ff = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                              stderr=subprocess.DEVNULL)
        try:
            particle_life_stream(width, height, iters, ff.stdin,
                                 N=N, seed=seed)
        finally:
            ff.stdin.close()
            ff.wait()

        sz = mp4_path.stat().st_size if mp4_path.exists() else 0
        print(f"\nMP4 saved: {mp4_path} ({sz / 1024**3:.2f} GB, "
              f"{sz / 1024**2:.1f} MB)")
        print(f"\n=== Complete ===")

------------------------------------------------------------------------

Related pages

-   3-GPU mesh — hardware context for the 4090 this script targets.

------------------------------------------------------------------------

Updated 2026-06-29.

fox@neoblanka:~/git/uncloseai-cli$

  '-preset', 'medium', ] if P.get('bitrate'): cmd += ['-b:v',
  P['bitrate']] if P.get('maxrate'): cmd += ['-maxrate', P['maxrate'],
  '-bufsize', P['bufsize']] elif P.get('crf') is not None: cmd +=
  ['-crf', str(P['crf'])] if P.get('aspect'): cmd += ['-aspect',
  P['aspect']] cmd += [str(mp4_path)]

  print(f"ffmpeg cmd: {' '.join(cmd)}") print() ff =
  subprocess.Popen(cmd, stdin=subprocess.PIPE,
  stderr=subprocess.DEVNULL) try: particle_life_stream(width, height,
  iters, ff.stdin, N=N, seed=seed) finally: ff.stdin.close() ff.wait()

  sz = mp4_path.stat().st_size if mp4_path.exists() else 0 print(f"nMP4
  saved: {mp4_path} ({sz / 1024*3:.2f} GB, " f"{sz / 1024*2:.1f} MB)")
  print(f"n=== Complete ===")

------------------------------------------------------------------------

Related pages

-   3-GPU mesh — hardware context for the 4090 this script targets.

------------------------------------------------------------------------

Source:
https://foxhop.net/0b6d520b-7414-11f1-9565-040140774501/particle-life-on-gpu-cupy-port-for-long-form-renders
Snapshot: 2026-06-30T04:04:07Z
Generator: Remarkbox dba4024