gpu-mesh Source: https://foxhop.net/053ba7da-6277-11f1-82fc-040140774501/gpu-mesh Snapshot: 2026-06-07T17:39:18Z Generator: Remarkbox 50b9d1e This is a thread snapshot. The living document lives at the source URI above — it may have been edited, extended, or replied-to since. [Scan for living source] 3-GPU Mesh — Hardware & Benchmarks Status: 3-node mesh live on port 8320 (BEND) Wire substrate: lumbda's bend primitive (lumbda.com) Coordinator pattern: cost-routed fan-out via greedy bin-pack ------------------------------------------------------------------------ ------------------------------------------------------------------------ What this page tracks Three Nvidia GPUs across three hosts on our foxhop LAN serve as a bend mesh — lumbda's CUDA dispatch primitive routes work to whichever node finishes fastest. This page documents our hardware, our published benchmarks, & our cost-routed coordinator pattern. For our ecdsa research context that drove our first 18-cell bench (refined Bernstein-Yang point-add variant sweep at p=251), see foxhop secp256k1 lever sweep. ------------------------------------------------------------------------ Hardware host GPU VRAM arch -------------------- ----------- ------- ------- 3090-ai.foxhop.net RTX 3090 24 GB sm_86 ai.foxhop.net RTX 4090 24 GB sm_89 cammy.foxhop.net Tesla P40 24 GB sm_61 72 GB combined VRAM across our pool. Pascal (sm_61) P40 lacks our newer architectures' reduced-precision tensor units, so it runs ~2× slower than Ampere 3090 on identical workloads — useful as a third concurrent stream rather than a faster replacement for either Ampere card. ------------------------------------------------------------------------ Port 8320 — BEND Our pool serves bend dispatches on port 8320 across every host. Mnemonic — 8320 spells BEND: 8 ~= B (implied infinity B flattened; bake a cake; baby & me) 3 ~= E (backward) 2 ~= N (pivoted 90 degrees) 0 ~= D (flattened) Each node runs gpu-worker.lsp (lumbda's reference CUDA worker) against demo_ops, blake3-fanout, radix-sort, & our other registered CUDA forms. Clients reach our mesh via lumbda's bend primitive over TCP, S-expression wire format, & per-form binary protocols (BSHK for shake fanout, BSCP for batched scalar mul, BSRT for radix sort, etc. — see lumbda.com/bend for our form catalog). Our bend wire stays LAN-only; we do not expose our mesh outside our foxhop network. ------------------------------------------------------------------------ Benchmark — 6 variants × 3 hosts at p=251 Our first published benchmark dispatched 6 ecdsa point-add variants (Bernstein-Yang lever sweep) at p=251 to every host through our wire protocol. Σ Toffoli matches byte-for-byte across hosts on identical ops.bin inputs — proves our distributed mesh runs from one canonical lumbda environment. GPU wall (ms/batch) at 1024 shots, lower wins: +------------+------------+------------+------------+------------+ | variant | Σ Tof / | 3090 ms | 4090 ms | P40 ms | | | shot | | | | +============+============+============+============+============+ | v-fermat | 167 984 | 290.1 | 149.2 | 531.7 | | - | | | | | | schoolbook | | | | | +------------+------------+------------+------------+------------+ | v-fer | 84 208 | 151.4 | 86.3 | 288.6 | | m | | | | | | at-solinas | | | | | +------------+------------+------------+------------+------------+ | v-by-text | 45 104 | 68.4 | 39.3 | 131.8 | | - | | | | | | schoolbook | | | | | +------------+------------+------------+------------+------------+ | v-by-t | 40 176 | 60.6 | 35.6 | 119.3 | | e | | | | | | xt-solinas | | | | | +------------+------------+------------+------------+------------+ | v-by-ref | 29 104 | 32.5 | 25.7 | 85.1 | | - | | | | | | schoolbook | | | | | +------------+------------+------------+------------+------------+ | v-by- | 24 176 | 26.5 | 22.0 | 64.0 | | r | | | | | | ef-solinas | | | | | +------------+------------+------------+------------+------------+ Per-architecture pattern reads cleanly: - Ampere 4090 (sm_89) runs 1.94× faster than 3090 (sm_86) on our heaviest circuit, shrinking to 1.20× on our lightest. Larger circuits amortize kernel-launch overhead; smaller circuits hit our GPU's overhead floor. - Pascal P40 (sm_61) runs ~2× slower than 3090 across our entire variant chain. Older arch, lower base clock, no reduced-precision tensor units; still serves a third concurrent stream as our candidate queue deepens. GPU wall scales linearly with Σ Toffoli — no kernel-level surprise. Cross-host Σ Toffoli identity confirms bit-exact reproducibility across our pool. ------------------------------------------------------------------------ Coordinator fan-out A cost-routed coordinator dispatches candidates across our 3-node mesh concurrently. Greedy bin-pack by predicted Σ Toffoli; each candidate lands on whichever node's finish time after add stays lowest. Per-host speed factor derived from our bench above: - 3090: factor 1.00 (baseline) - 4090: factor 0.51 - P40: factor 1.84 One worker thread per node fires concurrently against our wire protocol. Each node serializes its own queue (workers do not multiplex requests cleanly inside one CUDA context). First-run finding: cost model based only on (predicted Σ Toffoli × host factor) misses our per-request overhead floor (~600 ms on cammy for demo_ops process spawn + Pascal CUDA init). Refinement options: a per-host startup constant plus a Σ Toffoli linear term, fit per node from our sweep history. Coordinator runs correctly today; schedule shape needs calibration, not algorithmic change. ------------------------------------------------------------------------ Pool capacity math Coordinator at saturation. Three nodes scoring three different candidates simultaneously closes our sequential 6-variant 3090 wall (629 ms) into a parallel one. Refined-Solinas runs in ~25 ms on 3090 + 22 ms on 4090 + 64 ms on P40 — P40 lags but contributes a third concurrent stream as our candidate queue deepens. Per-host routing (heavy circuits to faster GPUs, light circuits to P40) maximizes aggregate dispatch. Pool throughput grows with our candidate generation rate — see our ecdsa research page's "Why CPU Produces & GPU Scores" section for our substrate-level math on candidate-emit vs candidate-score throughput. ------------------------------------------------------------------------ Related pages - foxhop secp256k1 lever sweep — research context; full 18-cell cross-scale Pareto sweep; lumbda emit pipeline lift. - lumbda.com/bend — bend primitive catalog; wire protocol; CUDA form list. ------------------------------------------------------------------------ Updated 2026-06-07. ------------------------------------------------------------------------ Source: https://foxhop.net/053ba7da-6277-11f1-82fc-040140774501/gpu-mesh Snapshot: 2026-06-07T17:39:18Z Generator: Remarkbox 50b9d1e