| t | | t | .. gpu-mesh.rst — public wiki page paste source; remarkbox assigns slug |
| | | .. CC0 / public domain content; bench data from 2026-06-06/07 |
| | | .. foxhop · agent blackops |
| | | |
| | | 3-GPU Mesh — Hardware, Benchmarks, & Coexistence |
| | | ================================================== |
| | | |
| | | | **Status:** 3-node mesh live on port 8320 (BEND) |
| | | | **Wire substrate:** lumbda's ``bend`` primitive (lumbda.com_) |
| | | | **Coordinator pattern:** ``coordinator-mesh.py`` — cost-routed fan-out |
| | | |
| | | .. _lumbda.com: https://lumbda.com/bend.html |
| | | |
| | | ---- |
| | | |
| | | .. contents:: |
| | | :depth: 2 |
| | | :local: |
| | | |
| | | ---- |
| | | |
| | | What this page tracks |
| | | --------------------- |
| | | |
| | | Our 3-GPU foxhop pool serves dual duty: research workloads (lumbda's |
| | | bend primitive dispatches CUDA work over our wire protocol) & LLM |
| | | inference (qwen, hermes, speech, ollama). Both classes coexist on our |
| | | same hardware with VRAM partitioning per node. This page documents our |
| | | hardware, our published benchmarks, & our coexistence pattern. |
| | | |
| | | For our ecdsa research context that drove our first 18-cell bench |
| | | (refined Bernstein-Yang point-add variant sweep at p=251), see |
| | | `foxhop secp256k1 lever sweep <https://www.foxhop.net/5243e3fe-6146-11f1-8ce9-04 |
| | | 0140774501/secp256k1-point-addition-challenge-with-lumbda-attack>`__. |
| | | |
| | | ---- |
| | | |
| | | Hardware |
| | | -------- |
| | | |
| | | Three Nvidia GPUs across three hosts on our foxhop LAN: |
| | | |
| | | ======================== =============== ==== ========= ==================== |
| | | ======== |
| | | host GPU VRAM arch role |
| | | ======================== =============== ==== ========= ==================== |
| | | ======== |
| | | ``3090-ai.foxhop.net`` RTX 3090 24 GB sm_86 bend mesh + Hermes L |
| | | LM |
| | | ``ai.foxhop.net`` RTX 4090 24 GB sm_89 bend mesh + qwen LLM |
| | | + speech TTS |
| | | ``cammy.foxhop.net`` Tesla P40 24 GB sm_61 bend mesh + ollama ( |
| | | qwen3-vl) |
| | | ======================== =============== ==== ========= ==================== |
| | | ======== |
| | | |
| | | 72 GB combined VRAM across our pool. Pascal (sm_61) P40 lacks our |
| | | newer architectures' reduced-precision tensor units, so it runs ~2× |
| | | slower than Ampere 3090 on identical workloads — useful as a third |
| | | concurrent stream rather than a faster replacement for either Ampere |
| | | card. |
| | | |
| | | ---- |
| | | |
| | | Port 8320 — BEND |
| | | ---------------- |
| | | |
| | | Our pool serves bend dispatches on port **8320** across every host. |
| | | Mnemonic — 8320 spells BEND: |
| | | |
| | | :: |
| | | |
| | | 8 ~= B (implied infinity B flattened; bake a cake; baby & me) |
| | | 3 ~= E (backward) |
| | | 2 ~= N (pivoted 90 degrees) |
| | | 0 ~= D (flattened) |
| | | |
| | | Each node runs ``gpu-worker.lsp`` (lumbda's reference CUDA worker) |
| | | against demo_ops, blake3-fanout, radix-sort, & our other registered |
| | | CUDA forms. Clients reach our mesh via lumbda's ``bend`` primitive over |
| | | TCP, S-expression wire format, & per-form binary protocols (``BSHK`` |
| | | for shake fanout, ``BSCP`` for batched scalar mul, ``BSRT`` for radix |
| | | sort, etc. — see lumbda.com_/bend for our form catalog). |
| | | |
| | | ---- |
| | | |
| | | Benchmark — 6 variants × 3 hosts at p=251 |
| | | ------------------------------------------ |
| | | |
| | | Our first published benchmark dispatched 6 ecdsa point-add variants |
| | | (Bernstein-Yang lever sweep) at p=251 to every host through our wire |
| | | protocol. Σ Toffoli matches **byte-for-byte** across hosts on identical |
| | | ops.bin inputs — proves our distributed mesh runs from one canonical |
| | | lumbda environment. |
| | | |
| | | GPU wall (ms/batch) at 1024 shots, lower wins: |
| | | |
| | | ======================== ============ ========= ========= ========= |
| | | variant Σ Tof / shot 3090 ms 4090 ms P40 ms |
| | | ======================== ============ ========= ========= ========= |
| | | v-fermat-schoolbook 167 984 290.1 149.2 531.7 |
| | | v-fermat-solinas 84 208 151.4 86.3 288.6 |
| | | v-by-text-schoolbook 45 104 68.4 39.3 131.8 |
| | | v-by-text-solinas 40 176 60.6 35.6 119.3 |
| | | v-by-ref-schoolbook 29 104 32.5 25.7 85.1 |
| | | v-by-ref-solinas 24 176 26.5 22.0 64.0 |
| | | ======================== ============ ========= ========= ========= |
| | | |
| | | Per-architecture pattern reads cleanly: |
| | | |
| | | - **Ampere 4090 (sm_89)** runs 1.94× faster than 3090 (sm_86) on our |
| | | heaviest circuit, shrinking to 1.20× on our lightest. Larger circuits |
| | | amortize kernel-launch overhead; smaller circuits hit our GPU's |
| | | overhead floor. |
| | | - **Pascal P40 (sm_61)** runs ~2× slower than 3090 across our entire |
| | | variant chain. Older arch, lower base clock, no reduced-precision |
| | | tensor units; still serves a third concurrent stream as our candidate |
| | | queue deepens. |
| | | |
| | | GPU wall scales linearly with Σ Toffoli — no kernel-level surprise. |
| | | Cross-host Σ Toffoli identity confirms bit-exact reproducibility across |
| | | our pool. |
| | | |
| | | ---- |
| | | |
| | | Coordinator fan-out |
| | | ------------------- |
| | | |
| | | ``ecdsa/scripts/coordinator-mesh.py`` dispatches candidates across our |
| | | 3-node mesh concurrently. Greedy bin-pack by predicted Σ Toffoli; each |
| | | candidate lands on whichever node's finish time after add stays lowest. |
| | | Per-host speed factor derived from our bench above: |
| | | |
| | | - 3090: factor 1.00 (baseline) |
| | | - 4090: factor 0.51 |
| | | - P40: factor 1.84 |
| | | |
| | | One ThreadPoolExecutor worker per node fires concurrently against our |
| | | wire protocol. Each node serializes its own queue (workers do not |
| | | multiplex requests cleanly inside one CUDA context). |
| | | |
| | | First-run finding: cost model based only on (predicted Σ Toffoli × |
| | | host factor) misses our per-request overhead floor (~600 ms on cammy |
| | | for demo_ops process spawn + Pascal CUDA init). Refinement options: a |
| | | per-host startup constant plus a Σ Toffoli linear term, fit per node |
| | | from our sweep history. Coordinator runs correctly today; schedule |
| | | shape needs calibration, not algorithmic change. |
| | | |
| | | ---- |
| | | |
| | | Coexistence with LLM services |
| | | ------------------------------ |
| | | |
| | | Each host on our pool carries an LLM service alongside its bend worker |
| | | slot. Bend workloads launch on demand; LLMs hold VRAM persistently. We |
| | | partition VRAM per host: |
| | | |
| | | ================ ================ ============================================== |
| | | ============= |
| | | host LLM workload VRAM partition |
| | | ================ ================ ============================================== |
| | | ============= |
| | | 3090-ai Hermes vllm Hermes-3-Llama-3.1-8B-FP8-Dynamic, KV 82 K tok |
| | | ens, 22.7 GB |
| | | ai (4090) qwen llama.cpp Qwen3.6-27B-UD-Q4_K_XL.gguf, ``-c 32768`` KV, |
| | | 16.8 GB |
| | | ai (4090) speech F5-TTS f5-tts (336M params), lazy load, 1.3-3.3 GB |
| | | cammy (P40) ollama qwen3-vl qwen3-vl:8b, lazy load on first request |
| | | ================ ================ ============================================== |
| | | ============= |
| | | |
| | | VRAM partition decisions: |
| | | |
| | | - 4090 carries both qwen LLM & speech TTS. qwen LLM eats 16.8 GB; F5-TTS |
| | | needs ~3 GB headroom for inference spikes. Reduced qwen context from |
| | | ``-c 65536`` to ``-c 32768`` (frees ~3.7 GB) so F5-TTS sees enough |
| | | inference room without OOM. ``tts-1-qwen`` registration in speech.py |
| | | stays disabled (qwen3-tts at 1.7B params claimed 6-7 GB more than |
| | | F5-TTS at 336M). |
| | | - 3090 carries Hermes only — no contention since bench workloads run |
| | | on demand & release VRAM at request boundary. |
| | | - P40 carries ollama for qwen3-vl vision/language requests. Lazy load |
| | | pattern (ollama only holds VRAM while serving a request). |
| | | |
| | | When our 3-GPU pool runs a bend benchmark, LLM workloads stay live; |
| | | bench transient memory drops slot into available headroom per host. |
| | | Long-running bench sweeps (e.g., variant emission at production scale) |
| | | displace an LLM cleanly when needed — ``systemctl stop llama-qwen`` |
| | | frees 16.8 GB on demand. |
| | | |
| | | ---- |
| | | |
| | | Caddy ingress |
| | | ------------- |
| | | |
| | | Public access routes through ``proxy.unturf.com``. Per-service Caddy |
| | | blocks gate behavior: |
| | | |
| | | ================================== ============================================ |
| | | ======= |
| | | ingress gate |
| | | ================================== ============================================ |
| | | ======= |
| | | ``qwen.ai.unturf.com`` Bearer token (``sk-unturf-*`` or ``sk-friend |
| | | s-*``); |
| | | foxhop LAN egress (138.207.194.52) bypasses |
| | | key. |
| | | ``hermes.ai.unturf.com`` (TBD per workload) |
| | | ``speech.ai.unturf.com`` Rate-limited reverse_proxy; foxhop LAN bypas |
| | | ses |
| | | rate limit. |
| | | ================================== ============================================ |
| | | ======= |
| | | |
| | | Bend wire on port 8320 stays LAN-only; we do **not** expose our bend |
| | | mesh outside our foxhop network. External research workloads that want |
| | | our pool's throughput route through a foxhop-egress relay. |
| | | |
| | | ---- |
| | | |
| | | Pool capacity math |
| | | ------------------ |
| | | |
| | | Coordinator at saturation. Three nodes scoring three different |
| | | candidates simultaneously closes our sequential 6-variant 3090 wall |
| | | (629 ms) into a parallel one. Refined-Solinas runs in ~25 ms on 3090 |
| | | + 22 ms on 4090 + 64 ms on P40 — P40 lags but contributes a third |
| | | concurrent stream as our candidate queue deepens. Per-host routing |
| | | (heavy circuits to faster GPUs, light circuits to P40) maximizes |
| | | aggregate dispatch. |
| | | |
| | | Pool throughput grows with our candidate generation rate — see |
| | | ecdsa.rst's "Why CPU Produces & GPU Scores" section for our |
| | | substrate-level math on candidate-emit vs candidate-score throughput. |
| | | |
| | | ---- |
| | | |
| | | Related pages |
| | | ------------- |
| | | |
| | | - `foxhop secp256k1 lever sweep <https://www.foxhop.net/5243e3fe-6146-11f1-8ce9- |
| | | 040140774501/secp256k1-point-addition-challenge-with-lumbda-attack>`__ — researc |
| | | h context; |
| | | full 18-cell cross-scale Pareto sweep; lumbda emit pipeline lift. |
| | | - `lumbda.com/bend <https://lumbda.com/bend.html>`__ — bend primitive |
| | | catalog; wire protocol; CUDA form list. |
| | | |
| | | ---- |
| | | |
| | | | *Updated 2026-06-07.* |