gpu-mesh

rev 1 | russell | 1780841003135 | JSON

+.. gpu-mesh.rst — public wiki page paste source; remarkbox assigns slug
+.. CC0 / public domain content; bench data from 2026-06-06/07
+.. foxhop · agent blackops
+3-GPU Mesh: Hardware, Benchmarks, & Coexistence
+==================================================
+| **Status:** 3-node mesh live on port 8320 (BEND)
+| **Wire substrate:** lumbda's ``bend`` primitive (lumbda.com_)
+| **Coordinator pattern:** ``coordinator-mesh.py`` — cost-routed fan-out
+.. _lumbda.com: https://lumbda.com/bend.html
+----
+.. contents::
+   :depth: 2
+   :local:
+----
+What this page tracks
+---------------------
+Our 3-GPU foxhop pool serves dual duty: research workloads (lumbda's
+bend primitive dispatches CUDA work over our wire protocol) & LLM
+inference (qwen, hermes, speech, ollama). Both classes coexist on our
+same hardware with VRAM partitioning per node. This page documents our
+hardware, our published benchmarks, & our coexistence pattern.
+For our ecdsa research context that drove our first 18-cell bench
+(refined Bernstein-Yang point-add variant sweep at p=251), see
+`foxhop secp256k1 lever sweep <https://www.foxhop.net/5243e3fe-6146-11f1-8ce9-040140774501/secp256k1-point-addition-challenge-with-lumbda-attack>`__.
+----
+Hardware
+--------
+Three Nvidia GPUs across three hosts on our foxhop LAN:
+========================  ===============  =====  =====  =================================
+host                      GPU              VRAM   arch   role
+========================  ===============  =====  =====  =================================
+``3090-ai.foxhop.net``    RTX 3090         24 GB  sm_86  bend mesh + Hermes LLM
+``ai.foxhop.net``         RTX 4090         24 GB  sm_89  bend mesh + qwen LLM + speech TTS
+``cammy.foxhop.net``      Tesla P40        24 GB  sm_61  bend mesh + ollama (qwen3-vl)
+========================  ===============  =====  =====  =================================
+72 GB combined VRAM across our pool. Pascal (sm_61) P40 carries no
+tensor cores, unlike our two newer cards, & it runs 1.8× to 2.6× slower
+than Ampere 3090 on identical workloads. That makes it useful as a third
+concurrent stream rather than a faster replacement for either newer
+card (3090 on Ampere, 4090 on Ada Lovelace).
+----
+Port 8320 — BEND
+----------------
+Our pool serves bend dispatches on port **8320** across every host.
+Mnemonic — 8320 spells BEND:
+::
+   8 ~= B (implied infinity B flattened; bake a cake; baby & me)
+   3 ~= E (backward)
+   2 ~= N (pivoted 90 degrees)
+   0 ~= D (flattened)
+Each node runs ``gpu-worker.lsp`` (lumbda's reference CUDA worker)
+against demo_ops, blake3-fanout, radix-sort, & our other registered
+CUDA forms. Clients reach our mesh via lumbda's ``bend`` primitive over
+TCP, with an S-expression wire format & per-form binary protocols (``BSHK``
+for shake fanout, ``BSCP`` for batched scalar mul, ``BSRT`` for radix
+sort, etc.; see lumbda.com_ for our form catalog).
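A quick pre-flight check before dispatching can confirm that each node listens on 8320. What follows is a minimal Python sketch that opens a plain TCP connection to every host named on this page. It speaks no bend wire protocol at all; ``MESH_HOSTS``, ``is_reachable``, & ``mesh_status`` are hypothetical names for illustration, not part of lumbda or ``gpu-worker.lsp``.

```python
import socket

# Hosts and port come from this page; helper names are hypothetical.
MESH_HOSTS = ["3090-ai.foxhop.net", "ai.foxhop.net", "cammy.foxhop.net"]
BEND_PORT = 8320


def is_reachable(host: str, port: int = BEND_PORT, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failure, refused connections, and timeouts.
        return False


def mesh_status(hosts=MESH_HOSTS, port: int = BEND_PORT) -> dict:
    """Map each host to its TCP reachability on the bend port."""
    return {host: is_reachable(host, port) for host in hosts}
```

Calling ``mesh_status()`` from a LAN host returns one boolean per node. From outside our foxhop network every node should report ``False``, since port 8320 stays LAN-only.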
+----
+Benchmark — 6 variants × 3 hosts at p=251
+------------------------------------------
+Our first published benchmark dispatched 6 ecdsa point-add variants
+(Bernstein-Yang lever sweep) at p=251 to every host through our wire
+protocol. Σ Toffoli matches **byte-for-byte** across hosts on identical
+``ops.bin`` inputs, which indicates every node runs from one canonical
+lumbda environment.
+GPU wall (ms/batch) at 1024 shots, lower wins:
+========================  ============  =========  =========  =========
+variant                   Σ Tof / shot  3090 ms    4090 ms    P40 ms
+========================  ============  =========  =========  =========
+v-fermat-schoolbook       167 984       290.1      149.2      531.7
+v-fermat-solinas           84 208       151.4       86.3      288.6
+v-by-text-schoolbook       45 104        68.4       39.3      131.8
+v-by-text-solinas          40 176        60.6       35.6      119.3
+v-by-ref-schoolbook        29 104        32.5       25.7       85.1
+v-by-ref-solinas           24 176        26.5       22.0       64.0
+========================  ============  =========  =========  =========
+Per-architecture pattern reads cleanly:
+- **Ada Lovelace 4090 (sm_89)** runs 1.94× faster than Ampere 3090
+  (sm_86) on our heaviest circuit, shrinking to 1.20× on our lightest.
+  Larger circuits amortize kernel-launch overhead; smaller circuits hit
+  each GPU's overhead floor.
+- **Pascal P40 (sm_61)** runs 1.8× to 2.6× slower than 3090 across our
+  entire variant chain, with its widest gap on both by-ref variants.
+  Older arch, lower base clock, no tensor cores; still serves a third
+  concurrent stream as our candidate queue deepens.
+GPU wall grows with Σ Toffoli on every host, roughly linearly within
+each variant family: 3090 batch cost per Toffoli drops from ~1.7-1.8 µs
+on fermat variants to ~1.1 µs on by-ref variants. No kernel-level
+surprise. Cross-host Σ Toffoli identity confirms bit-exact
+reproducibility across our pool.
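Ratios quoted in this section follow directly from our bench table. Below sits a short Python sketch that recomputes them; every number is copied from this page, while ``BENCH`` & ``speed_ratios`` are illustrative names only.

```python
# (Σ Toffoli per shot, 3090 ms, 4090 ms, P40 ms) copied from the bench table.
BENCH = {
    "v-fermat-schoolbook": (167_984, 290.1, 149.2, 531.7),
    "v-fermat-solinas": (84_208, 151.4, 86.3, 288.6),
    "v-by-text-schoolbook": (45_104, 68.4, 39.3, 131.8),
    "v-by-text-solinas": (40_176, 60.6, 35.6, 119.3),
    "v-by-ref-schoolbook": (29_104, 32.5, 25.7, 85.1),
    "v-by-ref-solinas": (24_176, 26.5, 22.0, 64.0),
}


def speed_ratios(bench=BENCH):
    """Per variant: (4090 speedup over 3090, P40 slowdown vs 3090,
    3090 microseconds of batch wall per Toffoli)."""
    rows = {}
    for name, (tof, t3090, t4090, tp40) in bench.items():
        rows[name] = (t3090 / t4090, tp40 / t3090, 1000.0 * t3090 / tof)
    return rows


if __name__ == "__main__":
    for name, (up, down, us) in speed_ratios().items():
        print(f"{name:22s} 4090 {up:.2f}x  P40 {down:.2f}x  {us:.2f} us/Tof")
```

Reading its third column per variant family shows where linear scaling holds & where by-ref variants run cheaper per Toffoli.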
+----
+Coordinator fan-out
+-------------------
+``ecdsa/scripts/coordinator-mesh.py`` dispatches candidates across our
+3-node mesh concurrently. Greedy bin-pack by predicted Σ Toffoli; each
+candidate lands on whichever node's finish time after add stays lowest.
+Per-host speed factor derived from our bench above:
+- 3090: factor 1.00 (baseline)
+- 4090: factor 0.51
+- P40:  factor 1.84
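A minimal Python sketch of that greedy rule, under stated assumptions: candidates get sorted heaviest-first (this page does not say which order ``coordinator-mesh.py`` uses), & cost per candidate equals predicted Σ Toffoli × host factor, in 3090-equivalent units. ``greedy_assign`` is a hypothetical name, not our coordinator's actual function.

```python
# Host speed factors from this page (3090 = baseline).
FACTORS = {"3090": 1.00, "4090": 0.51, "P40": 1.84}


def greedy_assign(candidates, factors=FACTORS):
    """Assign each candidate to the node whose finish time stays lowest.

    candidates: dict of name -> predicted Σ Toffoli.
    Returns (plan, finish): node -> ordered names, node -> total cost.
    """
    finish = {node: 0.0 for node in factors}
    plan = {node: [] for node in factors}
    # Heaviest-first ordering is an assumption, not confirmed by the page.
    ordered = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    for name, tof in ordered:
        node = min(factors, key=lambda n: finish[n] + tof * factors[n])
        plan[node].append(name)
        finish[node] += tof * factors[node]
    return plan, finish


if __name__ == "__main__":
    tofs = {
        "v-fermat-schoolbook": 167_984, "v-fermat-solinas": 84_208,
        "v-by-text-schoolbook": 45_104, "v-by-text-solinas": 40_176,
        "v-by-ref-schoolbook": 29_104, "v-by-ref-solinas": 24_176,
    }
    plan, finish = greedy_assign(tofs)
    for node in plan:
        print(node, plan[node], round(finish[node]))
```

Fed our six bench variants, this sketch sends fermat-schoolbook to 4090 first & leaves P40 with a single mid-weight circuit.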
+One ThreadPoolExecutor worker per node fires concurrently against our
+wire protocol. Each node serializes its own queue (workers do not
+multiplex requests cleanly inside one CUDA context).
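One worker per node with a serialized per-node queue can be sketched in Python as follows. ``dispatch`` stands in for whatever call actually sends a candidate over our wire protocol; both it & ``run_plan`` are hypothetical names, & this sketch makes no claim about how ``coordinator-mesh.py`` structures its threads internally.

```python
from concurrent.futures import ThreadPoolExecutor


def run_plan(plan, dispatch):
    """Run every node's queue concurrently, one thread per node.

    plan: dict of node -> ordered list of candidates.
    dispatch: callable (node, candidate) -> result; a stand-in for the
              real wire call, which this sketch does not implement.
    """
    def drain(node):
        # Candidates on the same node run strictly one after another.
        return [dispatch(node, cand) for cand in plan[node]]

    with ThreadPoolExecutor(max_workers=max(1, len(plan))) as pool:
        futures = {node: pool.submit(drain, node) for node in plan}
        # result() re-raises any exception raised inside a node's queue.
        return {node: fut.result() for node, fut in futures.items()}
```

Keeping one thread per node means no two requests ever overlap inside a single worker, which matches that serialization constraint without extra locks.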
+First-run finding: a cost model based only on (predicted Σ Toffoli ×
+host factor) misses our per-request overhead floor (~600 ms on cammy
+for demo_ops process spawn + Pascal CUDA init). Refinement: add a
+per-host startup constant to a Σ Toffoli linear term, both fit per node
+from our sweep history. Coordinator runs correctly today; schedule
+shape needs calibration, not algorithmic change.
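One way to calibrate such a model: an ordinary least-squares fit of ``wall_ms ≈ startup_ms + ms_per_tof × Σ Toffoli`` per host. Python sketch below, with hypothetical function names. Sample data reuses P40 GPU wall from our bench table purely for illustration; that column excludes process spawn & CUDA init, so a real fit would need end-to-end request timings from sweep history, which this page does not list.

```python
def fit_startup_and_slope(tofs, wall_ms):
    """Least-squares fit of wall_ms ~ startup_ms + ms_per_tof * tof."""
    n = len(tofs)
    if n < 2 or n != len(wall_ms):
        raise ValueError("need at least two matching samples")
    mean_x = sum(tofs) / n
    mean_y = sum(wall_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in tofs)
    if sxx == 0:
        raise ValueError("all Toffoli counts identical; slope undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tofs, wall_ms))
    ms_per_tof = sxy / sxx
    startup_ms = mean_y - ms_per_tof * mean_x
    return startup_ms, ms_per_tof


def predict_ms(tof, startup_ms, ms_per_tof):
    """Predicted wall time for one candidate on the fitted host."""
    return startup_ms + ms_per_tof * tof


# Illustration only: P40 GPU wall (ms) vs Σ Toffoli from the bench table.
TOFS = [167_984, 84_208, 45_104, 40_176, 29_104, 24_176]
P40_MS = [531.7, 288.6, 131.8, 119.3, 85.1, 64.0]

if __name__ == "__main__":
    startup, slope = fit_startup_and_slope(TOFS, P40_MS)
    print(f"startup {startup:.1f} ms, {slope * 1000:.3f} us per Toffoli")
```

A scheduler could then replace ``tof * factor`` with ``predict_ms(tof, startup_ms, ms_per_tof)`` per host, leaving its greedy loop unchanged.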
+----
+Coexistence with LLM services
+------------------------------
+Each host on our pool carries an LLM service alongside its bend worker
+slot. Bend workloads launch on demand; LLMs hold VRAM persistently. We
+partition VRAM per host:
+================ ================ ===========================================================
+host             LLM workload     VRAM partition
+================ ================ ===========================================================
+3090-ai          Hermes vllm      Hermes-3-Llama-3.1-8B-FP8-Dynamic, KV 82 K tokens, 22.7 GB
+ai (4090)        qwen llama.cpp   Qwen3.6-27B-UD-Q4_K_XL.gguf, ``-c 32768`` KV, 16.8 GB
+ai (4090)        speech F5-TTS    f5-tts (336M params), lazy load, 1.3-3.3 GB
+cammy (P40)      ollama qwen3-vl  qwen3-vl:8b, lazy load on first request
+================ ================ ===========================================================
+VRAM partition decisions:
+- 4090 carries both qwen LLM & speech TTS. qwen LLM eats 16.8 GB; F5-TTS
+  needs ~3 GB headroom for inference spikes. Reduced qwen context from
+  ``-c 65536`` to ``-c 32768`` (frees ~3.7 GB) so F5-TTS sees enough
+  inference room without OOM. ``tts-1-qwen`` registration in speech.py
+  stays disabled (qwen3-tts at 1.7B params claimed 6-7 GB more than
+  F5-TTS at 336M).
+- 3090 carries Hermes only — no contention since bench workloads run
+  on demand & release VRAM at request boundary.
+- P40 carries ollama for qwen3-vl vision/language requests. Lazy load
+  pattern (ollama only holds VRAM while serving a request).
+When our 3-GPU pool runs a bend benchmark, LLM workloads stay live;
+transient bench memory slots into available headroom on each host.
+Long-running bench sweeps (e.g., variant emission at production scale)
+displace an LLM cleanly when needed: ``systemctl stop llama-qwen``
+frees 16.8 GB on demand.
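Headroom arithmetic behind those partition decisions, as a small Python sketch. Figures are nominal values from this page (24 GB per card, peak F5-TTS usage); real usable VRAM runs slightly lower than nominal, & ollama's resident size on P40 goes unstated here, so it counts as zero while idle. ``headroom_gb`` & ``fits`` are hypothetical helper names.

```python
# Nominal figures from this page, in GB. Ranges use their upper bound.
TOTAL_GB = 24.0
RESIDENT_GB = {
    "3090-ai": {"Hermes vllm": 22.7},
    "ai": {"qwen llama.cpp": 16.8, "F5-TTS (peak)": 3.3},
    "cammy": {},  # ollama lazy-loads; resident size not stated on this page
}


def headroom_gb(host, total=TOTAL_GB, resident=RESIDENT_GB):
    """Nominal VRAM left on a host after its persistent LLM services."""
    return total - sum(resident[host].values())


def fits(host, bench_gb):
    """True if a bench workload of bench_gb fits in nominal headroom."""
    return bench_gb <= headroom_gb(host)


if __name__ == "__main__":
    for host in RESIDENT_GB:
        print(f"{host}: {headroom_gb(host):.1f} GB nominal headroom")
```

On these numbers 4090 keeps about 3.9 GB free with both services at peak, & 3090 about 1.3 GB beside Hermes, which gives a rough ceiling for transient bench memory per host.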
+----
+Caddy ingress
+-------------
+Public access routes through ``proxy.unturf.com``. Per-service Caddy
+blocks gate behavior:
+==================================  ===================================================
+ingress                             gate
+==================================  ===================================================
+``qwen.ai.unturf.com``              Bearer token (``sk-unturf-*`` or ``sk-friends-*``);
+                                    foxhop LAN egress (138.207.194.52) bypasses key.
+``hermes.ai.unturf.com``            (TBD per workload)
+``speech.ai.unturf.com``            Rate-limited reverse_proxy; foxhop LAN bypasses
+                                    rate limit.
+==================================  ===================================================
+Bend wire on port 8320 stays LAN-only; we do **not** expose our bend
+mesh outside our foxhop network. External research workloads that want
+our pool's throughput route through a foxhop-egress relay.
+----
+Pool capacity math
+------------------
+Coordinator at saturation: three nodes scoring three different
+candidates simultaneously collapse our sequential 6-variant 3090 wall
+(~629.5 ms) into a parallel one. Refined-Solinas (``v-by-ref-solinas``)
+runs in 26.5 ms on 3090, 22.0 ms on 4090, & 64.0 ms on P40; P40 lags but
+contributes a third concurrent stream as our candidate queue deepens.
+Per-host routing (heavy circuits to faster GPUs, light circuits to P40)
+maximizes aggregate dispatch.
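To put a number on that collapse, a brute-force Python sketch can try all 3^6 = 729 ways of placing our six bench variants on three hosts & keep whichever placement finishes soonest. It uses measured GPU wall from our bench table only & ignores per-request overhead, so its result reads as a lower bound on wall time, not a prediction. ``best_makespan`` is a hypothetical name.

```python
from itertools import product

HOSTS = ("3090", "4090", "P40")
# GPU wall in ms per host, same order as HOSTS, from the bench table.
WALL_MS = {
    "v-fermat-schoolbook": (290.1, 149.2, 531.7),
    "v-fermat-solinas": (151.4, 86.3, 288.6),
    "v-by-text-schoolbook": (68.4, 39.3, 131.8),
    "v-by-text-solinas": (60.6, 35.6, 119.3),
    "v-by-ref-schoolbook": (32.5, 25.7, 85.1),
    "v-by-ref-solinas": (26.5, 22.0, 64.0),
}


def best_makespan(wall_ms=WALL_MS, hosts=HOSTS):
    """Exhaustively search placements; return (makespan_ms, plan)."""
    names = list(wall_ms)
    best_span, best_plan = float("inf"), None
    for choice in product(range(len(hosts)), repeat=len(names)):
        load = [0.0] * len(hosts)
        for name, h in zip(names, choice):
            load[h] += wall_ms[name][h]
        span = max(load)  # pool finishes when its slowest node finishes
        if span < best_span:
            best_span = span
            best_plan = {n: hosts[h] for n, h in zip(names, choice)}
    return best_span, best_plan


if __name__ == "__main__":
    span, plan = best_makespan()
    sequential = sum(t[0] for t in WALL_MS.values())
    print(f"sequential 3090: {sequential:.1f} ms, best parallel: {span:.1f} ms")
    print(plan)
```

Exhaustive search only suits a handful of candidates; a greedy pass scales to deeper queues at some cost in schedule quality.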
+Pool throughput grows with our candidate generation rate — see
+ecdsa.rst's "Why CPU Produces & GPU Scores" section for our
+substrate-level math on candidate-emit vs candidate-score throughput.
+----
+Related pages
+-------------
+- `foxhop secp256k1 lever sweep <https://www.foxhop.net/5243e3fe-6146-11f1-8ce9-040140774501/secp256k1-point-addition-challenge-with-lumbda-attack>`__: research context;
+  full 18-cell cross-scale Pareto sweep; lumbda emit pipeline lift.
+- `lumbda.com/bend <https://lumbda.com/bend.html>`__: bend primitive
+  catalog; wire protocol; CUDA form list.
+----
+| *Updated 2026-06-07.*