{"revision": {"id": "a5703d49-6279-11f1-a726-040140774501", "node_id": "053ba7da-6277-11f1-82fc-040140774501", "user_id": "680aec8e-3391-11f1-95d9-040140774501", "author": "russell", "data": ".. gpu-mesh.rst \u2014 public wiki page paste source; remarkbox assigns slug\r\n.. CC0 / public domain content; bench data from 2026-06-06/07\r\n.. foxhop \u00b7 agent blackops\r\n\r\n3-GPU Mesh \u2014 Hardware, Benchmarks, & Coexistence\r\n==================================================\r\n\r\n| **Status:** 3-node mesh live on port 8320 (BEND)\r\n| **Wire substrate:** lumbda's ``bend`` primitive (lumbda.com_)\r\n| **Coordinator pattern:** ``coordinator-mesh.py`` \u2014 cost-routed fan-out\r\n\r\n.. _lumbda.com: https://lumbda.com/bend.html\r\n\r\n----\r\n\r\n.. contents::\r\n   :depth: 2\r\n   :local:\r\n\r\n----\r\n\r\nWhat this page tracks\r\n---------------------\r\n\r\nOur 3-GPU foxhop pool serves dual duty: research workloads (lumbda's\r\nbend primitive dispatches CUDA work over our wire protocol) & LLM\r\ninference (qwen, hermes, speech, ollama). Both classes coexist on our\r\nsame hardware with VRAM partitioning per node. 
This page documents our\r\nhardware, our published benchmarks, & our coexistence pattern.\r\n\r\nFor our ecdsa research context that drove our first 18-cell bench\r\n(refined Bernstein-Yang point-add variant sweep at p=251), see\r\n`foxhop secp256k1 lever sweep <https://www.foxhop.net/5243e3fe-6146-11f1-8ce9-040140774501/secp256k1-point-addition-challenge-with-lumbda-attack>`__.\r\n\r\n----\r\n\r\nHardware\r\n--------\r\n\r\nThree NVIDIA GPUs across three hosts on our foxhop LAN:\r\n\r\n========================  ===============  =====  =========  ============================\r\nhost                      GPU              VRAM   arch       role\r\n========================  ===============  =====  =========  ============================\r\n``3090-ai.foxhop.net``    RTX 3090         24 GB  sm_86      bend mesh + Hermes LLM\r\n``ai.foxhop.net``         RTX 4090         24 GB  sm_89      bend mesh + qwen LLM + speech TTS\r\n``cammy.foxhop.net``      Tesla P40        24 GB  sm_61      bend mesh + ollama (qwen3-vl)\r\n========================  ===============  =====  =========  ============================\r\n\r\n72 GB combined VRAM across our pool. Pascal (sm_61) P40 lacks our\r\nnewer cards' reduced-precision tensor units & runs ~2\u00d7 slower than\r\nAmpere 3090 on identical workloads, so it serves as a third concurrent\r\nstream rather than a replacement for either newer card.\r\n\r\n----\r\n\r\nPort 8320 \u2014 BEND\r\n----------------\r\n\r\nOur pool serves bend dispatches on port **8320** across every host.\r\nMnemonic \u2014 8320 spells BEND:\r\n\r\n::\r\n\r\n    8 ~= B (implied infinity B flattened; bake a cake; baby & me)\r\n    3 ~= E (backward)\r\n    2 ~= N (pivoted 90 degrees)\r\n    0 ~= D (flattened)\r\n\r\nEach node runs ``gpu-worker.lsp`` (lumbda's reference CUDA worker)\r\nagainst demo_ops, blake3-fanout, radix-sort, & our other registered\r\nCUDA forms. 
Clients reach our mesh via lumbda's ``bend`` primitive over\r\nTCP, S-expression wire format, & per-form binary protocols (``BSHK``\r\nfor shake fanout, ``BSCP`` for batched scalar mul, ``BSRT`` for radix\r\nsort, etc. \u2014 see lumbda.com_/bend for our form catalog).\r\n\r\n----\r\n\r\nBenchmark \u2014 6 variants \u00d7 3 hosts at p=251\r\n------------------------------------------\r\n\r\nOur first published benchmark dispatched 6 ecdsa point-add variants\r\n(Bernstein-Yang lever sweep) at p=251 to every host through our wire\r\nprotocol. \u03a3 Toffoli matches **byte-for-byte** across hosts on identical\r\nops.bin inputs \u2014 proves our distributed mesh runs from one canonical\r\nlumbda environment.\r\n\r\nGPU wall (ms/batch) at 1024 shots, lower wins:\r\n\r\n========================  ============  =========  =========  =========\r\nvariant                   \u03a3 Tof / shot  3090 ms    4090 ms    P40 ms\r\n========================  ============  =========  =========  =========\r\nv-fermat-schoolbook       167 984       290.1      149.2      531.7\r\nv-fermat-solinas           84 208       151.4       86.3      288.6\r\nv-by-text-schoolbook       45 104        68.4       39.3      131.8\r\nv-by-text-solinas          40 176        60.6       35.6      119.3\r\nv-by-ref-schoolbook        29 104        32.5       25.7       85.1\r\nv-by-ref-solinas           24 176        26.5       22.0       64.0\r\n========================  ============  =========  =========  =========\r\n\r\nPer-architecture pattern reads cleanly:\r\n\r\n- **Ada Lovelace 4090 (sm_89)** runs 1.94\u00d7 faster than Ampere 3090\r\n  (sm_86) on our heaviest circuit, shrinking to 1.20\u00d7 on our lightest.\r\n  Larger circuits amortize kernel-launch overhead; smaller circuits hit\r\n  our GPU's overhead floor.\r\n- **Pascal P40 (sm_61)** runs 1.8\u00d7 to 2.6\u00d7 slower than 3090 across\r\n  our entire variant chain. 
Older arch, lower base clock, no reduced-precision\r\n  tensor units; still serves a third concurrent stream as our candidate\r\n  queue deepens.\r\n\r\nGPU wall scales linearly with \u03a3 Toffoli \u2014 no kernel-level surprise.\r\nCross-host \u03a3 Toffoli identity confirms bit-exact reproducibility across\r\nour pool.\r\n\r\n----\r\n\r\nCoordinator fan-out\r\n-------------------\r\n\r\n``ecdsa/scripts/coordinator-mesh.py`` dispatches candidates across our\r\n3-node mesh concurrently. Greedy bin-pack by predicted \u03a3 Toffoli; each\r\ncandidate lands on whichever node's finish time after add stays lowest.\r\nPer-host speed factor derived from our bench above:\r\n\r\n- 3090: factor 1.00 (baseline)\r\n- 4090: factor 0.51\r\n- P40:  factor 1.84\r\n\r\nOne ThreadPoolExecutor worker per node fires concurrently against our\r\nwire protocol. Each node serializes its own queue (workers do not\r\nmultiplex requests cleanly inside one CUDA context).\r\n\r\nFirst-run finding: cost model based only on (predicted \u03a3 Toffoli \u00d7\r\nhost factor) misses our per-request overhead floor (~600 ms on cammy\r\nfor demo_ops process spawn + Pascal CUDA init). Refinement options: a\r\nper-host startup constant plus a \u03a3 Toffoli linear term, fit per node\r\nfrom our sweep history. Coordinator runs correctly today; schedule\r\nshape needs calibration, not algorithmic change.\r\n\r\n----\r\n\r\nCoexistence with LLM services\r\n------------------------------\r\n\r\nEach host on our pool carries an LLM service alongside its bend worker\r\nslot. Bend workloads launch on demand; LLMs hold VRAM persistently. 
We\r\npartition VRAM per host:\r\n\r\n================ ================ ===========================================================\r\nhost             LLM workload     VRAM partition\r\n================ ================ ===========================================================\r\n3090-ai          Hermes vllm      Hermes-3-Llama-3.1-8B-FP8-Dynamic, KV 82 K tokens, 22.7 GB\r\nai (4090)        qwen llama.cpp   Qwen3.6-27B-UD-Q4_K_XL.gguf, ``-c 32768`` KV, 16.8 GB\r\nai (4090)        speech F5-TTS    f5-tts (336M params), lazy load, 1.3-3.3 GB\r\ncammy (P40)      ollama qwen3-vl  qwen3-vl:8b, lazy load on first request\r\n================ ================ ===========================================================\r\n\r\nVRAM partition decisions:\r\n\r\n- 4090 carries both qwen LLM & speech TTS. qwen LLM eats 16.8 GB; F5-TTS\r\n  needs ~3 GB headroom for inference spikes. Reduced qwen context from\r\n  ``-c 65536`` to ``-c 32768`` (frees ~3.7 GB) so F5-TTS sees enough\r\n  inference room without OOM. ``tts-1-qwen`` registration in speech.py\r\n  stays disabled (qwen3-tts at 1.7B params claimed 6-7 GB more than\r\n  F5-TTS at 336M).\r\n- 3090 carries Hermes only \u2014 no contention since bench workloads run\r\n  on demand & release VRAM at request boundary.\r\n- P40 carries ollama for qwen3-vl vision/language requests. Lazy load\r\n  pattern (ollama only holds VRAM while serving a request).\r\n\r\nWhen our 3-GPU pool runs a bend benchmark, LLM workloads stay live;\r\nbench transient memory drops slot into available headroom per host.\r\nLong-running bench sweeps (e.g., variant emission at production scale)\r\ndisplace an LLM cleanly when needed \u2014 ``systemctl stop llama-qwen``\r\nfrees 16.8 GB on demand.\r\n\r\n----\r\n\r\nCaddy ingress\r\n-------------\r\n\r\nPublic access routes through ``proxy.unturf.com``. 
Per-service Caddy\r\nblocks gate behavior:\r\n\r\n==================================  ===================================================\r\ningress                             gate\r\n==================================  ===================================================\r\n``qwen.ai.unturf.com``              Bearer token (``sk-unturf-*`` or ``sk-friends-*``);\r\n                                    foxhop LAN egress (138.207.194.52) bypasses key.\r\n``hermes.ai.unturf.com``            (TBD per workload)\r\n``speech.ai.unturf.com``            Rate-limited reverse_proxy; foxhop LAN bypasses\r\n                                    rate limit.\r\n==================================  ===================================================\r\n\r\nBend wire on port 8320 stays LAN-only; we do **not** expose our bend\r\nmesh outside our foxhop network. External research workloads that want\r\nour pool's throughput route through a foxhop-egress relay.\r\n\r\n----\r\n\r\nPool capacity math\r\n------------------\r\n\r\nAt saturation, three nodes score three different candidates\r\nsimultaneously, which collapses our sequential 6-variant 3090 wall\r\n(629 ms) into a parallel one. v-by-ref-solinas (refined Solinas) runs\r\nin 26.5 ms on 3090, 22.0 ms on 4090, & 64.0 ms on P40; P40 lags but\r\ncontributes a third concurrent stream as our candidate queue deepens. 
Per-host routing\r\n(heavy circuits to faster GPUs, light circuits to P40) maximizes\r\naggregate dispatch.\r\n\r\nPool throughput grows with our candidate generation rate \u2014 see\r\necdsa.rst's \"Why CPU Produces & GPU Scores\" section for our\r\nsubstrate-level math on candidate-emit vs candidate-score throughput.\r\n\r\n----\r\n\r\nRelated pages\r\n-------------\r\n\r\n- `foxhop secp256k1 lever sweep <https://www.foxhop.net/5243e3fe-6146-11f1-8ce9-040140774501/secp256k1-point-addition-challenge-with-lumbda-attack>`__ \u2014 research context;\r\n  full 18-cell cross-scale Pareto sweep; lumbda emit pipeline lift.\r\n- `lumbda.com/bend <https://lumbda.com/bend.html>`__ \u2014 bend primitive\r\n  catalog; wire protocol; CUDA form list.\r\n\r\n----\r\n\r\n| *Updated 2026-06-07.*", "source_format": "rst", "revision_number": 1, "created": 1780841003135}}