========
gpu-mesh
========

.. raw:: html

   <table class="provenance-header" style="border: 0; border-collapse: collapse; margin: 0 0 16px 0; width: 100%;">

.. raw:: html

   <tr style="border: 0;">

.. raw:: html

   <td style="border: 0; vertical-align: top; padding: 0 24px 0 0;">

..

   | **Source:**
     https://foxhop.net/053ba7da-6277-11f1-82fc-040140774501/gpu-mesh
   | **Snapshot:** 2026-06-07T17:38:15Z
   | **Generator:** Remarkbox ``50b9d1e``

   *This is a thread snapshot. The living document lives at the source
   URI above — it may have been edited, extended, or replied-to since.*

.. raw:: html

   </td>

.. raw:: html

   <td style="border: 0; vertical-align: top; width: 200px; text-align: right;">

.. figure:: data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAKQAAACkAQAAAAAxzrjsAAABaElEQVR42tWXQY7DQAgEW/sB/v9LfsBSjSPlkMtKfchOotjjQwcVDYM1H1bri59Kqq6pKu2nvP3Rp/WXp5qaqW717G0/24juBrliG+XetLcxXfV+EVRYt6Y3VmOO6S7QpevMTZAvBnhbMT+wZBteyLgvw3cDXLYbeN0uo7t6m69GzpHvJcMXDyBLqDzpkB+KECHbXPhJ+UxmDJENthSL90lbN6BTfDdllFi78SxopTj08WWzRiOBId1yKZdZDMWX4WCmug5J5YX4upvRfXHYJi/lX0wgE3C8Mf8SZV9fn8Ob8hldwb6w11J8m8y9SvlxSKafHVVOonJBh84LDmKdz+wMTaouhojrzg24pHTlSMttzTxS/cyyA15sUbl6w8Buasr1B7giTWW4NiY2PyDtjN0MEZt3ykenxyj/QW4+o+e0y62Sc5/PNqDQ5JPzZDNPYjWsEePLYTwPYCXnSUymZzIJzpP/5z3rF2oMwgzq0at0AAAAAElFTkSuQmCC
   :alt: Scan for living source

   Scan for living source

.. raw:: html

   </td>

.. raw:: html

   </tr>

.. raw:: html

   </table>

3-GPU Mesh — Hardware & Benchmarks
==================================

.. container::

   | **Status:** 3-node mesh live on port 8320 (BEND)
   | **Wire substrate:** lumbda's ``bend`` primitive
     (`lumbda.com <https://lumbda.com/bend.html>`__)
   | **Coordinator pattern:** cost-routed fan-out via greedy bin-pack

--------------

What this page tracks
---------------------

Three NVIDIA GPUs across three hosts on our foxhop LAN serve as a bend
mesh: lumbda's CUDA dispatch primitive routes work to whichever node is
projected to finish it soonest. This page documents our hardware, our
published benchmarks, & our cost-routed coordinator pattern.

For the ECDSA research context that drove our first 18-cell bench
(refined Bernstein-Yang point-add variant sweep at p=251), see `foxhop
secp256k1 lever
sweep <https://www.foxhop.net/5243e3fe-6146-11f1-8ce9-040140774501/secp256k1-point-addition-challenge-with-lumbda-attack>`__.

--------------

Hardware
--------

====================== ========= ===== =====
host                   GPU       VRAM  arch
====================== ========= ===== =====
``3090-ai.foxhop.net`` RTX 3090  24 GB sm_86
``ai.foxhop.net``      RTX 4090  24 GB sm_89
``cammy.foxhop.net``   Tesla P40 24 GB sm_61
====================== ========= ===== =====

72 GB combined VRAM across our pool. The Pascal (sm_61) P40 lacks the
reduced-precision tensor units of our newer architectures and runs
roughly 1.8× to 2.6× slower than the Ampere 3090 in our benchmark below,
so it is useful as a third concurrent stream rather than as a
replacement for either newer card.

--------------

Port 8320 — BEND
----------------

Our pool serves bend dispatches on port **8320** across every host.
Mnemonic — 8320 spells BEND:

::

   8 ~= B (implied infinity B flattened; bake a cake; baby & me)
   3 ~= E (backward)
   2 ~= N (pivoted 90 degrees)
   0 ~= D (flattened)

Each node runs ``gpu-worker.lsp`` (lumbda's reference CUDA worker)
against demo_ops, blake3-fanout, radix-sort, & our other registered CUDA
forms. Clients reach our mesh via lumbda's ``bend`` primitive over TCP,
using an S-expression wire format plus per-form binary protocols
(``BSHK`` for shake fanout, ``BSCP`` for batched scalar mul, ``BSRT``
for radix sort, etc.; see
`lumbda.com/bend <https://lumbda.com/bend.html>`__ for our form
catalog). Our bend wire stays LAN-only; we do **not** expose our mesh
outside our foxhop network.
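
A quick way to confirm that each node accepts connections on our bend
port is a plain TCP connect, sketched below in Python. It assumes only
the host names and port listed on this page; the helper name
``port_open`` is hypothetical, and the sketch does not speak the bend
wire protocol itself.

.. code-block:: python

   import socket

   BEND_PORT = 8320
   HOSTS = ["3090-ai.foxhop.net", "ai.foxhop.net", "cammy.foxhop.net"]


   def port_open(host, port=BEND_PORT, timeout=1.0):
       """Return True if a TCP connection to host:port succeeds in time."""
       try:
           with socket.create_connection((host, port), timeout=timeout):
               return True
       except OSError:
           # Covers refused connections, timeouts, and DNS failures.
           return False

Calling ``port_open(h)`` for each entry in ``HOSTS`` from a machine on
our LAN gives a coarse liveness check. A successful connect only shows
that something listens on 8320, not that the worker answers dispatches.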

--------------

Benchmark — 6 variants × 3 hosts at p=251
-----------------------------------------

Our first published benchmark dispatched 6 ECDSA point-add variants
(Bernstein-Yang lever sweep) at p=251 to every host through our wire
protocol. Σ Toffoli matches **byte-for-byte** across hosts on identical
``ops.bin`` inputs, which is what we expect when every node in our
distributed mesh runs from one canonical lumbda environment.

GPU wall (ms/batch) at 1024 shots, lower wins:

+----------------------+--------------+---------+---------+--------+
| variant              | Σ Tof / shot | 3090 ms | 4090 ms | P40 ms |
+======================+==============+=========+=========+========+
| v-fermat-schoolbook  | 167 984      | 290.1   | 149.2   | 531.7  |
+----------------------+--------------+---------+---------+--------+
| v-fermat-solinas     | 84 208       | 151.4   | 86.3    | 288.6  |
+----------------------+--------------+---------+---------+--------+
| v-by-text-schoolbook | 45 104       | 68.4    | 39.3    | 131.8  |
+----------------------+--------------+---------+---------+--------+
| v-by-text-solinas    | 40 176       | 60.6    | 35.6    | 119.3  |
+----------------------+--------------+---------+---------+--------+
| v-by-ref-schoolbook  | 29 104       | 32.5    | 25.7    | 85.1   |
+----------------------+--------------+---------+---------+--------+
| v-by-ref-solinas     | 24 176       | 26.5    | 22.0    | 64.0   |
+----------------------+--------------+---------+---------+--------+

Per-architecture pattern reads cleanly:

-  **Ada 4090 (sm_89)** runs 1.94× faster than the Ampere 3090 (sm_86)
   on our heaviest circuit, shrinking to 1.20× on our lightest. Larger
   circuits amortize kernel-launch overhead; smaller circuits hit our
   GPU's overhead floor.
-  **Pascal P40 (sm_61)** runs 1.8× to 2.6× slower than 3090 across our
   variant chain, with the gap widest on our lightest circuits. Older
   arch, lower base clock, no reduced-precision tensor units; still
   serves a third concurrent stream as our candidate queue deepens.

GPU wall rises monotonically and roughly linearly with Σ Toffoli, with
no kernel-level surprise. Cross-host Σ Toffoli identity confirms
bit-exact reproducibility across our pool.
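
The per-architecture ratios quoted above can be recomputed directly
from the table. The sketch below is only arithmetic on our published
numbers; the names ``BENCH`` and ``ratios`` are illustrative and not
part of lumbda.

.. code-block:: python

   # (sigma Toffoli / shot, 3090 ms, 4090 ms, P40 ms), copied from the table.
   BENCH = {
       "v-fermat-schoolbook": (167_984, 290.1, 149.2, 531.7),
       "v-fermat-solinas": (84_208, 151.4, 86.3, 288.6),
       "v-by-text-schoolbook": (45_104, 68.4, 39.3, 131.8),
       "v-by-text-solinas": (40_176, 60.6, 35.6, 119.3),
       "v-by-ref-schoolbook": (29_104, 32.5, 25.7, 85.1),
       "v-by-ref-solinas": (24_176, 26.5, 22.0, 64.0),
   }


   def ratios(row):
       """Return (4090 speedup over 3090, P40 slowdown vs 3090) for a row."""
       _, t3090, t4090, tp40 = row
       return t3090 / t4090, tp40 / t3090


   for name, row in BENCH.items():
       speedup, slowdown = ratios(row)
       print(f"{name:22s} 4090 x{speedup:.2f} faster, P40 x{slowdown:.2f} slower")

On these numbers the 4090 speedup falls from 1.94 on our heaviest
variant to 1.20 on our lightest, while the P40 slowdown ranges from
1.83 to 2.62.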

--------------

Coordinator fan-out
-------------------

A cost-routed coordinator dispatches candidates across our 3-node mesh
concurrently. It greedily bin-packs by predicted Σ Toffoli: each
candidate lands on whichever node would have the lowest projected finish
time once that candidate is added to its queue. Per-host speed factors
(wall time relative to 3090) derive from our bench above:

-  3090: factor 1.00 (baseline)
-  4090: factor 0.51
-  P40: factor 1.84

One worker thread per node fires concurrently against our wire protocol.
Each node serializes its own queue (workers do not multiplex requests
cleanly inside one CUDA context).
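
The greedy placement described above can be sketched in a few lines of
Python. This is a minimal illustration, not our coordinator's source:
the function name ``greedy_schedule`` is hypothetical, and placing the
largest candidates first is an assumption (a common choice for greedy
bin-packing) that this page does not state.

.. code-block:: python

   # Per-host speed factors from the list above (3090 = baseline).
   FACTORS = {"3090": 1.00, "4090": 0.51, "P40": 1.84}


   def greedy_schedule(candidates, factors=FACTORS):
       """Assign (name, predicted_cost) pairs to nodes greedily.

       predicted_cost is any quantity proportional to baseline wall time,
       such as predicted sigma Toffoli. Returns (plan, finish), where plan
       maps each host to its ordered queue and finish maps each host to its
       projected finish time in cost units.
       """
       finish = {host: 0.0 for host in factors}
       plan = {host: [] for host in factors}
       # Assumption: largest candidates are placed first.
       for name, cost in sorted(candidates, key=lambda c: c[1], reverse=True):
           # Choose the node whose finish time after adding this job is lowest.
           host = min(factors, key=lambda h: finish[h] + cost * factors[h])
           finish[host] += cost * factors[host]
           plan[host].append(name)
       return plan, finish

Fed the six Σ Toffoli counts from our benchmark, this sketch sends the
heaviest variant to the 4090 and the second heaviest to the 3090, and
gives the P40 a single mid-sized variant. It ignores per-request
overhead entirely.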

First-run finding: a cost model based only on (predicted Σ Toffoli ×
host factor) misses our per-request overhead floor (~600 ms on cammy for
demo_ops process spawn + Pascal CUDA init). Refinement option: model
each node's request time as a per-host startup constant plus a linear
Σ Toffoli term, fit per node from our sweep history. Coordinator runs
correctly today; schedule shape needs calibration, not algorithmic
change.
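
One way to calibrate a startup-plus-linear model is an ordinary
least-squares fit per node, sketched below. The function names
``fit_affine`` and ``predict_ms`` are hypothetical, and the sketch
assumes our sweep history can be reduced to (Σ Toffoli, wall ms) pairs
per host.

.. code-block:: python

   def fit_affine(samples):
       """Least-squares fit of wall_ms = startup_ms + ms_per_toffoli * sigma.

       samples: iterable of (sigma_toffoli, wall_ms) pairs for ONE host.
       Returns (startup_ms, ms_per_toffoli).
       """
       pts = list(samples)
       n = len(pts)
       if n < 2:
           raise ValueError("need at least two samples to fit two parameters")
       mean_x = sum(x for x, _ in pts) / n
       mean_y = sum(y for _, y in pts) / n
       sxx = sum((x - mean_x) ** 2 for x, _ in pts)
       if sxx == 0:
           raise ValueError("all samples share one sigma; slope is undefined")
       sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
       slope = sxy / sxx
       return mean_y - slope * mean_x, slope


   def predict_ms(sigma, startup_ms, ms_per_toffoli):
       """Predicted request wall time for a candidate of the given size."""
       return startup_ms + ms_per_toffoli * sigma

For the startup term to capture the overhead floor, the samples would
need to be end-to-end request times (process spawn and CUDA init
included) rather than the GPU-only wall times in our benchmark table.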

--------------

Pool capacity math
------------------

At saturation, three nodes score three different candidates
simultaneously, which turns our sequential 6-variant 3090 wall (629.5
ms, our 3090 column summed) into a parallel one. Refined-Solinas runs in
~26.5 ms on 3090, 22 ms on 4090, & 64 ms on P40; P40 lags but
contributes a third concurrent stream as our candidate queue deepens.
Per-host routing (heavy circuits to faster GPUs, light circuits to P40)
maximizes aggregate dispatch.
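
To put a number on that capacity claim, the sketch below takes the
measured GPU wall times from our benchmark table and exhaustively
searches all 3\ :sup:`6` assignments of the six variants to three hosts
for the lowest makespan. It is a best-case estimate under the
assumption that each host runs its queue back to back with no
per-request overhead; the names ``WALL_MS``, ``sequential_ms``, and
``best_parallel_ms`` are illustrative.

.. code-block:: python

   from itertools import product

   HOSTS = ("3090", "4090", "P40")
   # Measured GPU wall (ms/batch) per variant, in HOSTS order.
   WALL_MS = {
       "v-fermat-schoolbook": (290.1, 149.2, 531.7),
       "v-fermat-solinas": (151.4, 86.3, 288.6),
       "v-by-text-schoolbook": (68.4, 39.3, 131.8),
       "v-by-text-solinas": (60.6, 35.6, 119.3),
       "v-by-ref-schoolbook": (32.5, 25.7, 85.1),
       "v-by-ref-solinas": (26.5, 22.0, 64.0),
   }


   def sequential_ms(host):
       """Wall time if one host runs all six variants back to back."""
       i = HOSTS.index(host)
       return sum(row[i] for row in WALL_MS.values())


   def best_parallel_ms():
       """Return (makespan_ms, assignment) minimizing the slowest host."""
       names = list(WALL_MS)
       best = None
       for choice in product(range(len(HOSTS)), repeat=len(names)):
           load = [0.0] * len(HOSTS)
           for name, h in zip(names, choice):
               load[h] += WALL_MS[name][h]
           makespan = max(load)
           if best is None or makespan < best[0]:
               best = (makespan, {n: HOSTS[h] for n, h in zip(names, choice)})
       return best

On our table this gives 629.5 ms sequential on the 3090 against a
best-case parallel makespan of 188.5 ms, roughly a 3.3× reduction
before any per-request overhead is counted.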

Pool throughput grows with our candidate generation rate — see our ecdsa
research page's "Why CPU Produces & GPU Scores" section for our
substrate-level math on candidate-emit vs candidate-score throughput.

--------------

Related pages
-------------

-  `foxhop secp256k1 lever
   sweep <https://www.foxhop.net/5243e3fe-6146-11f1-8ce9-040140774501/secp256k1-point-addition-challenge-with-lumbda-attack>`__
   — research context; full 18-cell cross-scale Pareto sweep; lumbda
   emit pipeline lift.
-  `lumbda.com/bend <https://lumbda.com/bend.html>`__ — bend primitive
   catalog; wire protocol; CUDA form list.

--------------

.. container::

   *Updated 2026-06-07.*
