<table class="provenance-header" style="border: 0; border-collapse: collapse; margin: 0 0 16px 0; width: 100%;">
<tr style="border: 0;">
<td style="border: 0; vertical-align: top; padding: 0 24px 0 0;">

> **Source:** [https://foxhop.net/053ba7da-6277-11f1-82fc-040140774501/gpu-mesh](https://foxhop.net/053ba7da-6277-11f1-82fc-040140774501/gpu-mesh)  
> **Snapshot:** 2026-06-07T17:43:12Z  
> **Generator:** Remarkbox `50b9d1e`  
>
> *This is a thread snapshot. The living document lives at the source URI above — it may have been edited, extended, or replied-to since.*

</td>
<td style="border: 0; vertical-align: top; width: 200px; text-align: right;">

![Scan for living source](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAKQAAACkAQAAAAAxzrjsAAABaElEQVR42tWXQY7DQAgEW/sB/v9LfsBSjSPlkMtKfchOotjjQwcVDYM1H1bri59Kqq6pKu2nvP3Rp/WXp5qaqW717G0/24juBrliG+XetLcxXfV+EVRYt6Y3VmOO6S7QpevMTZAvBnhbMT+wZBteyLgvw3cDXLYbeN0uo7t6m69GzpHvJcMXDyBLqDzpkB+KECHbXPhJ+UxmDJENthSL90lbN6BTfDdllFi78SxopTj08WWzRiOBId1yKZdZDMWX4WCmug5J5YX4upvRfXHYJi/lX0wgE3C8Mf8SZV9fn8Ob8hldwb6w11J8m8y9SvlxSKafHVVOonJBh84LDmKdz+wMTaouhojrzg24pHTlSMttzTxS/cyyA15sUbl6w8Buasr1B7giTWW4NiY2PyDtjN0MEZt3ykenxyj/QW4+o+e0y62Sc5/PNqDQ5JPzZDNPYjWsEePLYTwPYCXnSUymZzIJzpP/5z3rF2oMwgzq0at0AAAAAElFTkSuQmCC)

</td>
</tr>
</table>

# 3-GPU Mesh --- Hardware & Benchmarks

<div>

**Status:** 3-node mesh live on port 8320 (BEND)\
**Wire substrate:** lumbda\'s `bend` primitive
([lumbda.com](https://lumbda.com/bend.html){rel="nofollow"
target="_blank"})\
**Coordinator pattern:** cost-routed fan-out via greedy bin-pack

</div>

------------------------------------------------------------------------


## What this page tracks

Three Nvidia GPUs across three hosts on our foxhop LAN serve as a bend
mesh --- lumbda\'s CUDA dispatch primitive routes work to whichever node
finishes fastest. This page documents our hardware, our published
benchmarks, & our cost-routed coordinator pattern.

For our ecdsa research context that drove our first 18-cell bench
(refined Bernstein-Yang point-add variant sweep at p=251), see [foxhop
secp256k1 lever
sweep](https://www.foxhop.net/5243e3fe-6146-11f1-8ce9-040140774501/secp256k1-point-addition-challenge-with-lumbda-attack){rel="nofollow"
target="_blank"}.

------------------------------------------------------------------------

## Hardware

  host                   GPU         VRAM    arch
  ---------------------- ----------- ------- -------
  `3090-ai.foxhop.net`   RTX 3090    24 GB   sm_86
  `ai.foxhop.net`        RTX 4090    24 GB   sm_89
  `cammy.foxhop.net`     Tesla P40   24 GB   sm_61

72 GB combined VRAM across our pool. The Pascal (sm_61) P40 lacks our
newer architectures\' reduced-precision tensor units and runs roughly
1.8× to 2.6× slower than the Ampere 3090 in our benchmark, so it is
useful as a third concurrent stream rather than as a faster replacement
for either newer card.

------------------------------------------------------------------------

## Port 8320 --- BEND

Our pool serves bend dispatches on port **8320** across every host.
Mnemonic --- 8320 spells BEND:

    8 ~= B (implied infinity B flattened; bake a cake; baby & me)
    3 ~= E (backward)
    2 ~= N (pivoted 90 degrees)
    0 ~= D (flattened)

Each node runs `gpu-worker.lsp` (lumbda\'s reference CUDA worker)
against demo_ops, blake3-fanout, radix-sort, & our other registered CUDA
forms. Clients reach our mesh via lumbda\'s `bend` primitive over TCP,
S-expression wire format, & per-form binary protocols (`BSHK` for shake
fanout, `BSCP` for batched scalar mul, `BSRT` for radix sort, etc. ---
see [lumbda.com](https://lumbda.com/bend.html){rel="nofollow"
target="_blank"}/bend for our form catalog). Our bend wire stays
LAN-only; we do **not** expose our mesh outside our foxhop network.
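
A quick way to confirm that each node is listening on the BEND port from inside the LAN is a plain TCP connect check. The sketch below is a minimal illustration, not part of lumbda: `port_open` and `mesh_status` are hypothetical helper names, and it sends no bend traffic or S-expressions, since the wire format is documented on the lumbda side rather than here.

```python
import socket

# Hostnames come from the hardware table; 8320 is the BEND port.
MESH_HOSTS = ["3090-ai.foxhop.net", "ai.foxhop.net", "cammy.foxhop.net"]
BEND_PORT = 8320


def port_open(host: str, port: int = BEND_PORT, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def mesh_status(hosts=MESH_HOSTS, port: int = BEND_PORT) -> dict:
    """Map each host to its TCP reachability (hypothetical helper)."""
    return {host: port_open(host, port) for host in hosts}
```

Calling `mesh_status()` from a machine on the foxhop LAN would return one boolean per host; from outside the LAN every entry should come back `False`, since the wire is not exposed.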

------------------------------------------------------------------------

## Benchmark --- 6 variants × 3 hosts at p=251

Our first published benchmark dispatched 6 ecdsa point-add variants
(Bernstein-Yang lever sweep) at p=251 to every host through our wire
protocol. Σ Toffoli matches **byte-for-byte** across hosts on identical
ops.bin inputs --- proves our distributed mesh runs from one canonical
lumbda environment.

GPU wall (ms/batch) at 1024 shots, lower wins:

  variant                Σ Tof / shot   3090 ms   4090 ms   P40 ms
  ---------------------- -------------- --------- --------- --------
  v-fermat-schoolbook    167 984        290.1     149.2     531.7
  v-fermat-solinas       84 208         151.4     86.3      288.6
  v-by-text-schoolbook   45 104         68.4      39.3      131.8
  v-by-text-solinas      40 176         60.6      35.6      119.3
  v-by-ref-schoolbook    29 104         32.5      25.7      85.1
  v-by-ref-solinas       24 176         26.5      22.0      64.0

Per-architecture pattern reads cleanly:

-   **Ada 4090 (sm_89)** runs 1.94× faster than the Ampere 3090 (sm_86)
    on our heaviest circuit, shrinking to 1.20× on our lightest. Larger
    circuits amortize kernel-launch overhead; smaller circuits hit our
    GPU\'s overhead floor.
-   **Pascal P40 (sm_61)** runs 1.8× to 2.6× slower than 3090 across our
    variant chain, with the gap widest on our two lightest circuits.
    Older arch, lower base clock, no reduced-precision tensor units;
    still serves a third concurrent stream as our candidate queue
    deepens.
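
The ratios quoted above follow directly from the benchmark table. The short script below recomputes them per variant; the numbers are copied from the table, and the `ratio` helper is only an illustrative name.

```python
# GPU wall in ms/batch at 1024 shots, copied from the benchmark table.
WALL_MS = {
    "v-fermat-schoolbook":  {"3090": 290.1, "4090": 149.2, "P40": 531.7},
    "v-fermat-solinas":     {"3090": 151.4, "4090": 86.3,  "P40": 288.6},
    "v-by-text-schoolbook": {"3090": 68.4,  "4090": 39.3,  "P40": 131.8},
    "v-by-text-solinas":    {"3090": 60.6,  "4090": 35.6,  "P40": 119.3},
    "v-by-ref-schoolbook":  {"3090": 32.5,  "4090": 25.7,  "P40": 85.1},
    "v-by-ref-solinas":     {"3090": 26.5,  "4090": 22.0,  "P40": 64.0},
}


def ratio(variant: str, slow: str, fast: str) -> float:
    """Wall-time ratio slow/fast for one variant (values above 1 mean 'fast' wins)."""
    row = WALL_MS[variant]
    return row[slow] / row[fast]


for variant in WALL_MS:
    speedup_4090 = ratio(variant, "3090", "4090")
    slowdown_p40 = ratio(variant, "P40", "3090")
    print(f"{variant:22s} 4090 vs 3090: {speedup_4090:.2f}x   "
          f"P40 vs 3090: {slowdown_p40:.2f}x")
```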

GPU wall scales roughly linearly with Σ Toffoli, with no kernel-level
surprise. Cross-host Σ Toffoli identity confirms bit-exact
reproducibility across our pool.

------------------------------------------------------------------------

## Coordinator fan-out

A cost-routed coordinator dispatches candidates across our 3-node mesh
concurrently. Greedy bin-pack by predicted Σ Toffoli; each candidate
lands on whichever node\'s finish time after add stays lowest. Per-host
speed factor derived from our bench above:

-   3090: factor 1.00 (baseline)
-   4090: factor 0.51
-   P40: factor 1.84
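
The greedy rule can be sketched in a few lines. This is an illustration under stated assumptions, not the coordinator's actual code: `greedy_schedule` is a hypothetical name, cost is taken as predicted Σ Toffoli × host factor with no startup term, and candidates are visited heaviest-first (an ordering the page does not specify).

```python
# Per-host speed factors from the list above (3090 is the baseline).
HOST_FACTOR = {"3090": 1.00, "4090": 0.51, "P40": 1.84}


def greedy_schedule(candidates: dict, factors: dict = HOST_FACTOR):
    """Assign each candidate to the host whose finish time after adding it is lowest.

    candidates maps a name to its predicted Σ Toffoli.
    Returns (assignment, finish): per-host job lists and per-host cost totals.
    """
    finish = {host: 0.0 for host in factors}
    assignment = {host: [] for host in factors}
    # Assumption: heaviest candidates are placed first.
    ordered = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    for name, toffoli in ordered:
        host = min(factors, key=lambda h: finish[h] + toffoli * factors[h])
        finish[host] += toffoli * factors[host]
        assignment[host].append(name)
    return assignment, finish


# Σ Toffoli per shot for the six benchmark variants.
VARIANTS = {
    "v-fermat-schoolbook": 167984,
    "v-fermat-solinas": 84208,
    "v-by-text-schoolbook": 45104,
    "v-by-text-solinas": 40176,
    "v-by-ref-schoolbook": 29104,
    "v-by-ref-solinas": 24176,
}

assignment, finish = greedy_schedule(VARIANTS)
for host, jobs in assignment.items():
    print(host, jobs, round(finish[host]))
```

With these inputs the heaviest circuit lands on the 4090 and the P40 receives a single mid-weight circuit, which matches the intent of routing heavy work to faster cards.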

One worker thread per node fires concurrently against our wire protocol.
Each node serializes its own queue (workers do not multiplex requests
cleanly inside one CUDA context).
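
The one-thread-per-node shape could look like the sketch below. It is a simplified model, not our coordinator: `run_mesh` is a hypothetical name, and `dispatch` stands in for whatever blocking call actually sends a candidate over the wire. Each thread walks its own list in order, so a node never has more than one request in flight.

```python
import threading


def run_mesh(plan: dict, dispatch) -> dict:
    """Run every host's job list concurrently, one worker thread per host.

    plan maps host -> ordered list of jobs.
    dispatch(host, job) is a caller-supplied blocking function (hypothetical).
    Returns {(host, job): result}.
    """
    results = {}
    lock = threading.Lock()

    def worker(host, jobs):
        for job in jobs:  # serial per node: one in-flight request at a time
            outcome = dispatch(host, job)
            with lock:
                results[(host, job)] = outcome

    threads = [
        threading.Thread(target=worker, args=(host, jobs))
        for host, jobs in plan.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

A plan produced by a scheduler (host mapped to job list) can be passed straight in, with `dispatch` wrapping the real client call.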

First-run finding: a cost model based only on (predicted Σ Toffoli × host
factor) misses our per-request overhead floor (\~600 ms on cammy for
demo_ops process spawn + Pascal CUDA init). Refinement: model each host
as a startup constant plus a linear Σ Toffoli term, fit per node from
our sweep history. Coordinator runs correctly today; schedule shape
needs calibration, not algorithmic change.
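
A startup-constant-plus-linear-term model can be fit with ordinary least squares. The sketch below uses the GPU wall numbers from the benchmark table as stand-in data, which is an assumption: those figures exclude process spawn and CUDA init, so the fitted intercept here is not the \~600 ms floor. A real calibration would fit end-to-end request times from sweep history instead. `fit_line` and `predict_ms` are illustrative names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Σ Toffoli per shot and GPU wall (ms/batch) from the benchmark table.
TOFFOLI = [167984, 84208, 45104, 40176, 29104, 24176]
WALL = {
    "3090": [290.1, 151.4, 68.4, 60.6, 32.5, 26.5],
    "4090": [149.2, 86.3, 39.3, 35.6, 25.7, 22.0],
    "P40":  [531.7, 288.6, 131.8, 119.3, 85.1, 64.0],
}

# One (startup_ms, ms_per_toffoli) pair per host.
MODEL = {host: fit_line(TOFFOLI, ys) for host, ys in WALL.items()}


def predict_ms(host: str, toffoli: float) -> float:
    """Predicted wall time: per-host startup constant plus linear Σ Toffoli term."""
    startup_ms, ms_per_toffoli = MODEL[host]
    return startup_ms + ms_per_toffoli * toffoli


for host, (startup_ms, ms_per_toffoli) in MODEL.items():
    print(f"{host}: startup {startup_ms:.1f} ms, {ms_per_toffoli * 1000:.3f} ms per 1000 Toffoli")
```

Swapping `predict_ms` in for the plain Σ Toffoli × factor cost leaves the greedy placement rule unchanged, which is the sense in which the fix is calibration rather than a new algorithm.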

------------------------------------------------------------------------

## Pool capacity math

At saturation, our coordinator keeps three nodes scoring three different
candidates simultaneously, turning our sequential 6-variant 3090 wall
(629 ms) into a parallel one. Refined-Solinas (v-by-ref-solinas) runs in
\~26.5 ms on 3090, 22 ms on 4090, & 64 ms on P40; P40 lags but
contributes a third concurrent stream as our candidate queue deepens.
Per-host routing (heavy circuits to faster GPUs, light circuits to P40)
maximizes aggregate dispatch.
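
To put a number on the parallel wall, the sketch below enumerates every way of placing the six benchmark variants on three hosts (3^6 = 729 assignments) and keeps the one with the smallest makespan. It uses the measured GPU wall times only and ignores per-request overhead, so it is a best-case bound rather than a prediction; `best_assignment` is an illustrative name.

```python
from itertools import product

HOSTS = ("3090", "4090", "P40")

# GPU wall in ms/batch per variant, in HOSTS order, from the benchmark table.
WALL_MS = {
    "v-fermat-schoolbook":  (290.1, 149.2, 531.7),
    "v-fermat-solinas":     (151.4, 86.3, 288.6),
    "v-by-text-schoolbook": (68.4, 39.3, 131.8),
    "v-by-text-solinas":    (60.6, 35.6, 119.3),
    "v-by-ref-schoolbook":  (32.5, 25.7, 85.1),
    "v-by-ref-solinas":     (26.5, 22.0, 64.0),
}


def best_assignment(wall=WALL_MS):
    """Brute-force the variant-to-host mapping with the lowest makespan."""
    variants = list(wall)
    best = None
    for choice in product(range(len(HOSTS)), repeat=len(variants)):
        load = [0.0] * len(HOSTS)
        for variant, host_idx in zip(variants, choice):
            load[host_idx] += wall[variant][host_idx]
        makespan = max(load)  # the slowest host sets the parallel wall
        if best is None or makespan < best[0]:
            plan = {v: HOSTS[h] for v, h in zip(variants, choice)}
            best = (makespan, plan)
    return best


sequential_3090 = sum(times[0] for times in WALL_MS.values())
parallel_ms, plan = best_assignment()
print(f"sequential on 3090: {sequential_3090:.1f} ms")
print(f"best parallel wall: {parallel_ms:.1f} ms")
for variant, host in plan.items():
    print(f"  {variant:22s} -> {host}")
```

On the table's numbers the best placement comes out at about 188.5 ms against 629.5 ms sequential on the 3090, with the two heaviest circuits on the 4090 and 3090 and lighter ones filling in on the P40.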

Pool throughput grows with our candidate generation rate --- see our
ecdsa research page\'s \"Why CPU Produces & GPU Scores\" section for our
substrate-level math on candidate-emit vs candidate-score throughput.

------------------------------------------------------------------------

## Related pages

-   [foxhop secp256k1 lever
    sweep](https://www.foxhop.net/5243e3fe-6146-11f1-8ce9-040140774501/secp256k1-point-addition-challenge-with-lumbda-attack){rel="nofollow"
    target="_blank"} --- research context; full 18-cell cross-scale
    Pareto sweep; lumbda emit pipeline lift.
-   [lumbda.com/bend](https://lumbda.com/bend.html){rel="nofollow"
    target="_blank"} --- bend primitive catalog; wire protocol; CUDA
    form list.

------------------------------------------------------------------------

<div>

*Updated 2026-06-07.*

</div>


