Provisioning on demand

audience: ai

Reads feed observation (market-reads) and the fleet wraps the cloud APIs (wrapping); this chapter covers grant handling. The provider loop decrypts each grant’s shuffle envelope, cross-checks the image hash, routes through the fleet to a backend, returns a sealed SSH receipt, and emits the usage log when the workload terminates.

web3’s matching chapter is inference on Compute — the searcher acquires compute from the Compute module, the bridge serves it. Both bond against the same module’s Config; both fail the same way if the composition does not clear.

The provider loop

From src/provider.rs:

pub async fn run(self) -> anyhow::Result<()> {
    let provider_id = self.register_provider_card().await?;

    let mut grants = coalition_compute::grants_for(
        &self.network, self.compute, provider_id,
    ).await?;

    let mut refresh = tokio::time::interval(
        Duration::from_secs(
            self.config.capacity_refresh_sec as u64,
        ),
    );

    loop {
        tokio::select! {
            Some(grant) = grants.next() => {
                if let Err(err) = self.handle_grant(&grant).await {
                    tracing::warn!(
                        request_id = ?grant.request_id,
                        error = %err,
                        "grant handling failed",
                    );
                }
            }
            _ = refresh.tick() => {
                let _ = self.refresh_provider_card(provider_id).await;
            }
            else => break,
        }
    }
    Ok(())
}

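Grant handling and card refresh share one select!: each loop iteration runs whichever branch becomes ready first. A grant is handled inline, so a refresh tick that fires during handle_grant waits until it returns. A failed grant is logged and the loop continues.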
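The control flow can be modelled without an async runtime. The sketch below is a synchronous stand-in, not the code in src/provider.rs: Event, LoopStats, and drive are hypothetical names, and a slice of events replaces the grant stream and the timer. It illustrates two properties of the loop: one event is handled per iteration, and a failed grant is recorded and skipped rather than ending the loop.

```rust
/// Hypothetical stand-in for what one loop iteration can observe.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    /// A grant arrived; `ok` models whether handling succeeds.
    Grant { ok: bool },
    /// The refresh timer fired.
    RefreshTick,
}

/// Counters kept by the model loop (hypothetical).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct LoopStats {
    pub grants_ok: u32,
    pub grants_failed: u32,
    pub refreshes: u32,
}

/// Handle events one at a time, in arrival order.
pub fn drive(events: &[Event]) -> LoopStats {
    let mut stats = LoopStats::default();
    for event in events {
        match event {
            Event::Grant { ok: true } => stats.grants_ok += 1,
            // Mirrors the warn-and-continue arm: record the failure, keep looping.
            Event::Grant { ok: false } => stats.grants_failed += 1,
            Event::RefreshTick => stats.refreshes += 1,
        }
    }
    stats
}
```

Feeding the model a failed grant between two successful ones leaves all three counted, which is the behaviour the warn-and-continue arm is meant to give.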

Handling one grant

The handle_grant function in the same file:

async fn handle_grant(
    &self, grant: &ComputeGrant<'_>,
) -> anyhow::Result<()> {
    // 1. Decrypt the shuffle envelope to learn the
    //    requester's peer_id and image payload.
    let envelope = self.zipnet
        .resolve(&grant.bearer_pointer).await?;

    // 2. Cross-check the envelope's image hash against
    //    the grant's. Under honest unseal this should
    //    never fire; firing means the scheduler
    //    committee is compromised or the unseal
    //    quorum broke.
    if envelope.image_hash() != grant.image_hash {
        anyhow::bail!(
            "envelope/grant image hash mismatch"
        );
    }

    // 3. Route through the fleet.
    let instance = self.fleet
        .provision_for_grant(grant, &envelope).await?;

    // 4. Seal an SSH receipt to the requester's x25519
    //    public key; return via the shuffle.
    let receipt = SshAccessReceipt::build(&instance, grant)?;
    let sealed = receipt.seal_to(envelope.peer_x25519_public())?;
    self.zipnet.reply(&grant.request_id, sealed).await?;

    // 5. Spawn a watcher that emits the ComputeLog on
    //    instance exit or deadline.
    let fleet = self.fleet.clone();
    let network = self.network.clone();
    let request_id = grant.request_id;
    let valid_to = grant.valid_to;
    let instance_clone = instance.clone();
    let provider_id = instance.provider_id();
    tokio::spawn(async move {
        let usage = fleet
            .watch_until_exit(&instance_clone, valid_to)
            .await.unwrap_or_default();

        let log = ComputeLog {
            grant_id: request_id,
            provider: provider_id,
            window:   UsageMetrics::window_for(&usage),
            usage:    usage.clone(),
            evidence: None,
        };
        let _ = coalition_compute::append_log(&network, &log).await;
    });

    Ok(())
}

Five steps. Each is a self-check.

1. Decrypt the shuffle envelope

The grant carries a bearer_pointer — a blake3 pointer into the shuffle-sealed envelope the requester submitted. The unseal committee, majority-honest, makes cleartext available to the addressed provider:

// src/zipnet_io.rs — Envelope shape
pub struct Envelope {
    peer_id:          [u8; 32],
    peer_x25519:      [u8; 32],
    image_hash:       UniqueId,
    requested_region: Option<String>,
    image_pointer:    Vec<u8>,
}

The bridge sees a rotating peer_id (so the requester’s coalition identity stays hidden), the requester’s x25519 public key for the receipt, the image hash to serve, an optional region hint, and a pointer to fetch the image contents. It does not see the requester’s ClientId, the bid value, or which other providers were considered before clearing.

2. Image-hash cross-check

The envelope’s declared image_hash and the grant’s committed image_hash must match. A mismatch means either the unseal quorum returned the wrong cleartext or the scheduler committee broke. Neither is supposed to happen under honest operation. The check is defensive. When it fires the bridge aborts the grant without attempting to provision.
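The check can also be expressed with a typed error instead of a string, which lets a caller tell this failure apart from others. This is a minimal sketch under assumptions: ImageHash is a hypothetical 32-byte stand-in for UniqueId, and CrossCheckError and cross_check are illustrative names that do not appear in the source.

```rust
/// Hypothetical 32-byte stand-in for the chapter's `UniqueId`.
pub type ImageHash = [u8; 32];

/// Illustrative error type; the excerpt itself uses `anyhow::bail!`.
#[derive(Debug, PartialEq, Eq)]
pub enum CrossCheckError {
    /// Envelope and grant disagree on the image to serve.
    ImageHashMismatch { envelope: ImageHash, grant: ImageHash },
}

/// Return `Ok(())` only when both sides commit to the same image.
pub fn cross_check(
    envelope: &ImageHash,
    grant: &ImageHash,
) -> Result<(), CrossCheckError> {
    if envelope == grant {
        Ok(())
    } else {
        Err(CrossCheckError::ImageHashMismatch {
            envelope: *envelope,
            grant: *grant,
        })
    }
}
```

Carrying both hashes in the error is a design choice for diagnostics: an operator investigating a fired check can see what each side claimed without re-resolving the envelope.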

3. Fleet routing

The fleet picks the first backend whose can_satisfy returns true (see wrapping — the Fleet router). The returned ProvisionedInstance.backend identifies which backend served the grant. The bridge records it on the dashboard:

self.dashboard.record(DashboardEvent::GrantAccepted {
    backend: instance.backend.to_string(),
}).await;

No requester identity. No instance id.
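First-match routing, as described for the fleet, can be sketched as follows. Only can_satisfy is named in the text; the Request fields, the Backend trait shape, StaticBackend, and route are assumptions made for illustration, and the real router is covered in the wrapping chapter.

```rust
/// Hypothetical request shape; the real grant carries more fields.
pub struct Request {
    pub cpu_cores: u32,
    pub region: Option<String>,
}

/// Hypothetical backend trait; only `can_satisfy` is named in the text.
pub trait Backend {
    fn name(&self) -> &str;
    fn can_satisfy(&self, req: &Request) -> bool;
}

/// A toy backend with fixed capacity in a single region.
pub struct StaticBackend {
    pub name: String,
    pub free_cores: u32,
    pub region: String,
}

impl Backend for StaticBackend {
    fn name(&self) -> &str {
        &self.name
    }

    fn can_satisfy(&self, req: &Request) -> bool {
        // A missing region hint matches any region.
        let region_ok = req.region.as_deref().map_or(true, |r| r == self.region);
        region_ok && req.cpu_cores <= self.free_cores
    }
}

/// First-match routing: the order of `backends` is the priority order.
/// Returns the name of the chosen backend, or `None` if none can serve.
pub fn route<'a>(backends: &'a [Box<dyn Backend>], req: &Request) -> Option<&'a str> {
    backends
        .iter()
        .find(|b| b.can_satisfy(req))
        .map(|b| b.name())
}
```

Because the first match wins, the ordering of the backend list acts as the operator's preference: a cheaper backend listed first serves every grant it can, and later entries only absorb the overflow.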

4. Sealed receipt

Chapter 6 (receipts) covers sealing. All that matters here is that the receipt is sealed to the requester’s x25519 public key published in the envelope, and that the shuffle reply channel carries the sealed blob.

5. Usage-log watcher

A task started with tokio::spawn runs fleet.watch_until_exit for the duration of the grant. When the instance terminates, either because the workload completed or because valid_to passed, the watcher collects metrics and appends a ComputeLog to the Compute module’s log stream.
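The two termination conditions reduce to a small decision. The sketch below is an assumption-laden model, not the body of watch_until_exit: StopReason, stop_point, and cpu_core_seconds are hypothetical names, and ticks are assumed to be seconds.

```rust
/// Why the watcher stopped observing the instance (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopReason {
    WorkloadExited,
    DeadlineReached,
}

/// Decide when and why observation stops.
/// `exit_tick` is `None` if the workload is still running at `valid_to`.
pub fn stop_point(start: u64, exit_tick: Option<u64>, valid_to: u64) -> (u64, StopReason) {
    match exit_tick {
        // The workload finished inside the granted window.
        Some(t) if t <= valid_to => (t.max(start), StopReason::WorkloadExited),
        // Still running, or exited after the deadline: cap at `valid_to`.
        _ => (valid_to.max(start), StopReason::DeadlineReached),
    }
}

/// CPU core-seconds for the observed window, assuming a constant core count.
pub fn cpu_core_seconds(cores: u64, start: u64, stop: u64) -> u64 {
    cores.saturating_mul(stop.saturating_sub(start))
}
```

For a 4-core instance started at tick 100 that exits at tick 160, the model stops at 160 and reports 240 core-seconds; if the workload never exits, usage is capped at valid_to.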

The ComputeLog is what the scheduler committee and the reputation organism both read:

pub struct ComputeLog<'a> {
    pub grant_id: UniqueId,
    pub provider: ProviderId,
    pub window:   AlmanacRange,
    pub usage:    UsageMetrics,
    pub evidence: Option<EvidencePointer<'a>>,
}

pub struct UsageMetrics {
    pub cpu_core_seconds: u64,
    pub ram_mib_seconds:  u64,
    pub net_bytes:        u64,
}

The log names the grant and the provider so the committee can correlate cleared grants to completed workloads. Grants that clear to a provider but never appear in the log stream score the provider down on the reputation organism. The bridge reports measured cpu-seconds, ram-mib-seconds, and network bytes: what the backend actually observed. A bridge that over-reports is auditable because the committee can cross-check against the cloud API’s own usage records (when the committee has been granted access; otherwise the reputation organism is the only feedback). The evidence field points into the cloud’s API-side telemetry when available (typically None for bare-metal, a signed telemetry snapshot for clouds).
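Correlating cleared grants with the log stream amounts to a set difference. The following is a minimal sketch, assuming grant ids can be compared directly; GrantId is a hypothetical u64 stand-in for UniqueId and unlogged_grants is an illustrative name, not a function of the Compute module.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the chapter's `UniqueId`.
pub type GrantId = u64;

/// Grants that cleared to a provider but have no ComputeLog entry.
/// The result follows the order of `cleared`, so it is deterministic.
pub fn unlogged_grants(cleared: &[GrantId], logged: &[GrantId]) -> Vec<GrantId> {
    let seen: HashSet<GrantId> = logged.iter().copied().collect();
    cleared
        .iter()
        .copied()
        .filter(|id| !seen.contains(id))
        .collect()
}
```

Log entries for grants that never cleared to the provider (id 9 in a stream of 2, 4, 9 against cleared grants 1 to 4) are ignored by this sketch; a reader of the stream might want to flag them separately.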

Usage honesty

The ComputeLog stream is what keeps a bridge running. Every active window produces logs; the reputation organism (chapter 7) scores them; the scheduler committee’s next clearing consults the score.

Logs that arrive late (after the grant deadline) indicate dropped grants. Under-reporting, where the requester’s SSH session measures more cpu-seconds than the declared total, indicates cheating on billing. Over-reporting means inflated bills. None of these is caught inside Compute; the reputation organism reading the stream catches them. A bridge optimising for persistence optimises for log honesty.
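The three signals can be written as a classifier over one log. This is a sketch of how a reader of the stream might apply them, not the reputation organism's scoring: Finding, LogObservation, findings, and the tolerance parameter are all assumptions introduced here.

```rust
/// Hypothetical labels for the three dishonesty signals.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Finding {
    Late,
    UnderReported,
    OverReported,
}

/// What a reader of the stream knows about one log (hypothetical shape).
pub struct LogObservation {
    pub logged_at: u64,
    pub valid_to: u64,
    pub declared_cpu_core_seconds: u64,
    /// Independent measurement, when one exists.
    pub measured_cpu_core_seconds: Option<u64>,
}

/// Collect every signal the observation raises.
/// `tolerance` absorbs measurement noise between the two sides (assumed).
pub fn findings(obs: &LogObservation, tolerance: u64) -> Vec<Finding> {
    let mut out = Vec::new();
    if obs.logged_at > obs.valid_to {
        out.push(Finding::Late);
    }
    // Without an independent measurement, only lateness can be judged.
    if let Some(measured) = obs.measured_cpu_core_seconds {
        let declared = obs.declared_cpu_core_seconds;
        if measured > declared.saturating_add(tolerance) {
            out.push(Finding::UnderReported);
        } else if declared > measured.saturating_add(tolerance) {
            out.push(Finding::OverReported);
        }
    }
    out
}
```

A log can raise more than one finding, for example late and over-reported, which is why the function returns a list rather than a single label.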

Capacity refresh

Alongside grant handling, in the same select loop, the provider refreshes its card on a timer:

async fn refresh_provider_card(
    &self, id: ProviderId,
) -> anyhow::Result<()> {
    let capabilities = self.fleet.capabilities().await?;
    let card = ProviderCard {
        provider_id:    id,
        tdx_quote:      self.tdx_quote.clone(),
        capabilities,
        declared_rates: self.config.declared_rates,
        zipnet_reply:   self.zipnet.reply_pointer(),
        refreshed_at:   self.network.almanac().tick(),
    };
    coalition_compute::register(
        &self.network, self.compute, card,
    ).await.map(|_| ())
}

capacity_refresh_sec in ProviderBootConfig sets the cadence. Too slow and rate changes take long to reach the market; the card carries stale capacity during fast-moving demand. Too fast and every refresh commits to the ProviderCard collection, driving up the module’s read-side bandwidth. Operators settle in the 30–120 second range.
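The trade-off is easy to put in numbers. The helpers below are a hypothetical sketch (refreshes_per_day and in_typical_range are not part of ProviderBootConfig); they count how many card registrations a given cadence produces per day and test a value against the 30 to 120 second range mentioned above.

```rust
/// Seconds in one day.
const SECS_PER_DAY: u32 = 86_400;

/// Card registrations per day for a given refresh interval.
/// Returns `None` for zero, which is not a usable interval
/// (`tokio::time::interval` panics on a zero period).
pub fn refreshes_per_day(capacity_refresh_sec: u32) -> Option<u32> {
    if capacity_refresh_sec == 0 {
        None
    } else {
        Some(SECS_PER_DAY / capacity_refresh_sec)
    }
}

/// Whether a cadence falls in the range operators are said to settle in.
pub fn in_typical_range(capacity_refresh_sec: u32) -> bool {
    (30..=120).contains(&capacity_refresh_sec)
}
```

At 30 seconds a provider registers its card 2,880 times a day; at 120 seconds, 720 times. The price of the slower cadence is that a capacity change can take up to two minutes to reach the market.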

Grant failure modes

A grant whose handling fails at any step produces a missing or partial ComputeLog:

Shuffle resolve failure: the unseal quorum did not return cleartext, or the bearer pointer was malformed. The bridge has nothing to serve; no log is emitted; the committee scores the bridge down only if this correlates with a reputation-organism observation that the bridge should have served the grant.
Image-hash mismatch: step 2 fires. The bridge bails; no log.
Fleet provision_for_grant fails: no backend could satisfy the grant (cloud-side exhaustion; every eligible backend returns false). The bridge emits a log with empty usage to signal acknowledgement, so reputation organisms can distinguish “tried but failed” from “never responded”. The handle_grant excerpt above returns early on this error and does not show that emission.
Workload crashes pre-SSH: same empty-usage log.
Workload exceeds valid_to: the watcher terminates the instance and emits usage up to valid_to. The requester renews.
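The failure modes above map onto three log outcomes, which can be summarised as a lookup. The enum and function names here are hypothetical and exist only to make the mapping explicit; they do not appear in the source.

```rust
/// The five failure modes listed above (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Failure {
    ShuffleResolve,
    ImageHashMismatch,
    ProvisionFailed,
    CrashedPreSsh,
    DeadlineExceeded,
}

/// What the log stream shows for a failed grant (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogOutcome {
    /// Nothing is appended.
    NoLog,
    /// A log with zeroed usage: "tried but failed".
    EmptyUsage,
    /// Usage measured up to `valid_to`.
    UsageUpToDeadline,
}

/// Map each failure mode to the log outcome described in the text.
pub fn log_outcome(failure: Failure) -> LogOutcome {
    match failure {
        Failure::ShuffleResolve | Failure::ImageHashMismatch => LogOutcome::NoLog,
        Failure::ProvisionFailed | Failure::CrashedPreSsh => LogOutcome::EmptyUsage,
        Failure::DeadlineExceeded => LogOutcome::UsageUpToDeadline,
    }
}
```

Writing the match without a wildcard arm means the compiler rejects the function if a new failure mode is added without deciding what it logs.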

No market-maker variant

The market-maker variant exists on the consumer side because a market-maker’s inference cadence differs from a searcher’s. The provider side does not have one — the bridge serves whatever distribution of grants the market clears to it; rapid quote-resubmission grants and once-a-day training grants go through the same handle_grant path.
