Wrapping cloud and bare-metal backends

audience: ai

A compute-bridge’s edge comes from spanning backends: AWS, GCP, Azure, and bare-metal sit behind one provider identity, covering a combination of regions, pricing, and TDX hardware profiles that no single cloud offers. This chapter walks through the pattern that makes that possible: a small Backend trait and a Fleet router.

The question: when the bridge’s edge is multi-backend coverage, how are four radically different cloud and bare-metal APIs folded into one ProviderCard the Compute market can reason about?

The `Backend` trait

From src/backends/mod.rs:

#[async_trait]
pub trait Backend: Send + Sync {
    fn name(&self) -> &'static str;
    async fn capabilities(&self) -> anyhow::Result<Capabilities>;
    fn can_satisfy(&self, grant: &ComputeGrant<'_>) -> bool;
    async fn provision(
        &self,
        grant: &ComputeGrant<'_>,
        envelope: &Envelope,
    ) -> anyhow::Result<ProvisionedInstance>;
    async fn watch_until_exit(
        &self,
        instance: &ProvisionedInstance,
        valid_to: AlmanacTick,
    ) -> anyhow::Result<UsageMetrics>;
    async fn terminate(
        &self, instance: &ProvisionedInstance,
    ) -> anyhow::Result<()>;
}

Six methods. A new backend class (Oracle Cloud, Hetzner, a decentralised-compute protocol) slots in by implementing these six and appending itself to Fleet::from_boot_config. The trait is narrow on purpose.

The `Fleet` router

Also in src/backends/mod.rs:

pub struct Fleet {
    backends: Arc<Vec<Arc<dyn Backend>>>,
}

impl Fleet {
    pub async fn from_boot_config(
        cfg: &BackendsBootConfig,
    ) -> anyhow::Result<Self> {
        let mut backends: Vec<Arc<dyn Backend>> =
            Vec::new();

        if let Some(aws) = &cfg.aws {
            backends.push(Arc::new(
                aws::AwsBackend::new(aws).await?,
            ));
        }
        if let Some(gcp) = &cfg.gcp {
            backends.push(Arc::new(
                gcp::GcpBackend::new(gcp).await?,
            ));
        }
        if let Some(azure) = &cfg.azure {
            backends.push(Arc::new(
                azure::AzureBackend::new(azure).await?,
            ));
        }
        if let Some(bm) = &cfg.baremetal {
            backends.push(Arc::new(
                baremetal::BareMetalBackend::new(bm).await?,
            ));
        }

        if backends.is_empty() {
            anyhow::bail!(
                "no backends configured — enable at least \
                 one of aws / gcp / azure / baremetal in \
                 the boot config"
            );
        }
        Ok(Self { backends: Arc::new(backends) })
    }

    pub async fn provision_for_grant(
        &self,
        grant: &ComputeGrant<'_>,
        envelope: &Envelope,
    ) -> anyhow::Result<ProvisionedInstance> {
        for b in self.backends.iter() {
            if b.can_satisfy(grant) {
                return b.provision(grant, envelope).await;
            }
        }
        anyhow::bail!(
            "no backend can satisfy grant {:?}",
            grant.request_id,
        )
    }
}

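First match wins, in the order the backends sit in the Fleet's list. As written, from_boot_config pushes them as AWS, GCP, Azure, bare-metal, so that is the default preference. Operators who want a different order (cheapest first, lowest-latency first, fewest-active-grants first) arrange the backends in that order. A more sophisticated router is a policy the operator can layer on top; the book leaves the decision open.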
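One way to get a preference order without touching the first-match loop is to sort the backend list once at boot. The following is a minimal sketch, not the crate's code: `Grant` and `BackendInfo` are simplified, synchronous stand-ins for `ComputeGrant` and `dyn Backend`, and the `preference` field is a hypothetical operator-supplied rank.

```rust
/// Simplified stand-in for `ComputeGrant` (assumption, not the crate type).
#[derive(Debug, Clone)]
pub struct Grant {
    pub region: String,
    pub tdx_required: bool,
}

/// Simplified stand-in for a configured backend (assumption).
#[derive(Debug, Clone)]
pub struct BackendInfo {
    pub name: &'static str,
    pub regions: Vec<String>,
    pub tdx_capable: bool,
    /// Hypothetical operator-supplied rank: lower is preferred.
    pub preference: u32,
}

impl BackendInfo {
    pub fn can_satisfy(&self, grant: &Grant) -> bool {
        (!grant.tdx_required || self.tdx_capable)
            && self.regions.contains(&grant.region)
    }
}

/// Sort once at boot so the first-match loop sees the operator's order.
/// `sort_by_key` is stable, so equal ranks keep their original order.
pub fn order_by_preference(mut backends: Vec<BackendInfo>) -> Vec<BackendInfo> {
    backends.sort_by_key(|b| b.preference);
    backends
}

/// First match wins, mirroring the loop in `provision_for_grant`.
pub fn pick<'a>(backends: &'a [BackendInfo], grant: &Grant) -> Option<&'a BackendInfo> {
    backends.iter().find(|b| b.can_satisfy(grant))
}
```

With this shape the router itself stays unchanged; only the order of the list encodes the policy. A dynamic policy such as fewest-active-grants would need to re-rank per grant instead of once at boot.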

The four backends

Each backend is one file in src/backends/. Representative shapes:

AWS

// src/backends/aws.rs

pub struct AwsBackend {
    ec2:    aws_sdk_ec2::Client,
    cfg:    AwsBootConfig,
}

#[async_trait]
impl Backend for AwsBackend {
    fn name(&self) -> &'static str { "aws" }

    async fn capabilities(&self) -> anyhow::Result<Capabilities> {
        Ok(Capabilities {
            regions:      self.cfg.regions.clone(),
            tdx_capable:  false, // pending AWS Nitro-TDX GA
            max_cpu_millicores: self.cfg.max_concurrent_instances
                                    * cores_per_family(&self.cfg.instance_families),
            max_ram_mib:  self.cfg.max_concurrent_instances
                          * ram_per_family(&self.cfg.instance_families),
        })
    }

    fn can_satisfy(&self, grant: &ComputeGrant<'_>) -> bool {
        !grant.tdx_required
            && self.cfg.regions.contains(&grant.region)
    }

    async fn provision(
        &self,
        grant: &ComputeGrant<'_>,
        envelope: &Envelope,
    ) -> anyhow::Result<ProvisionedInstance> {
        // 1. Pick an instance family that fits.
        // 2. RunInstances with cloud-init fetching the
        //    image pointer and verifying the hash.
        // 3. Wait for IP; poll ssh readiness.
        // 4. Build ProvisionedInstance with per-grant
        //    SSH key.
        todo!("AwsBackend::provision — see crate source")
    }
    // … watch_until_exit, terminate
}

GCP

Similar shape. The difference is TDX: tdx_capable is derived from the configured tdx_machine_types instead of being hard-coded to false:

// src/backends/gcp.rs

#[async_trait]
impl Backend for GcpBackend {
    fn name(&self) -> &'static str { "gcp" }

    async fn capabilities(&self) -> anyhow::Result<Capabilities> {
        Ok(Capabilities {
            regions:     self.cfg.regions.clone(),
            tdx_capable: !self.cfg.tdx_machine_types.is_empty(),
            max_cpu_millicores: self.cfg.max_concurrent_instances
                                    * cores_per_machine_family(
                                        &self.cfg.machine_families,
                                    ),
            max_ram_mib: self.cfg.max_concurrent_instances
                          * ram_per_machine_family(
                              &self.cfg.machine_families,
                          ),
        })
    }

    fn can_satisfy(&self, grant: &ComputeGrant<'_>) -> bool {
        if grant.tdx_required && self.cfg.tdx_machine_types.is_empty() {
            return false;
        }
        self.cfg.regions.contains(&grant.region)
    }
    // …
}

Azure

Mirror of GCP. The TDX-capable family is Azure's Intel TDX confidential VM sizes, the DCesv5 and ECesv5 series (Standard_DCadsv5 and Standard_ECadsv5 are AMD SEV-SNP sizes, not TDX); non-TDX SKUs are the default.
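Since the chapter does not show the Azure code, here is a minimal sketch of what "mirror of GCP" could look like for can_satisfy. The config shape and the `tdx_vm_sizes` field name are assumptions modelled on the GCP backend's `tdx_machine_types`, and `Grant` is a simplified stand-in for `ComputeGrant`.

```rust
/// Simplified stand-in for `ComputeGrant` (assumption).
pub struct Grant {
    pub region: String,
    pub tdx_required: bool,
}

/// Hypothetical config shape; field names are assumptions.
pub struct AzureBootConfig {
    pub regions: Vec<String>,
    /// TDX-capable VM sizes the operator has enabled.
    /// Empty means the backend only serves non-TDX grants.
    pub tdx_vm_sizes: Vec<String>,
}

pub struct AzureBackend {
    pub cfg: AzureBootConfig,
}

impl AzureBackend {
    /// Same derivation the GCP backend uses for its `tdx_capable` flag.
    pub fn tdx_capable(&self) -> bool {
        !self.cfg.tdx_vm_sizes.is_empty()
    }

    /// Reject TDX grants when no TDX size is configured, then check region.
    pub fn can_satisfy(&self, grant: &Grant) -> bool {
        if grant.tdx_required && !self.tdx_capable() {
            return false;
        }
        self.cfg.regions.contains(&grant.region)
    }
}
```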

Bare-metal

The most distinct backend. No cloud API. The bridge keeps SSH root sessions to operator-owned hosts and provisions workloads with systemd-run (bare VMs) or virsh plus qemu-tdx (bare-TDX hosts with nested guests).

// src/backends/baremetal.rs

pub struct BareMetalBackend {
    machines: Vec<BareMetalMachine>,
    sessions: DashMap<String, russh::Session>,
}

#[async_trait]
impl Backend for BareMetalBackend {
    fn name(&self) -> &'static str { "baremetal" }

    async fn capabilities(&self) -> anyhow::Result<Capabilities> {
        let regions: Vec<_> = self.machines.iter()
            .map(|m| m.region.clone())
            .collect::<HashSet<_>>().into_iter().collect();
        let tdx_capable = self.machines.iter()
            .any(|m| m.tdx_capable);
        Ok(Capabilities {
            regions,
            tdx_capable,
            max_cpu_millicores: self.machines.iter()
                .map(|m| m.cpu_millicores).sum(),
            max_ram_mib: self.machines.iter()
                .map(|m| m.ram_mib).sum(),
        })
    }
    // …
}

Bare-metal covers two recurring cases. One is TDX hardware not yet available on cloud: a bare-TDX host with nested-guest attestation is the path for operators running early-access TDX hardware, or operating in jurisdictions where cloud TDX is not generally available. The other is amortised hosting cost: operators who already run hardware (a colocation rack, a lab, a private datacentre) can serve TDX-required workloads without adding a cloud dependency.
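The excerpt above elides the bare-metal can_satisfy. A plausible sketch, under stated assumptions, is a per-machine match on region, TDX capability, and size. The `host` field and the grant's `cpu_millicores` / `ram_mib` fields are hypothetical; the other machine fields follow the names used in capabilities(). This version ignores current load on each host, which a real implementation would need to track.

```rust
/// Machine record; `host` is a hypothetical identifier, the remaining
/// fields follow the names used in `capabilities()`.
#[derive(Debug, Clone)]
pub struct BareMetalMachine {
    pub host: String,
    pub region: String,
    pub tdx_capable: bool,
    pub cpu_millicores: u64,
    pub ram_mib: u64,
}

/// Simplified stand-in for `ComputeGrant`; the size fields are assumptions.
pub struct Grant {
    pub region: String,
    pub tdx_required: bool,
    pub cpu_millicores: u64,
    pub ram_mib: u64,
}

/// Return the first machine that matches region, TDX requirement, and size.
pub fn pick_machine<'a>(
    machines: &'a [BareMetalMachine],
    grant: &Grant,
) -> Option<&'a BareMetalMachine> {
    machines.iter().find(|m| {
        m.region == grant.region
            && (!grant.tdx_required || m.tdx_capable)
            && m.cpu_millicores >= grant.cpu_millicores
            && m.ram_mib >= grant.ram_mib
    })
}

/// The backend can satisfy the grant if any single machine can.
pub fn can_satisfy(machines: &[BareMetalMachine], grant: &Grant) -> bool {
    pick_machine(machines, grant).is_some()
}
```

Note the contrast with capabilities(), which sums CPU and RAM across machines: a grant larger than any single host would fit the summed maximum yet fail this per-machine check.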

The capability union

Fleet::capabilities() aggregates every enabled backend:

pub async fn capabilities(&self)
    -> anyhow::Result<Vec<(String, Capabilities)>>
{
    let mut out = Vec::with_capacity(self.backends.len());
    for b in self.backends.iter() {
        let caps = b.capabilities().await?;
        out.push((b.name().to_string(), caps));
    }
    Ok(out)
}

The provider card folds this into a single capability summary. Requesters see the union of regions, a TDX-capable flag that is true if any backend is TDX-capable, and CPU/RAM maxes summed across backends. One bridge identity competes with cloud-only or bare-metal-only providers because the card promises whatever the union promises, regardless of which backend ends up satisfying the grant.
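The fold described here (union of regions, any-backend TDX flag, summed maxima) can be sketched as follows. This is an illustration rather than the crate's ProviderCard code: the `Capabilities` struct mirrors the field names used in this chapter, the integer types are assumed to be `u64`, and `fold_capabilities` is a hypothetical helper name.

```rust
use std::collections::BTreeSet;

/// Mirrors the chapter's field names; the integer types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Capabilities {
    pub regions: Vec<String>,
    pub tdx_capable: bool,
    pub max_cpu_millicores: u64,
    pub max_ram_mib: u64,
}

/// Fold the per-backend list from `Fleet::capabilities()` into one summary.
pub fn fold_capabilities(per_backend: &[(String, Capabilities)]) -> Capabilities {
    let mut regions = BTreeSet::new(); // deduplicates and keeps regions sorted
    let mut tdx_capable = false;
    let mut cpu = 0u64;
    let mut ram = 0u64;
    for (_name, caps) in per_backend {
        regions.extend(caps.regions.iter().cloned());
        tdx_capable |= caps.tdx_capable; // true if any backend is TDX-capable
        cpu = cpu.saturating_add(caps.max_cpu_millicores);
        ram = ram.saturating_add(caps.max_ram_mib);
    }
    Capabilities {
        regions: regions.into_iter().collect(),
        tdx_capable,
        max_cpu_millicores: cpu,
        max_ram_mib: ram,
    }
}
```

A fold like this drops the pairing between backends and their properties: the summary says that TDX is available somewhere and lists every region, but it cannot say which regions are TDX-capable.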

Honest limits

The Backend trait cannot make AWS honest about its real instance availability. capabilities() reports declared caps (the operator’s max_concurrent_instances multiplied by the per-family core and RAM figures), not a live cloud query. Grants that land during cloud-side exhaustion fail to provision; the Compute module scores the bridge down via the missing ComputeLog.

A bridge running AWS and GCP does not share credentials across backends. Each backend’s config holds distinct credentials with distinct scope; a compromise of one does not bleed into the others (modulo operator hygiene).

Cloud RunInstances / Insert / CreateOrUpdate calls add tens of seconds to the grant-to-SSH-ready path. Bare-metal’s pre-established SSH sessions make it the fastest path. Latency sensitivity is a request-side policy.

A grant provisioned on AWS cannot migrate transparently to GCP mid-run if AWS has a region outage. The bridge terminates the AWS instance, emits the partial ComputeLog, and the requester resubmits; the next grant lands on a different backend if AWS remains out.
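For a resubmitted grant to land on a different backend while the first one is still out, the router has to move past a backend whose provision call fails. The provision_for_grant shown earlier returns the first match's result, error included, so one option is a fall-through variant. The sketch below is an assumption about how such a policy could look, not the crate's behaviour: the trait is a simplified synchronous stand-in, and `FakeBackend` is a hypothetical test double.

```rust
/// Simplified stand-in for `ComputeGrant` (assumption).
pub struct Grant {
    pub region: String,
}

/// Synchronous stand-in for the chapter's async `Backend` trait (assumption).
pub trait Backend {
    fn name(&self) -> &'static str;
    fn can_satisfy(&self, grant: &Grant) -> bool;
    /// Returns a provisioned instance id, or an error string.
    fn provision(&self, grant: &Grant) -> Result<String, String>;
}

/// Try every backend that claims it can satisfy the grant, in fleet order,
/// and return the first successful provision. Errors are collected so the
/// final failure reports what was tried.
pub fn provision_with_fallthrough(
    backends: &[Box<dyn Backend>],
    grant: &Grant,
) -> Result<(&'static str, String), String> {
    let mut errors = Vec::new();
    for b in backends {
        if !b.can_satisfy(grant) {
            continue;
        }
        match b.provision(grant) {
            Ok(id) => return Ok((b.name(), id)),
            Err(e) => errors.push(format!("{}: {}", b.name(), e)),
        }
    }
    Err(format!("no backend could provision: [{}]", errors.join("; ")))
}

/// Hypothetical test double: a backend that is either up or out.
pub struct FakeBackend {
    pub name: &'static str,
    pub up: bool,
}

impl Backend for FakeBackend {
    fn name(&self) -> &'static str {
        self.name
    }
    fn can_satisfy(&self, _grant: &Grant) -> bool {
        true
    }
    fn provision(&self, grant: &Grant) -> Result<String, String> {
        if self.up {
            Ok(format!("{}-{}", self.name, grant.region))
        } else {
            Err("region outage".to_string())
        }
    }
}
```

The trade-off is latency: each failed attempt adds its own timeout to the grant-to-SSH-ready path, and a backend that fails part-way may leave resources that need a terminate call before moving on.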

Adding a new backend

Three steps to add a fifth backend class:

1. Add a src/backends/<name>.rs implementing Backend.
2. Add a [<name>] section to the boot TOML, with a matching optional field on the backends deserialise target.
3. Append the new backend to Fleet::from_boot_config.
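The three steps can be sketched end to end for a hypothetical Hetzner backend. Everything here is illustrative: the trait is cut down to two synchronous methods, `HetznerBackend` and `HetznerBootConfig` are invented names, and `backends_from_boot_config` stands in for the relevant part of Fleet::from_boot_config with a plain `String` error instead of anyhow.

```rust
use std::sync::Arc;

/// Simplified stand-in for `ComputeGrant` (assumption).
pub struct Grant {
    pub region: String,
}

/// Cut-down, synchronous version of the `Backend` trait (assumption).
pub trait Backend: Send + Sync {
    fn name(&self) -> &'static str;
    fn can_satisfy(&self, grant: &Grant) -> bool;
}

// Step 1: the new backend, which would live in src/backends/<name>.rs.
pub struct HetznerBootConfig {
    pub regions: Vec<String>,
}

pub struct HetznerBackend {
    cfg: HetznerBootConfig,
}

impl HetznerBackend {
    pub fn new(cfg: HetznerBootConfig) -> Self {
        Self { cfg }
    }
}

impl Backend for HetznerBackend {
    fn name(&self) -> &'static str {
        "hetzner"
    }
    fn can_satisfy(&self, grant: &Grant) -> bool {
        self.cfg.regions.contains(&grant.region)
    }
}

// Step 2: an optional section on the boot config. The existing
// aws / gcp / azure / baremetal fields are omitted from this sketch.
#[derive(Default)]
pub struct BackendsBootConfig {
    pub hetzner: Option<HetznerBootConfig>,
}

// Step 3: append the backend when its section is present.
pub fn backends_from_boot_config(
    cfg: BackendsBootConfig,
) -> Result<Vec<Arc<dyn Backend>>, String> {
    let mut backends: Vec<Arc<dyn Backend>> = Vec::new();
    if let Some(h) = cfg.hetzner {
        backends.push(Arc::new(HetznerBackend::new(h)));
    }
    if backends.is_empty() {
        return Err("no backends configured".to_string());
    }
    Ok(backends)
}
```

Because the router is first-match, where the new backend is pushed relative to the existing four also decides its place in the default preference order.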

The bridge organism’s Config.content folds in which backends the binary drives, so the backend set is part of the image’s measured identity; a bridge cannot quietly add a backend the requester did not expect.
