Wrapping cloud and bare-metal backends
audience: ai
A compute-bridge’s edge comes from spanning
backends no single cloud matches: AWS plus GCP plus
Azure plus bare-metal behind one provider identity,
covering regions, pricing, and TDX hardware
profiles a single cloud cannot. This chapter walks
the pattern that makes that possible: a small
Backend trait and a Fleet router.
The question: when the bridge’s edge is multi-
backend coverage, how are four radically different
cloud and bare-metal APIs folded into one
ProviderCard the Compute market can reason about?
The Backend trait
From
src/backends/mod.rs:
#[async_trait]
pub trait Backend: Send + Sync {
    fn name(&self) -> &'static str;
    async fn capabilities(&self) -> anyhow::Result<Capabilities>;
    fn can_satisfy(&self, grant: &ComputeGrant<'_>) -> bool;
    async fn provision(
        &self,
        grant: &ComputeGrant<'_>,
        envelope: &Envelope,
    ) -> anyhow::Result<ProvisionedInstance>;
    async fn watch_until_exit(
        &self,
        instance: &ProvisionedInstance,
        valid_to: AlmanacTick,
    ) -> anyhow::Result<UsageMetrics>;
    async fn terminate(
        &self,
        instance: &ProvisionedInstance,
    ) -> anyhow::Result<()>;
}
Six methods. A new backend class (Oracle Cloud,
Hetzner, a decentralised-compute protocol) slots in
by implementing these six and appending itself to
Fleet::from_boot_config. The trait is narrow on
purpose.
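As a rough illustration of how a new backend class might slot in, the sketch below implements a reduced, synchronous stand-in for the trait. It is not the crate's code: the trait is cut down to the three selection-related methods, ComputeGrant and Capabilities are simplified stand-ins, and HetznerBackend, its fields, and the per-instance sizing are hypothetical.

```rust
// Simplified stand-ins for the crate's types (assumptions, not the real definitions).
#[derive(Debug, Clone, PartialEq)]
pub struct Capabilities {
    pub regions: Vec<String>,
    pub tdx_capable: bool,
    pub max_cpu_millicores: u64,
    pub max_ram_mib: u64,
}

pub struct ComputeGrant {
    pub region: String,
    pub tdx_required: bool,
}

// Reduced, synchronous version of the trait: only the methods used for selection.
pub trait Backend: Send + Sync {
    fn name(&self) -> &'static str;
    fn capabilities(&self) -> Capabilities;
    fn can_satisfy(&self, grant: &ComputeGrant) -> bool;
}

// Hypothetical fifth backend class: a fixed region list and no TDX support.
pub struct HetznerBackend {
    pub regions: Vec<String>,
    pub max_instances: u64,
}

impl Backend for HetznerBackend {
    fn name(&self) -> &'static str {
        "hetzner"
    }

    fn capabilities(&self) -> Capabilities {
        Capabilities {
            regions: self.regions.clone(),
            tdx_capable: false,
            // Illustrative sizing only: 4 vCPU and 8 GiB assumed per instance.
            max_cpu_millicores: self.max_instances * 4_000,
            max_ram_mib: self.max_instances * 8_192,
        }
    }

    fn can_satisfy(&self, grant: &ComputeGrant) -> bool {
        // Declines TDX grants outright, then checks the region list.
        !grant.tdx_required && self.regions.contains(&grant.region)
    }
}
```

The real trait also needs provision, watch_until_exit, and terminate, which carry the backend-specific work; the selection methods above are only the part the router consults.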
The Fleet router
Also in
src/backends/mod.rs:
pub struct Fleet {
    backends: Arc<Vec<Arc<dyn Backend>>>,
}

impl Fleet {
    pub async fn from_boot_config(
        cfg: &BackendsBootConfig,
    ) -> anyhow::Result<Self> {
        let mut backends: Vec<Arc<dyn Backend>> =
            Vec::new();
        if let Some(aws) = &cfg.aws {
            backends.push(Arc::new(
                aws::AwsBackend::new(aws).await?,
            ));
        }
        if let Some(gcp) = &cfg.gcp {
            backends.push(Arc::new(
                gcp::GcpBackend::new(gcp).await?,
            ));
        }
        if let Some(azure) = &cfg.azure {
            backends.push(Arc::new(
                azure::AzureBackend::new(azure).await?,
            ));
        }
        if let Some(bm) = &cfg.baremetal {
            backends.push(Arc::new(
                baremetal::BareMetalBackend::new(bm).await?,
            ));
        }
        if backends.is_empty() {
            anyhow::bail!(
                "no backends configured: enable at least \
                 one of aws / gcp / azure / baremetal in \
                 the boot config"
            );
        }
        Ok(Self { backends: Arc::new(backends) })
    }

    pub async fn provision_for_grant(
        &self,
        grant: &ComputeGrant<'_>,
        envelope: &Envelope,
    ) -> anyhow::Result<ProvisionedInstance> {
        // First backend whose can_satisfy returns true handles the grant.
        for b in self.backends.iter() {
            if b.can_satisfy(grant) {
                return b.provision(grant, envelope).await;
            }
        }
        anyhow::bail!(
            "no backend can satisfy grant {:?}",
            grant.request_id,
        )
    }
}
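First match wins. As written, from_boot_config pushes backends in a fixed order (AWS, GCP, Azure, bare-metal), so that order is the default preference. Operators who want a different static preference (cheapest first, lowest-latency first) arrange the backend vector in that order. A load-dependent preference such as fewest-active-grants first cannot be expressed as a fixed order and needs a router that re-evaluates per grant. A more sophisticated router is a policy the operator can layer on top; the book leaves the decision open.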
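One way a static preference order could be layered on a first-match router is to sort the backend list once by an operator-supplied list of names. The sketch below is an assumption-laden illustration, not the crate's router: FixedBackend, order_by_preference, and pick are hypothetical names, and the trait is reduced to the two methods the selection needs.

```rust
pub struct ComputeGrant {
    pub region: String,
    pub tdx_required: bool,
}

// Reduced stand-in for the Backend trait (assumption).
pub trait Backend {
    fn name(&self) -> &'static str;
    fn can_satisfy(&self, grant: &ComputeGrant) -> bool;
}

// Hypothetical backend with a fixed region list, used only for illustration.
pub struct FixedBackend {
    pub name: &'static str,
    pub regions: Vec<&'static str>,
    pub tdx: bool,
}

impl Backend for FixedBackend {
    fn name(&self) -> &'static str {
        self.name
    }

    fn can_satisfy(&self, grant: &ComputeGrant) -> bool {
        (self.tdx || !grant.tdx_required)
            && self.regions.iter().any(|r| *r == grant.region)
    }
}

/// Reorders backends to follow `preference` (a list of backend names).
/// Backends not named keep their relative order and go last, because
/// `sort_by_key` is a stable sort.
pub fn order_by_preference(
    mut backends: Vec<Box<dyn Backend>>,
    preference: &[&str],
) -> Vec<Box<dyn Backend>> {
    backends.sort_by_key(|b| {
        preference
            .iter()
            .position(|p| *p == b.name())
            .unwrap_or(preference.len())
    });
    backends
}

/// First-match selection: returns the name of the first backend that accepts the grant.
pub fn pick(backends: &[Box<dyn Backend>], grant: &ComputeGrant) -> Option<&'static str> {
    backends.iter().find(|b| b.can_satisfy(grant)).map(|b| b.name())
}
```

Sorting once at boot keeps the per-grant path identical to a plain first-match loop; a preference that depends on live state would have to be recomputed for each grant instead.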
The four backends
Each backend is one file in
src/backends/.
Representative shapes:
AWS
// src/backends/aws.rs
pub struct AwsBackend {
    ec2: aws_sdk_ec2::Client,
    cfg: AwsBootConfig,
}

#[async_trait]
impl Backend for AwsBackend {
    fn name(&self) -> &'static str { "aws" }

    async fn capabilities(&self) -> anyhow::Result<Capabilities> {
        Ok(Capabilities {
            regions: self.cfg.regions.clone(),
            tdx_capable: false, // pending AWS Nitro-TDX GA
            max_cpu_millicores: self.cfg.max_concurrent_instances
                * cores_per_family(&self.cfg.instance_families),
            max_ram_mib: self.cfg.max_concurrent_instances
                * ram_per_family(&self.cfg.instance_families),
        })
    }

    fn can_satisfy(&self, grant: &ComputeGrant<'_>) -> bool {
        !grant.tdx_required
            && self.cfg.regions.contains(&grant.region)
    }

    async fn provision(
        &self,
        grant: &ComputeGrant<'_>,
        envelope: &Envelope,
    ) -> anyhow::Result<ProvisionedInstance> {
        // 1. Pick an instance family that fits.
        // 2. RunInstances with cloud-init fetching the
        //    image pointer and verifying the hash.
        // 3. Wait for IP; poll ssh readiness.
        // 4. Build ProvisionedInstance with per-grant
        //    SSH key.
        todo!("AwsBackend::provision: see crate source")
    }

    // … watch_until_exit, terminate
}
GCP
Similar shape to AWS; the difference is that TDX capability is driven by the configured TDX machine types instead of being hard-coded to false:
// src/backends/gcp.rs
#[async_trait]
impl Backend for GcpBackend {
    fn name(&self) -> &'static str { "gcp" }

    async fn capabilities(&self) -> anyhow::Result<Capabilities> {
        Ok(Capabilities {
            regions: self.cfg.regions.clone(),
            tdx_capable: !self.cfg.tdx_machine_types.is_empty(),
            max_cpu_millicores: self.cfg.max_concurrent_instances
                * cores_per_machine_family(
                    &self.cfg.machine_families,
                ),
            max_ram_mib: self.cfg.max_concurrent_instances
                * ram_per_machine_family(
                    &self.cfg.machine_families,
                ),
        })
    }

    fn can_satisfy(&self, grant: &ComputeGrant<'_>) -> bool {
        // A TDX grant needs at least one configured TDX machine type.
        if grant.tdx_required && self.cfg.tdx_machine_types.is_empty() {
            return false;
        }
        self.cfg.regions.contains(&grant.region)
    }

    // …
}
Azure
Mirror of GCP. Azure’s Intel TDX confidential VM sizes (the
DCesv5 and ECesv5 series) are the TDX-capable family; the
DCadsv5 and ECadsv5 series are AMD SEV-SNP confidential VMs,
not TDX. Non-TDX SKUs are the default.
Bare-metal
The most distinct backend. No cloud API. The
bridge keeps SSH root sessions to operator-owned
hosts and provisions workloads with systemd-run
(bare VMs) or virsh plus qemu-tdx (bare-TDX
hosts with nested guests).
// src/backends/baremetal.rs
pub struct BareMetalBackend {
    machines: Vec<BareMetalMachine>,
    sessions: DashMap<String, russh::Session>,
}

#[async_trait]
impl Backend for BareMetalBackend {
    fn name(&self) -> &'static str { "baremetal" }

    async fn capabilities(&self) -> anyhow::Result<Capabilities> {
        // Deduplicate regions across machines via a HashSet.
        let regions: Vec<_> = self.machines.iter()
            .map(|m| m.region.clone())
            .collect::<HashSet<_>>().into_iter().collect();
        let tdx_capable = self.machines.iter()
            .any(|m| m.tdx_capable);
        Ok(Capabilities {
            regions,
            tdx_capable,
            max_cpu_millicores: self.machines.iter()
                .map(|m| m.cpu_millicores).sum(),
            max_ram_mib: self.machines.iter()
                .map(|m| m.ram_mib).sum(),
        })
    }

    // …
}
Bare-metal covers two recurring cases. One is TDX hardware not yet available on cloud: a bare-TDX host with nested-guest attestation is the path for operators running early-access TDX hardware, or operating in jurisdictions where cloud TDX is not generally available. The other is amortised hosting cost: operators who already run hardware (a colocation rack, a lab, a private datacentre) can serve TDX-required workloads without adding a cloud dependency.
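The bare-metal selection logic is elided above, so the following is only a guess at its general shape: pick the first operator-owned machine in the right region that meets the TDX requirement and is large enough. The cpu_millicores and ram_mib fields on ComputeGrant and the find_machine helper are hypothetical, and the sketch ignores whatever is already running on each host.

```rust
// Stand-in for the crate's machine record; field names follow the
// capabilities() code, the types are assumptions.
pub struct BareMetalMachine {
    pub region: String,
    pub tdx_capable: bool,
    pub cpu_millicores: u64,
    pub ram_mib: u64,
}

// Simplified grant; the resource fields are assumed for illustration.
pub struct ComputeGrant {
    pub region: String,
    pub tdx_required: bool,
    pub cpu_millicores: u64,
    pub ram_mib: u64,
}

/// Returns the first machine that could host the grant on its own.
/// Current load is not tracked in this sketch.
pub fn find_machine<'a>(
    machines: &'a [BareMetalMachine],
    grant: &ComputeGrant,
) -> Option<&'a BareMetalMachine> {
    machines.iter().find(|m| {
        m.region == grant.region
            && (m.tdx_capable || !grant.tdx_required)
            && m.cpu_millicores >= grant.cpu_millicores
            && m.ram_mib >= grant.ram_mib
    })
}
```

Under these assumptions a grant has to fit on a single host, so a fleet whose summed capacity looks sufficient can still return None for a large request.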
The capability union
Fleet::capabilities() aggregates every enabled
backend:
pub async fn capabilities(&self)
    -> anyhow::Result<Vec<(String, Capabilities)>>
{
    let mut out = Vec::with_capacity(self.backends.len());
    for b in self.backends.iter() {
        let caps = b.capabilities().await?;
        out.push((b.name().to_string(), caps));
    }
    Ok(out)
}
The provider card folds this into a single capability summary. Requesters see the union of regions, a TDX-capable flag that is true if any backend is TDX-capable, and CPU/RAM maxes summed across backends. One bridge identity competes with cloud-only or bare-metal-only providers because the card promises whatever the union promises, regardless of which backend ends up satisfying the grant.
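The fold described here (region union, any-backend TDX flag, summed CPU and RAM) might look roughly like the sketch below. CardSummary and fold_capabilities are hypothetical names, and Capabilities is a simplified stand-in with the fields used in the backend examples.

```rust
use std::collections::BTreeSet;

// Simplified stand-in for the crate's Capabilities (assumption).
#[derive(Debug, Clone)]
pub struct Capabilities {
    pub regions: Vec<String>,
    pub tdx_capable: bool,
    pub max_cpu_millicores: u64,
    pub max_ram_mib: u64,
}

// Hypothetical single summary a provider card could carry.
#[derive(Debug, PartialEq)]
pub struct CardSummary {
    pub regions: Vec<String>,
    pub tdx_capable: bool,
    pub max_cpu_millicores: u64,
    pub max_ram_mib: u64,
}

/// Folds per-backend capabilities into one summary.
pub fn fold_capabilities(per_backend: &[(String, Capabilities)]) -> CardSummary {
    // BTreeSet deduplicates regions and yields them in sorted order.
    let regions: BTreeSet<String> = per_backend
        .iter()
        .flat_map(|(_, c)| c.regions.iter().cloned())
        .collect();
    CardSummary {
        regions: regions.into_iter().collect(),
        // True if any single backend is TDX-capable.
        tdx_capable: per_backend.iter().any(|(_, c)| c.tdx_capable),
        max_cpu_millicores: per_backend.iter().map(|(_, c)| c.max_cpu_millicores).sum(),
        max_ram_mib: per_backend.iter().map(|(_, c)| c.max_ram_mib).sum(),
    }
}
```

The summary loses the pairing between region and TDX support: a TDX-capable flag plus a region in the list does not by itself say that a TDX backend exists in that region.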
Honest limits
The Backend trait cannot make AWS honest about
its real instance availability. capabilities()
reports declared caps — the operator’s
max_concurrent_instances tallied — not a live
cloud query. Grants that land during cloud-side
exhaustion fail to provision; the Compute module
scores the bridge down via the missing
ComputeLog.
A bridge running AWS and GCP does not share credentials across backends. Each backend’s config holds distinct credentials with distinct scope; a compromise of one does not bleed into the others (modulo operator hygiene).
Cloud RunInstances / Insert /
CreateOrUpdate calls add tens of seconds to the
grant-to-SSH-ready path. Bare-metal’s pre-
established SSH sessions are the fastest. Latency
sensitivity is a request-side policy.
A grant provisioned on AWS cannot migrate
transparently to GCP mid-run if AWS has a region
outage. The bridge terminates the AWS instance,
emits the partial ComputeLog, and the requester
resubmits; the next grant lands on a different
backend if AWS remains out.
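The chapter does not say how the router learns that a backend is out. One possible mechanism, sketched below purely as an assumption, is a health flag that the bridge flips on an outage and that can_satisfy consults, so that a resubmitted grant falls through to the next backend. CloudBackend, mark_outage, and route are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

pub struct ComputeGrant {
    pub region: String,
}

// Hypothetical backend record with a health flag (not part of the crate's API).
pub struct CloudBackend {
    pub name: &'static str,
    pub regions: Vec<&'static str>,
    healthy: AtomicBool,
}

impl CloudBackend {
    pub fn new(name: &'static str, regions: Vec<&'static str>) -> Self {
        Self { name, regions, healthy: AtomicBool::new(true) }
    }

    /// Called by the bridge when it observes an outage on this backend.
    pub fn mark_outage(&self) {
        self.healthy.store(false, Ordering::Relaxed);
    }

    /// An unhealthy backend declines every grant.
    pub fn can_satisfy(&self, grant: &ComputeGrant) -> bool {
        self.healthy.load(Ordering::Relaxed)
            && self.regions.iter().any(|r| *r == grant.region)
    }
}

/// First-match routing over the backends, as in the Fleet router.
pub fn route(backends: &[CloudBackend], grant: &ComputeGrant) -> Option<&'static str> {
    backends.iter().find(|b| b.can_satisfy(grant)).map(|b| b.name)
}
```

With a flag like this, the in-flight grant still fails as described; only the resubmitted grant is routed elsewhere.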
Adding a new backend
Three steps to add a fifth backend class:
- Add a src/backends/<name>.rs implementing Backend.
- Add a [<name>] section to the boot TOML’s backends deserialise target.
- Append the new backend to Fleet::from_boot_config.
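Steps two and three can be sketched as follows, with the caveat that this is a synchronous, std-only simplification. The hetzner field, HetznerBootConfig, and HetznerBackend are hypothetical; the real config struct is a deserialise target and the real constructors are async and fallible.

```rust
// Reduced stand-in for the Backend trait (assumption).
pub trait Backend {
    fn name(&self) -> &'static str;
}

pub struct AwsBootConfig;
pub struct HetznerBootConfig; // hypothetical config for the new backend

pub struct AwsBackend;
pub struct HetznerBackend; // step 1: the new src/backends/<name>.rs type

impl Backend for AwsBackend {
    fn name(&self) -> &'static str { "aws" }
}

impl Backend for HetznerBackend {
    fn name(&self) -> &'static str { "hetzner" }
}

// Step 2: the boot config gains an optional section for the new backend.
#[derive(Default)]
pub struct BackendsBootConfig {
    pub aws: Option<AwsBootConfig>,
    pub hetzner: Option<HetznerBootConfig>,
}

// Step 3: the constructor appends the new backend when its section is present.
pub fn from_boot_config(
    cfg: &BackendsBootConfig,
) -> Result<Vec<Box<dyn Backend>>, String> {
    let mut backends: Vec<Box<dyn Backend>> = Vec::new();
    if cfg.aws.is_some() {
        backends.push(Box::new(AwsBackend));
    }
    if cfg.hetzner.is_some() {
        backends.push(Box::new(HetznerBackend));
    }
    if backends.is_empty() {
        return Err("no backends configured".to_string());
    }
    Ok(backends)
}
```

Where the new push lands relative to the existing ones matters, because the router takes the first backend that can satisfy a grant.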
The bridge organism’s Config.content folds in which
backends the binary drives, so the backend set is
part of the image’s measured identity; a bridge
cannot quietly add a backend the requester did not
expect.