Running a confluence

audience: operators

Operating a confluence is, at the protocol level, identical to operating any mosaik-native committee organism: a committee, a state machine, a narrow public surface. This page is the delta — what is new because the confluence spans multiple citizens (lattices, standalone organisms, or a mix).

Modules (Atlas, Almanac, Chronicle) are coalition-scoped organisms; the same rules apply. Each module’s crate publishes its own page once commissioned.

For the base runbook — systemd units, dashboards, incident response — follow the builder operator runbooks. Each per-confluence crate publishes its own page once commissioned.

What to add on top of a single-committee runbook

  • Per-citizen subscription health. A confluence reads from multiple citizens’ public surfaces. The dashboard carries one subscription-lag metric per spanned citizen, not an aggregated metric. A confluence that stalls typically stalls on one citizen first.

  • Per-citizen ticket health. Committee members hold tickets from each spanned citizen’s operator. When a per-citizen operator rotates their ticket-issuance root, the confluence’s bonds into that citizen break until new tickets are issued to the committee. Monitor ticket validity horizons per citizen.

  • Citizen-identity monitoring. When a referenced citizen retires (stable id changes) or bumps its content hash (if the confluence pinned content), the confluence’s own Config fingerprint is stale; the confluence must be redeployed under an updated ConfluenceConfig. Detect this before integrators do.

  • Cross-operator communications. A confluence with citizens run by different operators requires a standing channel with each — at minimum a mailing list or chat channel for advance announcements of retirements and rotations.
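Per-citizen subscription health can be reduced to a small alerting rule: flag each spanned citizen whose lag exceeds a threshold, worst first, so the citizen a stalled confluence stalled on surfaces immediately. A minimal sketch, assuming hypothetical `CitizenLag` readings and an event-count threshold (neither is part of the mosaik API):

```rust
// Hypothetical sketch: per-citizen subscription-lag thresholding.
// `CitizenLag` and the threshold semantics are illustrative.
#[derive(Debug, PartialEq)]
pub struct CitizenLag {
    pub citizen_id: String,
    pub lag_events: u64,
}

/// Return the spanned citizens whose subscription lag exceeds `threshold`,
/// worst first. A confluence that stalls typically stalls on one citizen
/// first, so the head of this list is the place to start looking.
pub fn lagging_citizens(lags: &[CitizenLag], threshold: u64) -> Vec<&CitizenLag> {
    let mut over: Vec<&CitizenLag> =
        lags.iter().filter(|l| l.lag_events > threshold).collect();
    over.sort_by(|a, b| b.lag_events.cmp(&a.lag_events));
    over
}
```

This is why the dashboard carries one lag metric per spanned citizen rather than an aggregate: an aggregate hides which citizen is the laggard.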

The lifecycle of a confluence commit

  1. Committee driver watches each spanned citizen’s public collection / stream.
  2. An upstream event fires; the driver wraps it in an Observe* command with an evidence pointer back to the upstream commit.
  3. The confluence’s Group commits the Observe* via Raft.
  4. Periodically (or on an apply-deadline timer), the driver issues an Apply command. The state machine reads accumulated observations and commits the confluence’s own fact.
  5. The confluence’s public surface serves the committed fact to integrators and to any downstream consumers (other confluences, other organisms).

Every step is standard mosaik machinery. The confluence-specific work is the driver’s multi-subscription logic.
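The observe/apply half of the lifecycle (steps 3 and 4) can be sketched as a small state machine. All types and names below are hypothetical; the real driver runs against the mosaik committee APIs:

```rust
// Illustrative sketch of the commit lifecycle's observe/apply steps.
// `Observation` and `ConfluenceState` are hypothetical, not mosaik types.
#[derive(Debug, Clone, PartialEq)]
pub struct Observation {
    pub citizen_id: String,
    pub evidence_ptr: String, // pointer back to the upstream commit
}

#[derive(Default)]
pub struct ConfluenceState {
    pending: Vec<Observation>,
    committed_facts: Vec<Vec<Observation>>,
}

impl ConfluenceState {
    /// Step 3: an Observe* command committed via Raft is applied here.
    pub fn observe(&mut self, obs: Observation) {
        self.pending.push(obs);
    }

    /// Step 4: on an Apply command, fold the accumulated observations
    /// into the confluence's own fact; step 5 serves it from the
    /// public surface. Returns None when there is nothing to commit.
    pub fn apply(&mut self) -> Option<&Vec<Observation>> {
        if self.pending.is_empty() {
            return None;
        }
        let fact = std::mem::take(&mut self.pending);
        self.committed_facts.push(fact);
        self.committed_facts.last()
    }
}
```

The sketch makes the batching visible: observations accumulate between Apply commands, and an Apply with no pending observations commits nothing.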

Rotations

A confluence rotates like any other organism:

  • Committee member rotation. Add a new member under the same ConfluenceConfig; drain the old member; decommission. No fingerprint change.
  • Committee admission policy rotation (e.g. an MR_TD bump). Requires a new ConfluenceConfig fingerprint. Announce to any coalition operators referencing the confluence and follow the rotations and upgrades sequence.
  • Spanned-citizen-set rotation. Adding or removing a spanned citizen changes the confluence’s content fingerprint. This is a larger change; typically accompanied by a fresh ConfluenceConfig publication and a notice to referencing coalitions.

Rotations that do NOT break integrators

  • Committee member swaps under a stable ConfluenceConfig. Integrators see a brief latency bump during drain; no handle failures.

Rotations that DO break integrators

  • Any ConfluenceConfig fingerprint bump. Integrators compiled against the old config see ConnectTimeout on the confluence handle until they recompile against the new ConfluenceConfig (and any coalition referencing it updates its CoalitionConfig). Announce ahead of time via the change channel.
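The breaking/non-breaking split above reduces to one question: does the rotation bump the ConfluenceConfig fingerprint? A hypothetical classifier, useful as a pre-rotation checklist (the enum is illustrative; the real config type lives in the confluence crate):

```rust
// Hypothetical helper classifying the rotation kinds described above by
// whether they require a new ConfluenceConfig fingerprint.
#[derive(Debug, PartialEq)]
pub enum Rotation {
    /// Member add/drain/decommission under a stable ConfluenceConfig.
    CommitteeMember,
    /// Admission policy change, e.g. an MR_TD bump.
    AdmissionPolicy,
    /// Adding or removing a spanned citizen.
    SpannedCitizenSet,
}

/// Committee-member swaps keep the fingerprint; the other two rotations
/// bump it, so integrators see ConnectTimeout until they recompile.
pub fn bumps_fingerprint(r: &Rotation) -> bool {
    !matches!(r, Rotation::CommitteeMember)
}
```

Any rotation for which this returns true needs an advance announcement on the change channel.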

Retirement

When a confluence’s committee is shutting down permanently, the committee emits a RetirementMarker as its final commit on each public primitive. The marker carries:

  • effective_at — the Almanac tick or wall-clock at which the committee ceases to commit;
  • replacement — an optional pointer to the replacement confluence so integrators rebind cleanly rather than timing out.
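The marker's shape, as described above, can be sketched as follows. Field types are assumptions; the authoritative definition lives in the mosaik crates:

```rust
// Sketch of the retirement marker described above. Field types and the
// helper method are assumptions, not the authoritative definition.
#[derive(Debug, Clone, PartialEq)]
pub struct RetirementMarker {
    /// Almanac tick or wall-clock at which the committee ceases to commit.
    pub effective_at: u64,
    /// Optional pointer to the replacement confluence, so integrators
    /// rebind cleanly rather than timing out.
    pub replacement: Option<String>,
}

impl RetirementMarker {
    /// Integrator-side handling: rebind if a replacement is named,
    /// otherwise unbind before `effective_at`.
    pub fn rebind_target(&self) -> Option<&str> {
        self.replacement.as_deref()
    }
}
```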

If any referencing coalition ships a Chronicle, the retirement lands as ChronicleKind::ConfluenceRetired in the next Chronicle entry.

Incident response specific to confluences

Two incident classes to add to your playbook on top of the single-organism classes in builder incident response.

Evidence-pointer resolution fails on replay

Symptom: a committee member replaying the log fails to resolve an evidence pointer to an upstream citizen commit. Cause: the upstream citizen committed the fact, the confluence observed it, but the citizen has since gone through a state compaction / reorg that removed the referenced commit from the public surface the confluence reads.

Response:

  • The confluence state machine is required to reject such replays, not tolerate them.
  • Confirm the issue is upstream-citizen retention, not confluence state.
  • Coordinate with the per-citizen operator. The fix is usually citizen-side configuration: longer retention on the public surface the confluence subscribes to.
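The first bullet is a hard invariant, and it is worth seeing why: a replay that silently tolerates a missing evidence pointer would diverge from members that replayed while the commit was still resolvable. A minimal sketch of the rejection rule, with a closure standing in for the real upstream-surface lookup:

```rust
// Sketch of the replay rule: a replaying member must reject, not
// tolerate, an unresolvable evidence pointer. `resolve` stands in for
// the real lookup against the upstream citizen's public surface.
#[derive(Debug, PartialEq)]
pub enum ReplayError {
    UnresolvableEvidence(String),
}

/// Replay a logged observation, failing hard if the upstream commit the
/// evidence pointer names is no longer on the citizen's public surface
/// (e.g. after upstream compaction removed it).
pub fn replay_observation<F>(evidence_ptr: &str, resolve: F) -> Result<(), ReplayError>
where
    F: Fn(&str) -> bool,
{
    if resolve(evidence_ptr) {
        Ok(())
    } else {
        Err(ReplayError::UnresolvableEvidence(evidence_ptr.to_string()))
    }
}
```

The error names the pointer, which is the first thing the per-citizen operator will ask for when you open the retention conversation.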

One spanned citizen goes dark

Symptom: no events from one of the spanned citizens’ public surfaces for multiple slots (or multiple publish ticks, if the citizen is an organism without a slot clock).

Response:

  • Confirm with the per-citizen operator whether the outage is on their side or on the subscription.
  • If on their side, fall back to the stall policy. A confluence committing partial evidence yields degraded commits; one stalling per slot yields no commits until the citizen returns. Integrators were warned in the composition-hooks doc.
  • If on the subscription side: mosaik transport troubleshooting; nothing confluence-specific.
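The two stall policies mentioned above differ only in the commit predicate. A hypothetical sketch (the policy names are illustrative, not the confluence crate's actual configuration):

```rust
// Sketch of the two stall policies described above. Names are
// illustrative; the real policy lives in the confluence's configuration.
#[derive(Debug, PartialEq)]
pub enum StallPolicy {
    /// Commit with whatever evidence is present: degraded commits.
    PartialEvidence,
    /// Withhold commits until every spanned citizen has reported:
    /// no commits until the dark citizen returns.
    StallPerSlot,
}

/// Decide whether this slot may commit, given how many of the spanned
/// citizens reported.
pub fn may_commit(policy: &StallPolicy, reported: usize, spanned: usize) -> bool {
    match policy {
        StallPolicy::PartialEvidence => reported > 0,
        StallPolicy::StallPerSlot => reported == spanned,
    }
}
```

Either way the behavior is a documented contract with integrators, not an incident-time improvisation.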

Cross-references