
Blackship Architecture: State Machines, Dependency Graphs, and Resilience Patterns

Most jail managers track state with flags. is_running: bool. Maybe a PID file. When something goes wrong, you’re left guessing: Is it starting? Stopping? Half-crashed? Should you kill it?

Blackship takes a different approach. Every jail is a state machine with explicit transitions. Every startup sequence respects a dependency graph. Every restart uses circuit breakers with exponential backoff. Here’s how it works.

The Jail State Machine

stateDiagram-v2
    [*] --> Stopped
    Stopped --> Starting: start()
    Starting --> Running: started()
    Running --> Stopping: stop()
    Stopping --> Stopped: stopped()

    Starting --> Failed: fail()
    Running --> Failed: fail()
    Stopping --> Failed: fail()

    Failed --> Stopped: recover()

Five states. Six events. Every transition is explicit.

Why This Matters

Consider what happens when you run blackship up web:

  1. Stopped → Starting: The start() event triggers. Hooks with phase = "pre_start" execute.
  2. Starting → Running: After jail creation succeeds, started() fires. Hooks with phase = "post_start" execute.
  3. If anything fails: The fail() event moves the jail to Failed state. No ambiguity.

Compare to flag-based systems:

# Typical approach
jail.is_running = True
jail.start()  # What if this fails halfway?
# Now is_running is True but the jail isn't actually running

With explicit state machines, invalid transitions are rejected:

blackship> stop web
Error: Cannot stop jail 'web' - current state is Stopped (expected Running)

You can’t stop something that isn’t running. You can’t start something that’s already starting. The state machine enforces this.
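To make the rejection rule concrete, here is a minimal sketch of the transition table from the diagram above. The type and function names (State, Event, InvalidTransition, transition) are assumptions for illustration, not Blackship's actual API.

```rust
/// The five jail states from the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Stopped,
    Starting,
    Running,
    Stopping,
    Failed,
}

/// The six events from the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Start,
    Started,
    Stop,
    Stopped,
    Fail,
    Recover,
}

/// Returned when an event is not valid in the current state (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidTransition {
    pub from: State,
    pub event: Event,
}

/// Applies one event. Any (state, event) pair not listed is rejected.
pub fn transition(from: State, event: Event) -> Result<State, InvalidTransition> {
    match (from, event) {
        (State::Stopped, Event::Start) => Ok(State::Starting),
        (State::Starting, Event::Started) => Ok(State::Running),
        (State::Running, Event::Stop) => Ok(State::Stopping),
        (State::Stopping, Event::Stopped) => Ok(State::Stopped),
        // fail() is accepted from every in-flight or running state.
        (State::Starting, Event::Fail)
        | (State::Running, Event::Fail)
        | (State::Stopping, Event::Fail) => Ok(State::Failed),
        (State::Failed, Event::Recover) => Ok(State::Stopped),
        _ => Err(InvalidTransition { from, event }),
    }
}
```

Because the catch-all arm rejects everything not listed, adding a new state or event forces a decision about each transition instead of silently allowing it.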

Dynamic Dispatch Mode

The state machine uses dynamic dispatch at runtime, allowing external events (health check failures, manual commands, supervisor signals) to drive transitions. This is the difference between:

  • Compile-time state machines: Good for protocols with fixed sequences
  • Runtime state machines: Good for reactive systems where external events arrive unpredictably

Jails are reactive. A jail can fail at any moment. A user can stop it at any moment. Health checks run on intervals. Dynamic dispatch handles all of this.

Dependency Graphs with Topological Ordering

When jails depend on each other, order matters. You can’t start your app before the database. You shouldn’t stop the database while the app is using it.

Blackship uses petgraph to build a directed acyclic graph (DAG) of jail dependencies:

[[jails]]
name = "app"
depends_on = ["cache", "database"]

[[jails]]
name = "cache"
depends_on = ["database"]

[[jails]]
name = "database"

This creates:

graph LR
    database --> cache
    database --> app
    cache --> app

Startup Order (Topological Sort)

When you run blackship up app, the dependency graph is walked:

  1. Find all transitive dependencies of app
  2. Topologically sort them
  3. Start in order: database → cache → app

Each jail waits for its dependencies to reach Running state before starting.
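The three steps can be sketched with a depth-first post-order walk. Blackship itself uses petgraph. This version uses only the standard library so it stands alone, and the startup_order function and its map-based input are assumptions for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Returns the target and its transitive dependencies, dependencies first.
/// Assumes the graph is acyclic (cycles are rejected earlier, at validation).
pub fn startup_order(deps: &HashMap<&str, Vec<&str>>, target: &str) -> Vec<String> {
    fn visit(
        deps: &HashMap<&str, Vec<&str>>,
        jail: &str,
        seen: &mut HashSet<String>,
        order: &mut Vec<String>,
    ) {
        // Skip jails that are already scheduled.
        if !seen.insert(jail.to_string()) {
            return;
        }
        // Schedule every dependency before the jail itself.
        if let Some(children) = deps.get(jail) {
            for dep in children {
                visit(deps, dep, seen, order);
            }
        }
        order.push(jail.to_string());
    }

    let mut seen = HashSet::new();
    let mut order = Vec::new();
    visit(deps, target, &mut seen, &mut order);
    order
}
```

For the manifest above, startup_order(&deps, "app") yields database, cache, app. Reversing that vector gives a dependents-first order, which is the shape a shutdown needs.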

Shutdown Order (Reverse Topological Sort)

When you run blackship down app:

  1. Find all jails that depend on app (reverse dependencies)
  2. Topologically sort (reversed)
  3. Stop dependents first: app → cache → database

This ensures nothing is stopped while something else depends on it.

Cycle Detection

Circular dependencies are caught at config validation:

[[jails]]
name = "a"
depends_on = ["b"]

[[jails]]
name = "b"
depends_on = ["a"]  # Error: Cycle detected

blackship check
Error: Dependency cycle detected: a → b → a
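One way such a check could work is a depth-first search that remembers the current path and reports it when it meets a jail that is still being visited. This is a standard-library sketch under that assumption. The find_cycle name and the Mark enum are hypothetical, and Blackship's real check is built on petgraph.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum Mark {
    InProgress,
    Done,
}

/// Returns one dependency cycle as a path (first jail repeated at the end), if any.
pub fn find_cycle(deps: &HashMap<String, Vec<String>>) -> Option<Vec<String>> {
    fn visit(
        deps: &HashMap<String, Vec<String>>,
        jail: &str,
        marks: &mut HashMap<String, Mark>,
        path: &mut Vec<String>,
    ) -> Option<Vec<String>> {
        match marks.get(jail) {
            Some(Mark::Done) => return None,
            Some(Mark::InProgress) => {
                // Back edge: the cycle runs from the first visit of `jail` to here.
                let start = path.iter().position(|j| j == jail).unwrap_or(0);
                let mut cycle = path[start..].to_vec();
                cycle.push(jail.to_string());
                return Some(cycle);
            }
            None => {}
        }
        marks.insert(jail.to_string(), Mark::InProgress);
        path.push(jail.to_string());
        if let Some(children) = deps.get(jail) {
            for dep in children {
                if let Some(cycle) = visit(deps, dep, marks, path) {
                    return Some(cycle);
                }
            }
        }
        path.pop();
        marks.insert(jail.to_string(), Mark::Done);
        None
    }

    let mut marks = HashMap::new();
    // Sort the start points so the reported cycle is deterministic.
    let mut names: Vec<&String> = deps.keys().collect();
    names.sort();
    for name in names {
        let mut path = Vec::new();
        if let Some(cycle) = visit(deps, name, &mut marks, &mut path) {
            return Some(cycle);
        }
    }
    None
}
```

Joining the returned path with " → " reproduces the shape of the error message above.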

The Warden: Resilience Through Circuit Breakers

When a jail crashes, the naive approach is: restart immediately. Forever.

This creates restart loops. The jail crashes, restarts, crashes in 100ms, restarts, crashes, restarts… CPU spins. Logs fill up. Nothing improves.

The Warden (Blackship’s supervisor) implements three resilience patterns:

1. Exponential Backoff

Attempt 1: Wait 1 second
Attempt 2: Wait 2 seconds
Attempt 3: Wait 4 seconds
Attempt 4: Wait 8 seconds
...
Attempt N: Wait min(2^(N-1), 60) seconds

Each delay is jittered by ±50% to prevent a thundering herd when multiple jails fail simultaneously.
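The delay calculation can be sketched as a pure function. The function name and constants are assumptions, and so is the choice to apply the 60-second cap before jitter. The jitter value is passed in by the caller, since Rust's standard library has no random number generator.

```rust
use std::time::Duration;

const BASE_SECS: f64 = 1.0;
const MAX_SECS: f64 = 60.0;

/// Delay before restart attempt `attempt` (1-based).
/// `jitter` is a caller-supplied random value, clamped to [-0.5, 0.5] (±50%).
pub fn backoff_delay(attempt: u32, jitter: f64) -> Duration {
    // Attempt 1 waits 2^0 = 1s, attempt 2 waits 2^1 = 2s, and so on.
    let exponent = attempt.saturating_sub(1).min(30) as i32;
    let base = (BASE_SECS * 2f64.powi(exponent)).min(MAX_SECS);
    let factor = 1.0 + jitter.clamp(-0.5, 0.5);
    Duration::from_secs_f64(base * factor)
}
```

With zero jitter, attempts 1 through 4 wait 1, 2, 4, and 8 seconds, and attempt 7 onward is capped at 60 seconds.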

2. Circuit Breaker

After 5 consecutive failures, the circuit opens:

stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN: 5 failures
    OPEN --> HALF_OPEN: 5 minutes timeout
    HALF_OPEN --> CLOSED: success
    HALF_OPEN --> OPEN: failure

    note right of CLOSED: Normal operation\nRestarts allowed
    note right of OPEN: No restarts\nWaiting for timeout
    note right of HALF_OPEN: Test one restart

When the circuit is open, no restart attempts are made. This prevents wasting resources on a jail that clearly can’t run.

After 5 minutes, the circuit moves to half-open. One restart attempt is made. If it succeeds, we’re back to normal. If it fails, the circuit opens again.
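A minimal sketch of that breaker follows. The struct, its methods, and the decision to pass the current time in explicitly are assumptions made to keep the example testable, not the Warden's real interface.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Circuit {
    Closed,
    Open,
    HalfOpen,
}

pub struct CircuitBreaker {
    state: Circuit,
    consecutive_failures: u32,
    opened_at: Option<Instant>,
    failure_threshold: u32,
    open_timeout: Duration,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, open_timeout: Duration) -> Self {
        CircuitBreaker {
            state: Circuit::Closed,
            consecutive_failures: 0,
            opened_at: None,
            failure_threshold,
            open_timeout,
        }
    }

    /// Returns true if a restart may be attempted at `now`.
    /// An open circuit moves to half-open once the timeout has elapsed.
    pub fn allow_restart(&mut self, now: Instant) -> bool {
        if self.state == Circuit::Open {
            if let Some(opened) = self.opened_at {
                if now.duration_since(opened) >= self.open_timeout {
                    self.state = Circuit::HalfOpen;
                }
            }
        }
        self.state != Circuit::Open
    }

    /// A successful restart closes the circuit and resets the counter.
    pub fn record_success(&mut self) {
        self.state = Circuit::Closed;
        self.consecutive_failures = 0;
        self.opened_at = None;
    }

    /// A failure in half-open, or the Nth consecutive failure, opens the circuit.
    pub fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.state == Circuit::HalfOpen
            || self.consecutive_failures >= self.failure_threshold
        {
            self.state = Circuit::Open;
            self.opened_at = Some(now);
        }
    }

    pub fn state(&self) -> Circuit {
        self.state
    }
}
```

With CircuitBreaker::new(5, Duration::from_secs(300)), the fifth consecutive failure opens the circuit and the next restart is allowed 300 seconds later. The sketch relies on the caller recording the result of the half-open trial before asking again.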

3. Per-Jail State Tracking

Each jail has its own:

  • Attempt counter
  • Backoff calculator
  • Circuit breaker

A failing Redis jail doesn’t affect the PostgreSQL jail’s restart behavior. Isolation at every level.

Combining the Patterns

Jail 'web' crashes
├── Attempt 1: Wait 1.2s (jittered), restart → fails
├── Attempt 2: Wait 2.4s, restart → fails
├── Attempt 3: Wait 4.1s, restart → fails
├── Attempt 4: Wait 8.3s, restart → fails
├── Attempt 5: Wait 15.9s, restart → fails
├── Circuit OPENS (5 failures reached)
├── No restarts for 5 minutes
├── Circuit HALF-OPEN
├── Attempt 6: restart → succeeds!
├── Circuit CLOSED, attempt counter reset
└── Normal operation

Lifecycle Hooks: Extensibility Without Complexity

Hooks run at defined phases. Each hook specifies:

  • Phase: When to run (pre_start, post_start, pre_stop, post_stop, etc.)
  • Target: Where to run (host or jail)
  • Command: What to run
  • On Failure: What to do if it fails (abort or continue)

[[jails.hooks]]
phase = "post_start"
target = "jail"
command = "/etc/rc.d/nginx start"
on_failure = "abort"

[[jails.hooks]]
phase = "pre_stop"
target = "jail"
command = "/etc/rc.d/nginx stop"
on_failure = "continue"

Execution Flow

sequenceDiagram
    participant CLI as blackship up
    participant SM as State Machine
    participant Hooks as Hook Runner
    participant Jail as Jail FFI
    participant Net as Network

    CLI->>SM: start()
    SM->>SM: Stopped → Starting
    SM->>Hooks: pre_start (host)
    Hooks-->>SM: ok
    SM->>Jail: jail_set()
    Jail-->>SM: jid
    SM->>Hooks: post_create
    Hooks-->>SM: ok
    SM->>Net: setup VNET
    Net-->>SM: ok
    SM->>Hooks: pre_start (jail)
    Hooks-->>SM: ok
    SM->>SM: started()
    SM->>SM: Starting → Running
    SM->>Hooks: post_start (jail)
    Hooks-->>SM: ok
    SM-->>CLI: Running

If any hook with on_failure = "abort" fails, the entire operation aborts and the jail transitions to Failed.
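The abort-or-continue rule can be sketched as a small loop over the hooks of one phase. The Hook struct, the run_phase function, and the closure standing in for actual command execution are all assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OnFailure {
    Abort,
    Continue,
}

pub struct Hook {
    pub phase: &'static str,
    pub command: &'static str,
    pub on_failure: OnFailure,
}

/// Runs every hook registered for `phase`, in declaration order.
/// `run` stands in for executing the command and returns true on success.
/// An Err result is the caller's cue to fire fail() on the state machine.
pub fn run_phase<F>(hooks: &[Hook], phase: &str, mut run: F) -> Result<(), String>
where
    F: FnMut(&Hook) -> bool,
{
    for hook in hooks.iter().filter(|h| h.phase == phase) {
        if run(hook) {
            continue;
        }
        match hook.on_failure {
            // Stop immediately and skip any remaining hooks in this phase.
            OnFailure::Abort => {
                return Err(format!("hook '{}' failed in phase {}", hook.command, phase));
            }
            // Tolerated failure: carry on with the next hook.
            OnFailure::Continue => {}
        }
    }
    Ok(())
}
```

Keeping command execution behind a closure means the failure policy can be tested without spawning processes.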

Variable Substitution

Hooks support variable substitution:

command = "/path/to/script --jail ${JAIL_NAME} --path ${JAIL_PATH}"

Available variables: JAIL_NAME, JAIL_PATH, JAIL_IP, JAIL_HOSTNAME, and any custom environment variables.
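A substitution pass like this can be sketched in a few lines. The substitute function is hypothetical, and so is its policy of leaving unknown or unterminated placeholders untouched, since the post does not say how Blackship handles those cases.

```rust
use std::collections::HashMap;

/// Replaces each ${NAME} in `template` with its value from `vars`.
/// Unknown names and unterminated placeholders are copied through unchanged
/// (an assumption made for this sketch).
pub fn substitute(template: &str, vars: &HashMap<&str, &str>) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find("${") {
        // Copy the literal text before the placeholder.
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                let name = &after[..end];
                match vars.get(name) {
                    Some(value) => out.push_str(value),
                    None => {
                        out.push_str("${");
                        out.push_str(name);
                        out.push('}');
                    }
                }
                rest = &after[end + 1..];
            }
            None => {
                // No closing brace: keep the remainder verbatim and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

With JAIL_NAME set to web and JAIL_PATH set to an illustrative /jails/web, the command above expands to /path/to/script --jail web --path /jails/web.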

ZFS Integration: Not Bolted On

ZFS isn’t an afterthought. The entire data model assumes ZFS:

zroot/blackship/
├── jails/
│   ├── web/
│   ├── postgres/
│   └── redis/
├── releases/
│   └── 15.0-RELEASE/
└── cache/

Snapshots as First-Class Operations

blackship snapshot create web pre-upgrade

This creates zroot/blackship/jails/web@pre-upgrade. Atomic. Consistent. No tar.gz nonsense.

Clones for Testing

blackship clone web@pre-upgrade web-test

This creates zroot/blackship/jails/web-test as a clone of the snapshot. Copy-on-write. Instant. Uses almost no additional disk space until you make changes.

Export with ZFS Send

blackship export web -o backup.zfs --zfs-send

Uses zfs send to create a stream. Faster than tar. Preserves all ZFS properties.

Import with ZFS Receive

blackship import backup.zfs --name web-restored

Auto-detects format (tar.zst or ZFS stream) and handles appropriately.

VNET Networking Architecture

graph TB
    subgraph Host["Host System"]
        bridge["blackship0 Bridge<br/>gateway: 10.0.1.1"]
        epair0a["epair0a"]
        epair1a["epair1a"]
    end

    subgraph web["Jail: web"]
        epair0b["epair0b<br/>10.0.1.10"]
    end

    subgraph db["Jail: db"]
        epair1b["epair1b<br/>10.0.1.11"]
    end

    bridge --- epair0a
    bridge --- epair1a
    epair0a <--> epair0b
    epair1a <--> epair1b

Each jail gets:

  • An epair interface (virtual ethernet pair)
  • One end attached to the bridge (host-side)
  • One end inside the jail
  • Static IP on the jail-side interface
  • Gateway pointing to the bridge IP

PF Integration via Anchors

Port forwarding uses PF anchors, so /etc/pf.conf needs only a one-time edit to declare them:

# Added to /etc/pf.conf once
rdr-anchor "blackship"
anchor "blackship"

Blackship manages rules inside the anchor:

blackship expose web -p 80
# Adds: rdr pass on $ext_if proto tcp to port 80 -> 10.0.1.10 port 80

blackship expose web -p 443 -I 192.168.1.100
# Adds: rdr pass on $ext_if proto tcp from any to 192.168.1.100 port 443 -> 10.0.1.10 port 443

No manual PF editing. No config file conflicts.
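The two rule shapes shown in the comments above could be generated by a function along these lines. The rdr_rule name and signature are assumptions. Only the output strings come from the examples in this post.

```rust
/// Builds the rdr rule text for one exposed TCP port.
/// `external_ip` mirrors the optional -I flag, and `jail_ip` is the jail-side address.
pub fn rdr_rule(external_ip: Option<&str>, port: u16, jail_ip: &str) -> String {
    match external_ip {
        // Restrict the redirect to one external address.
        Some(ip) => format!(
            "rdr pass on $ext_if proto tcp from any to {} port {} -> {} port {}",
            ip, port, jail_ip, port
        ),
        // Redirect the port on any address of the external interface.
        None => format!(
            "rdr pass on $ext_if proto tcp to port {} -> {} port {}",
            port, jail_ip, port
        ),
    }
}
```

Generating rules as plain text keeps them easy to inspect before they are loaded into the anchor.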

The Bridge: Central Orchestrator

All operations go through the Bridge (not the network bridge - the orchestration component):

graph TB
    subgraph Bridge["Bridge (Central Orchestrator)"]
        manifest["Manifest<br/>(TOML config)"]
        network["Network<br/>Manager"]
        zfs["ZFS<br/>Manager"]
        hooks["Hook<br/>Runner"]
        health["Health<br/>Checker"]
        ffi["Jail<br/>FFI"]
    end

    subgraph Warden["Warden (Supervisor)"]
        backoff["Exponential<br/>Backoff"]
        breaker["Circuit<br/>Breaker"]
        restart["Restart<br/>Logic"]
    end

    Bridge --> Warden
    health --> Warden

The Bridge:

  1. Loads and validates the manifest (TOML config)
  2. Builds the dependency graph
  3. Coordinates with Network Manager for VNET setup
  4. Delegates to ZFS Manager for dataset operations
  5. Runs hooks at appropriate lifecycle phases
  6. Calls Jail FFI for actual jail operations
  7. Reports events to the Warden for supervision

Health Check Architecture

Health checks are command-based. Exit code determines health:

  • Exit 0: Healthy
  • Exit non-zero: Unhealthy

[[jails.healthcheck.checks]]
name = "http"
command = "curl -sf http://localhost:8080/health"
target = "jail"
interval = 30
timeout = 10
retries = 3

Execution Model

  1. Health checks run on separate threads (via crossbeam)
  2. Each check has its own timeout
  3. After retries consecutive failures (3 in the example above), the jail is marked unhealthy
  4. Unhealthy status is reported to the Warden
  5. Warden applies restart logic with circuit breaker
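Step 3 amounts to a counter that resets on success. Here is a minimal sketch with hypothetical names (HealthTracker, record). It treats a timed-out check as a failure, which is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Healthy,
    Unhealthy,
}

pub struct HealthTracker {
    retries: u32,
    consecutive_failures: u32,
}

impl HealthTracker {
    /// `retries` is the number of consecutive failures tolerated before the
    /// jail is reported unhealthy (at least 1).
    pub fn new(retries: u32) -> Self {
        HealthTracker {
            retries: retries.max(1),
            consecutive_failures: 0,
        }
    }

    /// Records one check result. `exit_code` is None when the check timed out.
    pub fn record(&mut self, exit_code: Option<i32>) -> Health {
        if exit_code == Some(0) {
            // Any success resets the streak.
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
        }
        if self.consecutive_failures >= self.retries {
            Health::Unhealthy
        } else {
            Health::Healthy
        }
    }
}
```

With retries = 3, two failures followed by a success leave the jail healthy. Only three failures in a row produce the Unhealthy result that would be reported to the Warden.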

Target Semantics

  • target = "host": Command runs on the host, can check external ports
  • target = "jail": Command runs inside the jail via jexec

Direct Kernel Communication

Blackship doesn’t shell out to jail(8) or ifconfig(8) for core operations. It talks directly to the kernel via jail(2) and ioctl(2) syscalls.

What this means for you:

  • Faster startup when launching multiple jails
  • No parsing command output that changes between FreeBSD versions
  • Health checks don’t spawn processes every 30 seconds

What still uses commands:

  • ZFS operations (zfs(8)) - no stable kernel API available
  • PF rules (pfctl(8)) - anchor-based, doesn’t touch /etc/pf.conf after the one-time anchor setup

Key Architectural Decisions

1. State Machines Over Flags

Flags lie. A boolean is_running doesn’t capture “starting”, “stopping”, or “failed but still has a PID”. State machines make these explicit.

2. Graphs Over Lists

Dependencies aren’t flat. A depends on B and C. B depends on C. Representing this as a graph allows proper ordering, cycle detection, and transitive dependency resolution.

3. Circuit Breakers Over Infinite Retries

Systems fail. Sometimes they can’t be fixed by restarting. Circuit breakers recognize this and stop trying, preserving resources for jails that can actually run.

4. ZFS Native Over Abstraction Layers

Many tools treat ZFS as optional. Blackship assumes ZFS for its data model. Snapshots, clones, and send/receive are first-class operations, not afterthoughts.

5. Hooks Over Magic

Instead of hardcoding nginx startup or PostgreSQL initialization, hooks let users define what happens at each lifecycle phase. Maximum flexibility, zero magic.


That’s the architecture. State machines for lifecycle. Graphs for dependencies. Circuit breakers for resilience. ZFS for storage. Hooks for extensibility.
