Island

Architecture of the Archipelag.io Island software — the Rust service that executes Cargos on Islands

The Island software is a Rust service that runs on Islands. It registers with the coordinator, receives job assignments via NATS, executes Cargos in sandboxed containers or WASM, and streams results back in real time.

Module Map

node-agent/src/
├── main.rs           # CLI entry point, tracing setup, mode dispatch
├── agent.rs          # Agent struct, main event loop, job execution
├── config.rs         # Configuration loading (TOML + env)
├── nats.rs           # NATS client, subjects, message types
├── messages.rs       # Workload I/O protocol (ChatInput, WorkloadOutput)
├── docker.rs         # Container execution, sandbox tiers, seccomp
├── wasm.rs           # WASM execution via Wasmtime
├── executor.rs       # Test job helper (--test-job mode)
├── cache.rs          # Image caching, warm Cargo tracking
├── state.rs          # Persistent agent state, WASM module cache
├── model_cache.rs    # ML model download, HF URI resolution, LRU eviction
├── preload.rs        # Hardware-aware startup model preloading
├── onnx.rs           # Native ONNX inference (feature-gated: onnx)
├── gguf.rs           # Native GGUF/llama.cpp inference (feature-gated: gguf)
├── diffusers.rs      # Native Stable Diffusion via candle (feature-gated: diffusers)
├── metrics/
│   ├── mod.rs        # System metrics (CPU, RAM, disk, network)
│   ├── gpu.rs        # GPU metrics via nvidia-smi
│   └── container.rs  # Per-container resource usage
├── security/
│   ├── mod.rs        # Security module re-exports
│   ├── registry.rs   # Registry allowlist enforcement
│   ├── seccomp.rs    # Seccomp profiles (Minimal/Default/GPU/Network)
│   ├── signing.rs    # Cosign signature verification
│   └── tls.rs        # TLS certificate pinning (stubbed)
└── update/
    ├── mod.rs        # Update checker (polls coordinator)
    ├── download.rs   # Binary download with progress tracking
    ├── verify.rs     # Ed25519 + SHA256 binary verification
    └── restart.rs    # Graceful restart with backup and rollback

Agent Lifecycle

The Agent struct is the central orchestrator. On startup it:

Generates or loads a host ID — persisted to ~/.archipelag/state.json
Connects to Docker — via local socket (/var/run/docker.sock)
Initializes the cache manager — scans existing Docker images, optionally pre-pulls configured images
Loads signing keys — fetches trusted public keys from the coordinator, falls back to disk cache
Configures the registry allowlist — defaults to ghcr.io/archipelag-io, docker.io/archipelag, docker.io/library
Connects to NATS — with infinite reconnect and retry on initial connect
Registers with the coordinator — sends capabilities (CPU, RAM, GPU, platform)
Enters the main event loop

Main Event Loop

The agent runs a tokio::select! loop that multiplexes periodic timers, NATS subscriptions, the job-completion channel, and the shutdown signal:

┌──────────────────────────────────────────────────┐
│                  Agent Event Loop                │
│                                                  │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────┐ │
│  │  Heartbeat   │  │ Job Assign   │  │ Cancel  │ │
│  │  (10s tick)  │  │ (NATS sub)   │  │ (NATS)  │ │
│  └──────┬───────┘  └──────┬───────┘  └────┬────┘ │
│         │                 │               │      │
│  ┌──────┴───────┐  ┌──────┴───────┐       │      │
│  │ Cache Cleanup│  │ Job Complete │       │      │
│  │  (5 min)     │  │  (mpsc rx)   │       │      │
│  └──────────────┘  └──────────────┘       │      │
│                                           │      │
│  ┌──────────────┐  ┌──────────────┐       │      │
│  │ Update Check │  │   Ctrl+C     │       │      │
│  │  (30 min)    │  │  (shutdown)  │       │      │
│  └──────────────┘  └──────────────┘       │      │
└──────────────────────────────────────────────────┘

Event	Trigger	Action
Heartbeat	10 seconds	Sends enhanced heartbeat with system/GPU/cache metrics; reconnects after 3 failures
Job assignment	On message	Spawns job execution task, tracks in `active_jobs` counter
Job completion	On channel	Decrements `active_jobs`, publishes final status
Cancel request	On message	Signals cancellation via `watch` channel to running job
Cache cleanup	5 minutes	Evicts LRU images, removes stale warm Cargo entries
Update check	30 minutes	Polls coordinator for new agent version
Shutdown	Ctrl+C	Sets shutdown flag, waits for active jobs

Heartbeat recovery

If heartbeat publishing fails three times in a row, the agent reconnects and resubscribes to NATS using core subscriptions. Retries use exponential backoff (capped at 30 seconds) plus jitter.
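
The retry delay can be sketched as follows. This is a minimal illustration rather than the agent's actual code: the function name `backoff_delay`, the doubling factor, and the 25% jitter range are assumptions. Only the 30-second cap and the presence of jitter come from the description above.

```rust
use std::time::Duration;

/// Cap on the base delay, taken from the 30-second limit described above.
const MAX_BACKOFF: Duration = Duration::from_secs(30);

/// Returns the delay before retry number `attempt` (0-based).
/// `jitter_fraction` is a value in [0.0, 1.0] that the caller draws from
/// its own random source, which keeps this function deterministic.
fn backoff_delay(attempt: u32, base: Duration, jitter_fraction: f64) -> Duration {
    // Double the base delay per attempt, saturating instead of overflowing.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    let capped = base.saturating_mul(factor).min(MAX_BACKOFF);
    // Assumption: jitter adds up to 25% of the capped delay.
    let jitter = capped.mul_f64(0.25 * jitter_fraction.clamp(0.0, 1.0));
    capped + jitter
}
```

With a 1-second base and no jitter, attempts 0, 1, 2, 3 wait 1, 2, 4, and 8 seconds, and every attempt from the fifth onward waits the capped 30 seconds. Jitter is added on top of the cap so that many Islands recovering at once do not retry in lockstep.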

Job Execution Pipeline

When a job assignment arrives, the agent follows this pipeline:

Job received (NATS)
        │
        ▼
  Parse AssignJob message
        │
        ▼
  Publish "started" status
        │
        ├── runtime_type == "wasm"
        │         │
        │         ▼
        │   Download/cache WASM ──► Verify SHA256 hash
        │         │
        │         ▼
        │   WasmExecutor.run()
        │         │
        │         ▼
        │   Process output stream
        │
        ├── runtime_type == "container" (default)
        │         │
        │         ▼
        │   Registry allowlist check ──► Reject if not allowed
        │         │
        │         ▼
        │   Cache: ensure_image() ──► Pull if not cached
        │         │
        │         ▼
        │   Verify signature (cosign) ──► Skip/fail per config
        │         │
        │         ▼
        │   Apply sandbox tier ──► Set resource limits + seccomp
        │         │
        │         ▼
        │   run_verified_container()
        │         │
        │         ▼
        │   Stream output (JSON Lines)
        │
        ▼
  Publish final status (succeeded/failed)
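
The branch on runtime_type in the diagram can be sketched as a small dispatch function. The `Runtime` enum, `select_runtime`, and `pipeline_steps` are hypothetical names for illustration, and rejecting unknown values is an assumption. The step names are copied from the diagram.

```rust
/// Execution path for a job. The enum is hypothetical; the variants mirror
/// the `runtime_type` values shown in the diagram.
#[derive(Debug, PartialEq)]
enum Runtime {
    Wasm,
    Container,
}

/// Maps the `runtime_type` field of an assignment to an execution path.
/// A missing field falls back to the container path, the documented default.
fn select_runtime(runtime_type: Option<&str>) -> Result<Runtime, String> {
    match runtime_type {
        Some("wasm") => Ok(Runtime::Wasm),
        Some("container") | None => Ok(Runtime::Container),
        // Assumption: unknown values are rejected rather than defaulted.
        Some(other) => Err(format!("unsupported runtime_type: {other}")),
    }
}

/// Ordered steps for each path, as listed in the diagram.
fn pipeline_steps(runtime: &Runtime) -> &'static [&'static str] {
    match runtime {
        Runtime::Wasm => &[
            "download or load cached module",
            "verify SHA256 hash",
            "run in WasmExecutor",
            "process output stream",
        ],
        Runtime::Container => &[
            "registry allowlist check",
            "ensure image is cached",
            "verify cosign signature",
            "apply sandbox tier",
            "run verified container",
            "stream JSON Lines output",
        ],
    }
}
```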

During execution, the agent also handles:

Lease renewal every 30 seconds (each renewal extends the lease by 60 seconds, up to 5 renewals in total)
Cancel requests via a watch channel — stops the container/WASM
Output streaming — each JSON line is parsed and published to NATS
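
The lease budget can be modeled with a small counter. This is a sketch under stated assumptions: the `Lease` type is hypothetical, "up to 5 total" is read as five renewals per job, and the first renewal is assumed to fire one interval after the job starts. What the agent does once the budget is spent is not specified here.

```rust
use std::time::Duration;

const RENEW_INTERVAL: Duration = Duration::from_secs(30);
const EXTENSION: Duration = Duration::from_secs(60);
const MAX_RENEWALS: u32 = 5;

/// Tracks how many lease renewals a job has used. Hypothetical type.
struct Lease {
    renewals: u32,
}

impl Lease {
    fn new() -> Self {
        Lease { renewals: 0 }
    }

    /// Called on every renewal tick. Returns the extension to request,
    /// or `None` once the renewal budget is spent.
    fn renew(&mut self) -> Option<Duration> {
        if self.renewals >= MAX_RENEWALS {
            return None;
        }
        self.renewals += 1;
        Some(EXTENSION)
    }
}

/// Elapsed job time at which the final renewal is sent, assuming the first
/// renewal fires one interval after the job starts.
fn last_renewal_at() -> Duration {
    RENEW_INTERVAL * MAX_RENEWALS
}
```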

Container Execution

Containers are executed via the Bollard Docker client with security controls applied per sandbox tier.

ContainerConfig

Field	Default	Purpose
`image`	—	Container image reference
`input`	—	JSON input passed via stdin
`gpu_devices`	none	GPU device passthrough (e.g., `all`)
`timeout_seconds`	300	Maximum execution time
`memory_bytes`	8 GB	Memory limit
`read_only_rootfs`	true	Immutable root filesystem
`tmpfs_mounts`	256 MB at `/tmp`	Writable temporary storage
`cpu_quota`	none	CPU time limit (microseconds per 100ms)
`network_disabled`	true	No network access by default
`sandbox_tier`	`standard`	Security tier (see below)
`seccomp_profile`	auto	Seccomp syscall filter (derived from tier)
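
As a rough picture of how these defaults fit together, the table can be written as a Rust struct. The field names come from the table, but the field types, the `with_defaults` constructor, and the reading of "8 GB" and "256 MB" as binary units are assumptions. The image name in the usage note is a placeholder.

```rust
/// Security tier names from the tier table; the Rust types are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SandboxTier {
    Restricted,
    Standard,
    Elevated,
}

#[derive(Debug, Clone)]
struct ContainerConfig {
    image: String,
    input: String,                    // JSON passed on stdin
    gpu_devices: Option<String>,      // e.g. Some("all")
    timeout_seconds: u64,
    memory_bytes: u64,
    read_only_rootfs: bool,
    tmpfs_mounts: Vec<(String, u64)>, // (mount path, size in bytes)
    cpu_quota: Option<i64>,           // microseconds per 100 ms period
    network_disabled: bool,
    sandbox_tier: SandboxTier,
    seccomp_profile: Option<String>,  // None = derive from the tier
}

const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB;

impl ContainerConfig {
    /// Builds a config with the documented defaults for everything except
    /// the image and input, which have no default.
    fn with_defaults(image: &str, input: &str) -> Self {
        ContainerConfig {
            image: image.to_string(),
            input: input.to_string(),
            gpu_devices: None,
            timeout_seconds: 300,
            memory_bytes: 8 * GIB,
            read_only_rootfs: true,
            tmpfs_mounts: vec![("/tmp".to_string(), 256 * MIB)],
            cpu_quota: None,
            network_disabled: true,
            sandbox_tier: SandboxTier::Standard,
            seccomp_profile: None,
        }
    }
}
```

A call such as `ContainerConfig::with_defaults("ghcr.io/archipelag-io/example:1", "{}")` yields a read-only, network-isolated container on the standard tier. A caller overrides individual fields only when a Cargo needs more.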

Sandbox Tiers

Tier	Memory	Timeout	Network	CPUs	Seccomp Profile
`restricted`	256 MB	60s	Off	1	Minimal (~10 syscalls)
`standard`	1 GB	300s	Off	2	Default (~140 syscalls)
`elevated`	8 GB	600s	On	4	GPU or Network profile
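
The tier table maps directly onto a lookup function. In this sketch the `TierLimits` struct and both function names are hypothetical, sizes are read as binary units, and the rule for choosing between the GPU and Network profiles on the elevated tier (GPU when a GPU is requested, Network otherwise) is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum SandboxTier {
    Restricted,
    Standard,
    Elevated,
}

/// Resource limits applied for a tier. Hypothetical struct.
#[derive(Debug, PartialEq)]
struct TierLimits {
    memory_bytes: u64,
    timeout_seconds: u64,
    network_enabled: bool,
    cpus: u32,
}

const MIB: u64 = 1024 * 1024;

/// Returns the limits from the tier table.
fn limits_for(tier: SandboxTier) -> TierLimits {
    let (memory_mib, timeout_seconds, network_enabled, cpus) = match tier {
        SandboxTier::Restricted => (256, 60, false, 1),
        SandboxTier::Standard => (1024, 300, false, 2),
        SandboxTier::Elevated => (8192, 600, true, 4),
    };
    TierLimits {
        memory_bytes: memory_mib * MIB,
        timeout_seconds,
        network_enabled,
        cpus,
    }
}

/// Seccomp profile name for a tier. The elevated-tier selection rule is
/// an assumption, not something the table specifies.
fn seccomp_for(tier: SandboxTier, gpu_requested: bool) -> &'static str {
    match tier {
        SandboxTier::Restricted => "minimal",
        SandboxTier::Standard => "default",
        SandboxTier::Elevated if gpu_requested => "gpu",
        SandboxTier::Elevated => "network",
    }
}
```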

Seccomp Profiles

The Island software ships four built-in seccomp profiles:

Minimal — Only read, write, close, exit, brk, mmap, munmap, rt_sigreturn, futex, exit_group
Default — ~140 syscalls across 9 categories: file ops, process ops, memory, signals, time, scheduler, I/O multiplexing, threading, and safe miscellaneous calls
GPU — Default profile plus ioctl and mmap rules for CUDA
Network — Default profile plus 18 network syscalls (socket, bind, listen, connect, etc.)
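
To make the Minimal profile concrete, the sketch below renders its ten syscalls as a Docker-style seccomp JSON document: deny by default and allow only the listed names. The function names are hypothetical, and whether the Island software builds its profiles in exactly this shape is an assumption.

```rust
/// The ten syscalls listed for the Minimal profile.
const MINIMAL_SYSCALLS: [&str; 10] = [
    "read", "write", "close", "exit", "brk",
    "mmap", "munmap", "rt_sigreturn", "futex", "exit_group",
];

/// Returns true if the Minimal profile would allow `syscall`.
fn is_allowed_minimal(syscall: &str) -> bool {
    MINIMAL_SYSCALLS.contains(&syscall)
}

/// Renders a deny-by-default profile in the JSON layout Docker accepts for
/// seccomp: every syscall fails with an error unless it is listed.
fn minimal_profile_json() -> String {
    let names: Vec<String> = MINIMAL_SYSCALLS
        .iter()
        .map(|name| format!("\"{name}\""))
        .collect();
    format!(
        "{{\"defaultAction\":\"SCMP_ACT_ERRNO\",\"syscalls\":[{{\"names\":[{}],\"action\":\"SCMP_ACT_ALLOW\"}}]}}",
        names.join(",")
    )
}
```

The other three profiles follow the same pattern with longer allow lists. Network syscalls such as socket are absent from Minimal, so a restricted Cargo cannot open a connection even if container networking were misconfigured.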

WASM Execution

WASM Cargos run in a Wasmtime sandbox with strict resource limits:

Limit	Default	Purpose
Max memory	256 MB	Prevents memory exhaustion
Max fuel	100 billion	Instruction count limit (prevents infinite loops)
Timeout	60 seconds	Wall-clock execution limit

The executor verifies SHA256 hashes before running modules and caches them in ~/.archipelag/wasm-cache/. WASI is used for stdio — input is provided via a memory pipe, output is captured and streamed.
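
The sketch below records the default limits and uses a toy meter to show why a fuel budget bounds execution. It does not use the Wasmtime API: `WasmLimits` and `FuelMeter` are hypothetical types, and the memory limit is read as binary megabytes.

```rust
use std::time::Duration;

/// Default limits from the table above. Hypothetical struct.
struct WasmLimits {
    max_memory_bytes: u64,
    max_fuel: u64,
    timeout: Duration,
}

impl Default for WasmLimits {
    fn default() -> Self {
        WasmLimits {
            max_memory_bytes: 256 * 1024 * 1024,
            max_fuel: 100_000_000_000,
            timeout: Duration::from_secs(60),
        }
    }
}

/// Toy fuel meter: every executed operation spends fuel, and execution
/// stops with an error once the budget runs out.
struct FuelMeter {
    remaining: u64,
}

impl FuelMeter {
    fn new(limits: &WasmLimits) -> Self {
        FuelMeter { remaining: limits.max_fuel }
    }

    fn consume(&mut self, units: u64) -> Result<(), &'static str> {
        match self.remaining.checked_sub(units) {
            Some(left) => {
                self.remaining = left;
                Ok(())
            }
            None => {
                self.remaining = 0;
                Err("out of fuel")
            }
        }
    }
}
```

An infinite loop keeps spending fuel and eventually hits the error branch. The wall-clock timeout separately covers a module that is blocked rather than computing.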

Output Streaming Format

Cargos communicate with the agent via JSON Lines on stdout. Each line is a WorkloadOutput message:

{"type": "status", "message": "Loading model..."}
{"type": "token", "content": "Hello"}
{"type": "token", "content": " world"}
{"type": "progress", "step": 3, "total": 20}
{"type": "image", "data": "<base64>", "format": "png", "width": 512, "height": 512}
{"type": "done", "usage": {"prompt_tokens": 42, "completion_tokens": 128}}
{"type": "error", "message": "Out of memory"}

Type	Purpose
`status`	Informational messages (model loading, warmup)
`token`	Individual text tokens for LLM streaming
`progress`	Step progress for multi-step Cargos (e.g., image diffusion)
`image`	Completed image data (base64-encoded)
`done`	Job complete with optional usage statistics
`error`	Error message (job will be marked failed)
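
From the Cargo side, producing this stream only requires writing one JSON object per line to stdout. The following is a dependency-free sketch: the `WorkloadOutput` enum here is a hypothetical mirror of the message types, not the definition in messages.rs. It omits `image` and the optional `usage` object for brevity. A real Cargo would more likely use a JSON library.

```rust
use std::io::Write;

/// Subset of the message types listed above.
enum WorkloadOutput {
    Status { message: String },
    Token { content: String },
    Progress { step: u32, total: u32 },
    Done,
    Error { message: String },
}

/// Escapes a string for embedding in a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl WorkloadOutput {
    /// Renders the message as one JSON object without a trailing newline.
    fn to_json_line(&self) -> String {
        match self {
            WorkloadOutput::Status { message } => {
                format!(r#"{{"type": "status", "message": "{}"}}"#, escape_json(message))
            }
            WorkloadOutput::Token { content } => {
                format!(r#"{{"type": "token", "content": "{}"}}"#, escape_json(content))
            }
            WorkloadOutput::Progress { step, total } => {
                format!(r#"{{"type": "progress", "step": {}, "total": {}}}"#, step, total)
            }
            WorkloadOutput::Done => r#"{"type": "done"}"#.to_string(),
            WorkloadOutput::Error { message } => {
                format!(r#"{{"type": "error", "message": "{}"}}"#, escape_json(message))
            }
        }
    }
}

/// Writes one message per line and flushes so the reader sees it right away.
fn emit(out: &mut impl Write, msg: &WorkloadOutput) -> std::io::Result<()> {
    writeln!(out, "{}", msg.to_json_line())?;
    out.flush()
}
```

Flushing after every line matters for streaming: with a buffered stdout, tokens would otherwise arrive in bursts instead of one at a time.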

Image Caching

The CacheManager tracks Docker images and warm Cargos to optimize job startup:

Image cache — Scans Docker images on startup, tracks last_used and use_count
Warm Cargos — Tracks recently-run Cargo IDs for affinity-based placement
LRU eviction — Removes tracking entries when max_cached_images (default 20) is exceeded
Stale cleanup — Warm Cargo entries expire after warm_ttl_seconds (default 3600)
Warmth score — 0-100 composite score reported in heartbeats (warm Cargo +50, cached image +30, use count up to +20)

The cache manager reports stats in enhanced heartbeats so the coordinator can prefer Islands with warm containers for a given Cargo.
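
The warmth score can be sketched as a sum of the three components. The weights (+50, +30, up to +20) come from the list above. The function name and the way use count maps onto its 20 points (one point per use, capped) are assumptions.

```rust
/// Composite 0-100 warmth score for one Cargo on this Island.
fn warmth_score(cargo_is_warm: bool, image_cached: bool, use_count: u32) -> u8 {
    let mut score: u32 = 0;
    if cargo_is_warm {
        score += 50; // Cargo ran recently on this Island
    }
    if image_cached {
        score += 30; // image is already present locally
    }
    // Assumption: one point per recorded use, capped at 20.
    score += use_count.min(20);
    score as u8 // at most 50 + 30 + 20 = 100, so the cast cannot truncate
}
```

An Island that has the image cached and has run it five times, but whose warm entry has expired, scores 30 + 5 = 35. A recently used, frequently run Cargo saturates at 100.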

Security Features

Registry Allowlist

Before pulling any container image, the Island software checks the image reference against a configurable allowlist. Default allowed registries:

ghcr.io/archipelag-io
docker.io/archipelag
docker.io/library

Images from unrecognized registries are rejected. The allowlist can also require digest pinning (@sha256:...) for additional integrity.
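
A prefix-based allowlist check might look like the sketch below. It assumes image references are already fully qualified (no implicit docker.io/library expansion), and the function name and `require_digest` flag are hypothetical. Requiring a `/` after the matched prefix keeps a lookalike such as docker.io/archipelag-evil from passing as docker.io/archipelag.

```rust
/// Default allowed registries, as listed above.
const DEFAULT_ALLOWLIST: [&str; 3] = [
    "ghcr.io/archipelag-io",
    "docker.io/archipelag",
    "docker.io/library",
];

/// Returns true if `image` lives under an allowed registry prefix and,
/// when `require_digest` is set, is pinned by a sha256 digest.
fn is_image_allowed(image: &str, allowlist: &[&str], require_digest: bool) -> bool {
    let registry_ok = allowlist.iter().any(|entry| {
        image
            .strip_prefix(*entry)
            // The prefix must end at a path boundary, not mid-name.
            .map_or(false, |rest| rest.starts_with('/'))
    });
    let digest_ok = !require_digest || image.contains("@sha256:");
    registry_ok && digest_ok
}
```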

Signature Verification

For trust level 2+ Cargos, the Island software verifies cosign signatures using trusted public keys fetched from the coordinator. Keys are cached locally for offline resilience. If verification fails and require_signature is set, the job is rejected.
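
The resulting policy can be summarized as a small decision function. The enum and function names are hypothetical. The behavior when verification fails but require_signature is not set (proceed unverified) is an assumption, inferred from the "Skip/fail per config" step in the pipeline diagram.

```rust
#[derive(Debug, PartialEq)]
enum SignatureDecision {
    /// Trust level below 2: no signature check is performed.
    NotRequired,
    /// Signature checked against a trusted key and found valid.
    Verified,
    /// Check failed but the config does not require signatures (assumed).
    ProceedUnverified,
    /// Check failed and `require_signature` is set: the job is rejected.
    Reject,
}

fn signature_decision(
    trust_level: u8,
    signature_valid: bool,
    require_signature: bool,
) -> SignatureDecision {
    if trust_level < 2 {
        return SignatureDecision::NotRequired;
    }
    match (signature_valid, require_signature) {
        (true, _) => SignatureDecision::Verified,
        (false, true) => SignatureDecision::Reject,
        (false, false) => SignatureDecision::ProceedUnverified,
    }
}
```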

Self-Update

The Island software checks for updates every 30 minutes. The update flow:

Query coordinator for latest version
Download new binary with progress tracking
Verify Ed25519 signature and SHA256 checksum
Wait for active jobs to complete (5 min normal, 30s for critical updates)
Back up current binary to ~/.archipelag/backups/
Replace binary and exec into new version
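
The drain step before restart can be sketched as a timeout check. The function names are hypothetical, and the behavior when the window elapses with jobs still running (restart anyway) is an assumption. Only the two durations come from the list above.

```rust
use std::time::Duration;

/// Maximum time to wait for active jobs before restarting.
fn drain_timeout(critical: bool) -> Duration {
    if critical {
        Duration::from_secs(30)
    } else {
        Duration::from_secs(5 * 60)
    }
}

/// Returns true once the agent may replace its binary: either no jobs
/// remain, or the drain window for this update type has elapsed.
fn ready_to_restart(active_jobs: u32, waited: Duration, critical: bool) -> bool {
    active_jobs == 0 || waited >= drain_timeout(critical)
}
```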

Next Steps

{% card(title="Coordinator", href="/architecture/coordinator/") %} Learn how the coordinator orchestrates Islands and dispatches jobs.

Cargos

Understand the Cargo model — runtime types, trust levels, and I/O protocol.

For Islands

Get started running the Island software on your machine.

{% end %}