Island
Architecture of the Archipelag.io Island software — the Rust service that executes Cargos on Islands
The Island software is a Rust service that runs on Islands. It registers with the coordinator, receives job assignments via NATS, executes Cargos in sandboxed containers or WASM, and streams results back in real time.
Module Map
node-agent/src/
├── main.rs # CLI entry point, tracing setup, mode dispatch
├── agent.rs # Agent struct, main event loop, job execution
├── config.rs # Configuration loading (TOML + env)
├── nats.rs # NATS client, subjects, message types
├── messages.rs # Workload I/O protocol (ChatInput, WorkloadOutput)
├── docker.rs # Container execution, sandbox tiers, seccomp
├── wasm.rs # WASM execution via Wasmtime
├── executor.rs # Test job helper (--test-job mode)
├── cache.rs # Image caching, warm Cargo tracking
├── state.rs # Persistent agent state, WASM module cache
├── model_cache.rs # ML model download, HF URI resolution, LRU eviction
├── preload.rs # Hardware-aware startup model preloading
├── onnx.rs # Native ONNX inference (feature-gated: onnx)
├── gguf.rs # Native GGUF/llama.cpp inference (feature-gated: gguf)
├── diffusers.rs # Native Stable Diffusion via candle (feature-gated: diffusers)
├── metrics/
│   ├── mod.rs # System metrics (CPU, RAM, disk, network)
│   ├── gpu.rs # GPU metrics via nvidia-smi
│   └── container.rs # Per-container resource usage
├── security/
│   ├── mod.rs # Security module re-exports
│   ├── registry.rs # Registry allowlist enforcement
│   ├── seccomp.rs # Seccomp profiles (Minimal/Default/GPU/Network)
│   ├── signing.rs # Cosign signature verification
│   └── tls.rs # TLS certificate pinning (stubbed)
└── update/
    ├── mod.rs # Update checker (polls coordinator)
    ├── download.rs # Binary download with progress tracking
    ├── verify.rs # Ed25519 + SHA256 binary verification
    └── restart.rs # Graceful restart with backup and rollback
Agent Lifecycle
The Agent struct is the central orchestrator. On startup it:
- Generates or loads a host ID, persisted to `~/.archipelag/state.json`
- Connects to Docker via the local socket (`/var/run/docker.sock`)
- Initializes the cache manager: scans existing Docker images and optionally pre-pulls configured images
- Loads signing keys: fetches trusted public keys from the coordinator, falling back to the disk cache
- Configures the registry allowlist, which defaults to `ghcr.io/archipelag-io`, `docker.io/archipelag`, `docker.io/library`
- Connects to NATS with infinite reconnect and retry on initial connect
- Registers with the coordinator, sending capabilities (CPU, RAM, GPU, platform)
- Enters the main event loop
Main Event Loop
The agent runs a tokio::select! loop that handles multiple concurrent concerns:
┌──────────────────────────────────────────────────┐
│                 Agent Event Loop                 │
│                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────┐ │
│  │  Heartbeat   │  │  Job Assign  │  │ Cancel  │ │
│  │  (10s tick)  │  │  (NATS sub)  │  │ (NATS)  │ │
│  └──────┬───────┘  └──────┬───────┘  └────┬────┘ │
│         │                 │               │      │
│  ┌──────┴───────┐  ┌──────┴───────┐       │      │
│  │ Cache Cleanup│  │ Job Complete │       │      │
│  │   (5 min)    │  │  (mpsc rx)   │       │      │
│  └──────────────┘  └──────────────┘       │      │
│                                           │      │
│  ┌──────────────┐  ┌──────────────┐       │      │
│  │ Update Check │  │    Ctrl+C    │       │      │
│  │   (30 min)   │  │  (shutdown)  │       │      │
│  └──────────────┘  └──────────────┘       │      │
└──────────────────────────────────────────────────┘
| Event | Interval | Action |
|---|---|---|
| Heartbeat | 10 seconds | Sends enhanced heartbeat with system/GPU/cache metrics; reconnects after 3 failures |
| Job assignment | On message | Spawns job execution task, tracks in active_jobs counter |
| Job completion | On channel | Decrements active_jobs, publishes final status |
| Cancel request | On message | Signals cancellation via watch channel to running job |
| Cache cleanup | 5 minutes | Evicts LRU images, removes stale warm Cargo entries |
| Update check | 30 minutes | Polls coordinator for new agent version |
| Shutdown | Ctrl+C | Sets shutdown flag, waits for active jobs |
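The bookkeeping behind the table can be sketched without any async runtime. The following is a minimal, std-only illustration and not the actual `agent.rs` code: the `Event`, `Action`, and `LoopState` names are hypothetical, and it assumes that the three heartbeat failures are consecutive and that the counter resets after a successful send or a reconnect.

```rust
/// Events the loop reacts to (hypothetical names, mirroring the table above).
#[derive(Debug)]
enum Event {
    HeartbeatTick { sent_ok: bool },
    JobAssigned,
    JobCompleted,
    ShutdownRequested,
}

/// What the loop should do in response to an event.
#[derive(Debug, PartialEq)]
enum Action {
    Nothing,
    Reconnect,
    SpawnJob,
    PublishFinalStatus,
}

/// Assumption: failures are counted consecutively.
const MAX_HEARTBEAT_FAILURES: u32 = 3;

#[derive(Default)]
struct LoopState {
    active_jobs: u32,
    heartbeat_failures: u32,
    shutting_down: bool,
}

impl LoopState {
    /// Updates counters and flags for one event and returns the follow-up action.
    fn handle(&mut self, event: Event) -> Action {
        match event {
            Event::HeartbeatTick { sent_ok: true } => {
                self.heartbeat_failures = 0;
                Action::Nothing
            }
            Event::HeartbeatTick { sent_ok: false } => {
                self.heartbeat_failures += 1;
                if self.heartbeat_failures >= MAX_HEARTBEAT_FAILURES {
                    self.heartbeat_failures = 0;
                    Action::Reconnect
                } else {
                    Action::Nothing
                }
            }
            Event::JobAssigned => {
                self.active_jobs += 1;
                Action::SpawnJob
            }
            Event::JobCompleted => {
                self.active_jobs = self.active_jobs.saturating_sub(1);
                Action::PublishFinalStatus
            }
            Event::ShutdownRequested => {
                self.shutting_down = true;
                Action::Nothing
            }
        }
    }

    /// Shutdown waits for active jobs, so exit only once none remain.
    fn should_exit(&self) -> bool {
        self.shutting_down && self.active_jobs == 0
    }
}
```

In the real agent each arm would sit behind a branch of the `tokio::select!` loop; keeping the state transitions in a plain method like this makes them easy to unit test.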
Job Execution Pipeline
When a job assignment arrives, the agent follows this pipeline:
Job received (NATS)
│
▼
Parse AssignJob message
│
▼
Publish "started" status
│
├── runtime_type == "wasm"
│ │
│ ▼
│ Download/cache WASM ──► Verify SHA256 hash
│ │
│ ▼
│ WasmExecutor.run()
│ │
│ ▼
│ Process output stream
│
├── runtime_type == "container" (default)
│ │
│ ▼
│ Registry allowlist check ──► Reject if not allowed
│ │
│ ▼
│ Cache: ensure_image() ──► Pull if not cached
│ │
│ ▼
│ Verify signature (cosign) ──► Skip/fail per config
│ │
│ ▼
│ Apply sandbox tier ──► Set resource limits + seccomp
│ │
│ ▼
│ run_verified_container()
│ │
│ ▼
│ Stream output (JSON Lines)
│
▼
Publish final status (succeeded/failed)
During execution, the agent also handles:
- Lease renewal every 30 seconds (60-second extensions, up to 5 total)
- Cancel requests via a `watch` channel, which stops the container or WASM module
- Output streaming: each JSON line is parsed and published to NATS
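The lease-renewal budget can be illustrated with a small model. This is a sketch under stated assumptions rather than the agent's implementation: the `Lease` type and `lease_covers` helper are hypothetical, the initial lease length is passed in because the text does not specify it, and each renewal is assumed to move the expiry to 60 seconds after the moment of renewal.

```rust
use std::time::Duration;

const RENEW_INTERVAL: Duration = Duration::from_secs(30);
const EXTENSION: Duration = Duration::from_secs(60);
const MAX_RENEWALS: u32 = 5;

/// Hypothetical lease model; times are measured from job start.
struct Lease {
    expires_at: Duration,
    renewals_used: u32,
}

impl Lease {
    fn new(initial: Duration) -> Self {
        Lease { expires_at: initial, renewals_used: 0 }
    }

    /// Tries to extend the lease; returns false once the budget is spent.
    /// Assumption: a renewal sets the expiry to `now + EXTENSION`.
    fn renew(&mut self, now: Duration) -> bool {
        if self.renewals_used >= MAX_RENEWALS {
            return false;
        }
        self.expires_at = now + EXTENSION;
        self.renewals_used += 1;
        true
    }

    fn is_expired(&self, now: Duration) -> bool {
        now >= self.expires_at
    }
}

/// Simulates renewal ticks and reports whether the lease outlasts the job.
fn lease_covers(job_runtime: Duration, initial: Duration) -> bool {
    let mut lease = Lease::new(initial);
    let mut now = RENEW_INTERVAL;
    while now < job_runtime {
        if lease.is_expired(now) {
            return false;
        }
        lease.renew(now); // a refused renewal simply leaves the expiry unchanged
        now += RENEW_INTERVAL;
    }
    !lease.is_expired(job_runtime)
}
```

Under these assumptions and a 60-second initial lease, the fifth renewal happens at 150 seconds and the lease runs out at 210 seconds, so a job needing longer than that would lose its lease.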
Container Execution
Containers are executed via the Bollard Docker client with security controls applied per sandbox tier.
ContainerConfig
| Field | Default | Purpose |
|---|---|---|
| `image` | — | Container image reference |
| `input` | — | JSON input passed via stdin |
| `gpu_devices` | none | GPU device passthrough (e.g., `all`) |
| `timeout_seconds` | 300 | Maximum execution time |
| `memory_bytes` | 8 GB | Memory limit |
| `read_only_rootfs` | true | Immutable root filesystem |
| `tmpfs_mounts` | 256 MB at `/tmp` | Writable temporary storage |
| `cpu_quota` | none | CPU time limit (microseconds per 100 ms period) |
| `network_disabled` | true | No network access by default |
| `sandbox_tier` | standard | Security tier (see below) |
| `seccomp_profile` | auto | Seccomp syscall filter (derived from tier) |
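As a rough illustration, the defaults in the table could be expressed as a Rust struct with a constructor. The field names come from the table, but the field types, the `new` constructor, and the reading of "8 GB" and "256 MB" as binary units are assumptions; the real definition in `docker.rs` may differ.

```rust
const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB;

#[derive(Debug, Clone, PartialEq)]
enum SandboxTier {
    Restricted,
    Standard,
    Elevated,
}

/// Sketch of the configuration table; types are assumptions.
#[derive(Debug, Clone)]
struct ContainerConfig {
    image: String,
    input: String,                   // JSON passed via stdin
    gpu_devices: Option<String>,     // e.g. Some("all")
    timeout_seconds: u64,
    memory_bytes: u64,
    read_only_rootfs: bool,
    tmpfs_mounts: Vec<(String, u64)>, // (mount path, size in bytes)
    cpu_quota: Option<i64>,          // microseconds per 100 ms period
    network_disabled: bool,
    sandbox_tier: SandboxTier,
    seccomp_profile: Option<String>, // None means "derive from tier"
}

impl ContainerConfig {
    /// Builds a config with the documented defaults for everything
    /// except the two required fields.
    fn new(image: &str, input: &str) -> Self {
        ContainerConfig {
            image: image.to_string(),
            input: input.to_string(),
            gpu_devices: None,
            timeout_seconds: 300,
            memory_bytes: 8 * GIB,
            read_only_rootfs: true,
            tmpfs_mounts: vec![("/tmp".to_string(), 256 * MIB)],
            cpu_quota: None,
            network_disabled: true,
            sandbox_tier: SandboxTier::Standard,
            seccomp_profile: None,
        }
    }
}
```

A caller would then override only what a job needs, for example setting `gpu_devices` to `Some("all".to_string())` for a GPU Cargo.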
Sandbox Tiers
| Tier | Memory | Timeout | Network | CPUs | Seccomp Profile |
|---|---|---|---|---|---|
| `restricted` | 256 MB | 60s | Off | 1 | Minimal (~10 syscalls) |
| `standard` | 1 GB | 300s | Off | 2 | Default (~140 syscalls) |
| `elevated` | 8 GB | 600s | On | 4 | GPU or Network profile |
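The tier table maps naturally onto a lookup function. The sketch below is illustrative only: `TierLimits`, `from_name`, and `limits` are hypothetical names, and the memory sizes are interpreted as binary units.

```rust
const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB;

#[derive(Debug, Clone, Copy, PartialEq)]
enum SandboxTier {
    Restricted,
    Standard,
    Elevated,
}

/// Resource limits for one tier (values taken from the table above).
#[derive(Debug, PartialEq)]
struct TierLimits {
    memory_bytes: u64,
    timeout_seconds: u64,
    network_enabled: bool,
    cpus: u32,
}

impl SandboxTier {
    /// Parses the lowercase tier name used in the table.
    fn from_name(name: &str) -> Option<Self> {
        match name {
            "restricted" => Some(SandboxTier::Restricted),
            "standard" => Some(SandboxTier::Standard),
            "elevated" => Some(SandboxTier::Elevated),
            _ => None,
        }
    }

    fn limits(self) -> TierLimits {
        match self {
            SandboxTier::Restricted => TierLimits {
                memory_bytes: 256 * MIB,
                timeout_seconds: 60,
                network_enabled: false,
                cpus: 1,
            },
            SandboxTier::Standard => TierLimits {
                memory_bytes: GIB,
                timeout_seconds: 300,
                network_enabled: false,
                cpus: 2,
            },
            SandboxTier::Elevated => TierLimits {
                memory_bytes: 8 * GIB,
                timeout_seconds: 600,
                network_enabled: true,
                cpus: 4,
            },
        }
    }
}
```

Returning `None` for unknown tier names lets the caller decide whether to reject the job or fall back to a stricter tier.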
Seccomp Profiles
The Island software ships four built-in seccomp profiles:
- Minimal: only `read`, `write`, `close`, `exit`, `brk`, `mmap`, `munmap`, `rt_sigreturn`, `futex`, `exit_group`
- Default: ~140 syscalls across 9 categories: file ops, process ops, memory, signals, time, scheduler, I/O multiplexing, threading, and safe miscellaneous calls
- GPU: Default profile plus `ioctl` and `mmap` rules for CUDA
- Network: Default profile plus 18 network syscalls (`socket`, `bind`, `listen`, `connect`, etc.)
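To make the relationship between tiers and profiles concrete, here is a small sketch. The Minimal syscall list is copied from the text; everything else is an assumption for illustration, in particular the rule that an `elevated` job gets the GPU profile when GPU devices are requested and the Network profile otherwise. The function and type names are hypothetical.

```rust
/// Syscalls allowed by the Minimal profile, as listed above.
const MINIMAL_SYSCALLS: [&str; 10] = [
    "read", "write", "close", "exit", "brk",
    "mmap", "munmap", "rt_sigreturn", "futex", "exit_group",
];

#[derive(Debug, PartialEq)]
enum SeccompProfile {
    Minimal,
    Default,
    Gpu,
    Network,
}

/// Returns true if the Minimal profile would permit the named syscall.
fn minimal_allows(syscall: &str) -> bool {
    MINIMAL_SYSCALLS.contains(&syscall)
}

/// Derives a profile from the tier name.
/// Assumption: `elevated` picks GPU when GPU passthrough is requested,
/// Network otherwise. Unknown tiers yield None.
fn profile_for(tier: &str, gpu_requested: bool) -> Option<SeccompProfile> {
    match tier {
        "restricted" => Some(SeccompProfile::Minimal),
        "standard" => Some(SeccompProfile::Default),
        "elevated" if gpu_requested => Some(SeccompProfile::Gpu),
        "elevated" => Some(SeccompProfile::Network),
        _ => None,
    }
}
```

An allowlist of this kind is default-deny: any syscall not named in the profile is blocked, which is why `socket` fails under Minimal.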
WASM Execution
WASM Cargos run in a Wasmtime sandbox with strict resource limits:
| Limit | Default | Purpose |
|---|---|---|
| Max memory | 256 MB | Prevents memory exhaustion |
| Max fuel | 100 billion | Instruction count limit (prevents infinite loops) |
| Timeout | 60 seconds | Wall-clock execution limit |
The executor verifies SHA256 hashes before running modules and caches them in ~/.archipelag/wasm-cache/. WASI is used for stdio — input is provided via a memory pipe, output is captured and streamed.
Output Streaming Format
Cargos communicate with the agent via JSON Lines on stdout. Each line is a WorkloadOutput message:
{"type": "status", "message": "Loading model..."}
{"type": "token", "content": "Hello"}
{"type": "token", "content": " world"}
{"type": "progress", "step": 3, "total": 20}
{"type": "image", "data": "<base64>", "format": "png", "width": 512, "height": 512}
{"type": "done", "usage": {"prompt_tokens": 42, "completion_tokens": 128}}
{"type": "error", "message": "Out of memory"}
| Type | Purpose |
|---|---|
| `status` | Informational messages (model loading, warmup) |
| `token` | Individual text tokens for LLM streaming |
| `progress` | Step progress for multi-step Cargos (e.g., image diffusion) |
| `image` | Completed image data (base64-encoded) |
| `done` | Job complete with optional usage statistics |
| `error` | Error message (job will be marked failed) |
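From the Cargo's side, producing this format means printing one JSON object per line on stdout. The sketch below shows one way a Rust Cargo might build those lines using only the standard library. The enum layout and helper names are assumptions (the `image` variant is omitted for brevity), and a real Cargo would more likely use a JSON library than hand-written escaping.

```rust
/// A subset of the message types from the table above.
enum WorkloadOutput {
    Status { message: String },
    Token { content: String },
    Progress { step: u32, total: u32 },
    /// Usage is optional: (prompt_tokens, completion_tokens).
    Done { usage: Option<(u32, u32)> },
    Error { message: String },
}

/// Escapes a string for use inside a JSON string literal.
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl WorkloadOutput {
    /// Renders the message as a single JSON line (no trailing newline).
    fn to_json_line(&self) -> String {
        match self {
            WorkloadOutput::Status { message } => {
                format!(r#"{{"type": "status", "message": "{}"}}"#, escape(message))
            }
            WorkloadOutput::Token { content } => {
                format!(r#"{{"type": "token", "content": "{}"}}"#, escape(content))
            }
            WorkloadOutput::Progress { step, total } => {
                format!(r#"{{"type": "progress", "step": {}, "total": {}}}"#, step, total)
            }
            WorkloadOutput::Done { usage: Some((prompt, completion)) } => format!(
                r#"{{"type": "done", "usage": {{"prompt_tokens": {}, "completion_tokens": {}}}}}"#,
                prompt, completion
            ),
            WorkloadOutput::Done { usage: None } => r#"{"type": "done"}"#.to_string(),
            WorkloadOutput::Error { message } => {
                format!(r#"{{"type": "error", "message": "{}"}}"#, escape(message))
            }
        }
    }
}
```

A Cargo would call `println!("{}", msg.to_json_line())` for each message. Escaping newlines matters here because a raw newline inside a string would split one message across two lines and break the line-oriented parser.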
Image Caching
The CacheManager tracks Docker images and warm Cargos to optimize job startup:
- Image cache: scans Docker images on startup, tracks `last_used` and `use_count`
- Warm Cargos: tracks recently run Cargo IDs for affinity-based placement
- LRU eviction: removes tracking entries when `max_cached_images` (default 20) is exceeded
- Stale cleanup: warm Cargo entries expire after `warm_ttl_seconds` (default 3600)
- Warmth score: 0-100 composite score reported in heartbeats (warm Cargo +50, cached image +30, use count up to +20)
The cache manager reports stats in enhanced heartbeats so the coordinator can prefer Islands with warm containers for a given Cargo.
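The warmth score weights described above can be sketched as a small function. The weights (+50, +30, up to +20) come from the text; the function names are hypothetical, and the assumption that each recorded use contributes one point until the 20-point cap is not confirmed by the source.

```rust
/// Default TTL for warm Cargo entries, per the list above.
const WARM_TTL_SECONDS: u64 = 3600;

/// A warm entry is still valid if it was used within the TTL.
fn is_warm(seconds_since_last_run: u64, warm_ttl_seconds: u64) -> bool {
    seconds_since_last_run < warm_ttl_seconds
}

/// Composite 0-100 score.
/// Assumption: use count adds one point per use, capped at 20.
fn warmth_score(warm_cargo: bool, image_cached: bool, use_count: u32) -> u8 {
    let mut score: u32 = 0;
    if warm_cargo {
        score += 50;
    }
    if image_cached {
        score += 30;
    }
    score += use_count.min(20);
    score as u8 // at most 50 + 30 + 20 = 100, so the cast cannot truncate
}
```

For example, an Island that has the image cached and has run it five times, but whose warm entry has expired, would report `warmth_score(is_warm(4000, WARM_TTL_SECONDS), true, 5)`, which is 35 under these assumptions.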
Security Features
Registry Allowlist
Before pulling any container image, the Island software checks the image reference against a configurable allowlist. Default allowed registries:
- `ghcr.io/archipelag-io`
- `docker.io/archipelag`
- `docker.io/library`
Images from unrecognized registries are rejected. The allowlist can also require digest pinning (@sha256:...) for additional integrity.
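A prefix-based allowlist check with optional digest pinning might look like the following. This is a sketch, not the code in `security/registry.rs`: the function names are hypothetical, it expects fully qualified image references (no Docker Hub shorthand expansion), and it assumes each allowlist entry must match on a path boundary so that `docker.io/archipelag-evil/...` does not pass as `docker.io/archipelag`.

```rust
const DEFAULT_ALLOWLIST: [&str; 3] = [
    "ghcr.io/archipelag-io",
    "docker.io/archipelag",
    "docker.io/library",
];

/// True if the image lives under one of the allowed registry prefixes.
/// The prefix must be followed by '/', so lookalike namespaces are rejected.
fn is_allowed(image: &str, allowlist: &[&str]) -> bool {
    allowlist.iter().any(|prefix| match image.strip_prefix(*prefix) {
        Some(rest) => rest.starts_with('/'),
        None => false,
    })
}

/// True if the reference ends in a well-formed `@sha256:<64 hex chars>` digest.
fn is_digest_pinned(image: &str) -> bool {
    match image.split_once("@sha256:") {
        Some((_, digest)) => {
            digest.len() == 64 && digest.chars().all(|c| c.is_ascii_hexdigit())
        }
        None => false,
    }
}

/// Combined check; `require_digest` models the optional pinning rule.
fn check_image(image: &str, allowlist: &[&str], require_digest: bool) -> Result<(), String> {
    if !is_allowed(image, allowlist) {
        return Err(format!("registry not allowed: {}", image));
    }
    if require_digest && !is_digest_pinned(image) {
        return Err(format!("image must be pinned by digest: {}", image));
    }
    Ok(())
}
```

Checking the allowlist before any pull means an untrusted reference is rejected without network traffic or disk use.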
Signature Verification
For trust level 2+ Cargos, the Island software verifies cosign signatures using trusted public keys fetched from the coordinator. Keys are cached locally for offline resilience. If verification fails and require_signature is set, the job is rejected.
Self-Update
The Island software checks for updates every 30 minutes. The update flow:
- Query coordinator for latest version
- Download new binary with progress tracking
- Verify Ed25519 signature and SHA256 checksum
- Wait for active jobs to complete (5 min normal, 30s for critical updates)
- Back up current binary to `~/.archipelag/backups/`
- Replace binary and exec into new version
Next Steps
Cargos
Understand the Cargo model — runtime types, trust levels, and I/O protocol.
For Islands
Get started running the Island software on your machine.
{% end %}
