Island

Architecture of the Archipelag.io Island software — the Rust service that executes Cargos on Islands

The Island software is a Rust service that runs on Islands. It registers with the coordinator, receives job assignments via NATS, executes Cargos in sandboxed containers or WASM, and streams results back in real time.

Module Map

node-agent/src/
├── main.rs           # CLI entry point, tracing setup, mode dispatch
├── agent.rs          # Agent struct, main event loop, job execution
├── config.rs         # Configuration loading (TOML + env)
├── nats.rs           # NATS client, subjects, message types
├── messages.rs       # Workload I/O protocol (ChatInput, WorkloadOutput)
├── docker.rs         # Container execution, sandbox tiers, seccomp
├── wasm.rs           # WASM execution via Wasmtime
├── executor.rs       # Test job helper (--test-job mode)
├── cache.rs          # Image caching, warm Cargo tracking
├── state.rs          # Persistent agent state, WASM module cache
├── model_cache.rs    # ML model download, HF URI resolution, LRU eviction
├── preload.rs        # Hardware-aware startup model preloading
├── onnx.rs           # Native ONNX inference (feature-gated: onnx)
├── gguf.rs           # Native GGUF/llama.cpp inference (feature-gated: gguf)
├── diffusers.rs      # Native Stable Diffusion via candle (feature-gated: diffusers)
├── metrics/
│   ├── mod.rs        # System metrics (CPU, RAM, disk, network)
│   ├── gpu.rs        # GPU metrics via nvidia-smi
│   └── container.rs  # Per-container resource usage
├── security/
│   ├── mod.rs        # Security module re-exports
│   ├── registry.rs   # Registry allowlist enforcement
│   ├── seccomp.rs    # Seccomp profiles (Minimal/Default/GPU/Network)
│   ├── signing.rs    # Cosign signature verification
│   └── tls.rs        # TLS certificate pinning (stubbed)
└── update/
    ├── mod.rs        # Update checker (polls coordinator)
    ├── download.rs   # Binary download with progress tracking
    ├── verify.rs     # Ed25519 + SHA256 binary verification
    └── restart.rs    # Graceful restart with backup and rollback

Agent Lifecycle

The Agent struct is the central orchestrator. On startup it:

  1. Generates or loads a host ID — persisted to ~/.archipelag/state.json
  2. Connects to Docker — via local socket (/var/run/docker.sock)
  3. Initializes the cache manager — scans existing Docker images, optionally pre-pulls configured images
  4. Loads signing keys — fetches trusted public keys from the coordinator, falls back to disk cache
  5. Configures the registry allowlist — defaults to ghcr.io/archipelag-io, docker.io/archipelag, docker.io/library
  6. Connects to NATS — with infinite reconnect and retry on initial connect
  7. Registers with the coordinator — sends capabilities (CPU, RAM, GPU, platform)
  8. Enters the main event loop

Main Event Loop

The agent runs a tokio::select! loop that handles multiple concurrent concerns:

┌──────────────────────────────────────────────────┐
│                  Agent Event Loop                │
│                                                  │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────┐ │
│  │  Heartbeat   │  │ Job Assign   │  │ Cancel  │ │
│  │  (10s tick)  │  │ (NATS sub)   │  │ (NATS)  │ │
│  └──────┬───────┘  └──────┬───────┘  └────┬────┘ │
│         │                 │               │      │
│  ┌──────┴───────┐  ┌──────┴───────┐       │      │
│  │ Cache Cleanup│  │ Job Complete │       │      │
│  │  (5 min)     │  │  (mpsc rx)   │       │      │
│  └──────────────┘  └──────────────┘       │      │
│                                           │      │
│  ┌──────────────┐  ┌──────────────┐       │      │
│  │ Update Check │  │   Ctrl+C     │       │      │
│  │  (30 min)    │  │  (shutdown)  │       │      │
│  └──────────────┘  └──────────────┘       │      │
└──────────────────────────────────────────────────┘

| Event | Interval | Action |
| --- | --- | --- |
| Heartbeat | 10 seconds | Sends enhanced heartbeat with system/GPU/cache metrics; reconnects after 3 failures |
| Job assignment | On message | Spawns job execution task, tracks in active_jobs counter |
| Job completion | On channel | Decrements active_jobs, publishes final status |
| Cancel request | On message | Signals cancellation via watch channel to running job |
| Cache cleanup | 5 minutes | Evicts LRU images, removes stale warm Cargo entries |
| Update check | 30 minutes | Polls coordinator for new agent version |
| Shutdown | Ctrl+C | Sets shutdown flag, waits for active jobs |
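
The event table can be read as a dispatch over event kinds. The sketch below models that dispatch with a plain enum and a std::sync::mpsc channel in place of tokio::select!, so it runs without an async runtime. The names Event, LoopState, handle, and run_loop are illustrative and not taken from agent.rs, and ignoring new assignments once shutdown has begun is an assumption.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical event kinds, roughly one per row of the event table.
#[derive(Debug)]
pub enum Event {
    Heartbeat,
    JobAssigned(String),
    JobCompleted(String),
    Cancel(String),
    Shutdown,
}

/// Minimal loop state; the real Agent struct holds far more.
#[derive(Debug, Default)]
pub struct LoopState {
    pub active_jobs: usize,
    pub heartbeats: u64,
    pub cancelled: Vec<String>,
    pub shutting_down: bool,
}

/// Applies one event and returns true when the loop should exit.
pub fn handle(state: &mut LoopState, event: Event) -> bool {
    match event {
        Event::Heartbeat => state.heartbeats += 1,
        // Assumption: no new work is accepted after the shutdown flag is set.
        Event::JobAssigned(_) if state.shutting_down => {}
        Event::JobAssigned(_) => state.active_jobs += 1,
        Event::JobCompleted(_) => state.active_jobs = state.active_jobs.saturating_sub(1),
        Event::Cancel(job_id) => state.cancelled.push(job_id),
        Event::Shutdown => state.shutting_down = true,
    }
    // Shutdown waits for active jobs before the loop ends.
    state.shutting_down && state.active_jobs == 0
}

/// Drains events until shutdown completes or all senders are dropped.
pub fn run_loop(events: Receiver<Event>) -> LoopState {
    let mut state = LoopState::default();
    for event in events {
        if handle(&mut state, event) {
            break;
        }
    }
    state
}
```

Keeping the per-event logic in a pure function like handle makes the counter and shutdown rules testable without NATS or Docker; the async loop would only need to translate each select! branch into one Event.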

Heartbeat recovery

If heartbeat publishing fails 3 times consecutively, the agent resubscribes to NATS using core subscriptions with exponential backoff (capped at 30 seconds) plus jitter.
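
A minimal sketch of capped exponential backoff with jitter follows. Only the 30-second cap comes from the text above; the 1-second base delay, the 25% jitter ceiling, and the function names are assumptions made for illustration. The random value is passed in by the caller so the example needs no external crate.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based): base * 2^attempt, capped at `cap`.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}

/// Adds up to 25% of `delay` as jitter (assumed ratio).
/// `unit_random` is a caller-supplied value in [0, 1], e.g. from an RNG.
pub fn with_jitter(delay: Duration, unit_random: f64) -> Duration {
    let unit = unit_random.clamp(0.0, 1.0);
    delay + delay.mul_f64(0.25 * unit)
}
```

With a 1-second base, the delays run 1s, 2s, 4s, 8s, 16s, and then stay at 30s. Jitter is applied after the cap here, so a single wait can slightly exceed 30 seconds; whether the agent caps before or after jitter is not stated in this page.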

Job Execution Pipeline

When a job assignment arrives, the agent follows this pipeline:

Job received (NATS)
        │
        ▼
  Parse AssignJob message
        │
        ▼
  Publish "started" status
        │
        ├── runtime_type == "wasm"
        │         │
        │         ▼
        │   Download/cache WASM ──► Verify SHA256 hash
        │         │
        │         ▼
        │   WasmExecutor.run()
        │         │
        │         ▼
        │   Process output stream
        │
        ├── runtime_type == "container" (default)
        │         │
        │         ▼
        │   Registry allowlist check ──► Reject if not allowed
        │         │
        │         ▼
        │   Cache: ensure_image() ──► Pull if not cached
        │         │
        │         ▼
        │   Verify signature (cosign) ──► Skip/fail per config
        │         │
        │         ▼
        │   Apply sandbox tier ──► Set resource limits + seccomp
        │         │
        │         ▼
        │   run_verified_container()
        │         │
        │         ▼
        │   Stream output (JSON Lines)
        │
        ▼
  Publish final status (succeeded/failed)

During execution, the agent also handles:

  • Lease renewal every 30 seconds (60-second extensions, up to 5 total)
  • Cancel requests via a watch channel — stops the container/WASM
  • Output streaming — each JSON line is parsed and published to NATS
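
The lease rules above can be sketched as a small bookkeeping type. The 30-second interval, 60-second extension, and limit of 5 come from the list; the initial lease length, the choice to measure the extension from the renewal time, and the Lease type itself are assumptions.

```rust
use std::time::Duration;

pub const RENEW_INTERVAL: Duration = Duration::from_secs(30);
pub const EXTENSION: Duration = Duration::from_secs(60);
pub const MAX_RENEWALS: u32 = 5;

/// Hypothetical lease tracker; times are measured as elapsed time since job start.
pub struct Lease {
    renewals: u32,
    expires_at: Duration,
}

impl Lease {
    /// `initial` is the lease length granted at assignment (not specified in the docs).
    pub fn new(initial: Duration) -> Self {
        Lease { renewals: 0, expires_at: initial }
    }

    /// Attempts a renewal at time `now`. Returns false once the renewal
    /// budget is spent or the lease has already lapsed.
    pub fn renew(&mut self, now: Duration) -> bool {
        if self.renewals >= MAX_RENEWALS || now >= self.expires_at {
            return false;
        }
        self.renewals += 1;
        // Assumption: each renewal extends the lease to now + 60 seconds.
        self.expires_at = now + EXTENSION;
        true
    }

    pub fn expires_at(&self) -> Duration {
        self.expires_at
    }

    pub fn renewals(&self) -> u32 {
        self.renewals
    }
}
```

Under these assumptions, a job that renews at every RENEW_INTERVAL tick starting from a 60-second lease is covered until 210 seconds after start, after which the sixth renewal is refused.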

Container Execution

Containers are executed via the Bollard Docker client with security controls applied per sandbox tier.

ContainerConfig

| Field | Default | Purpose |
| --- | --- | --- |
| image | | Container image reference |
| input | | JSON input passed via stdin |
| gpu_devices | none | GPU device passthrough (e.g., all) |
| timeout_seconds | 300 | Maximum execution time |
| memory_bytes | 8 GB | Memory limit |
| read_only_rootfs | true | Immutable root filesystem |
| tmpfs_mounts | 256 MB at /tmp | Writable temporary storage |
| cpu_quota | none | CPU time limit (microseconds per 100ms) |
| network_disabled | true | No network access by default |
| sandbox_tier | standard | Security tier (see below) |
| seccomp_profile | auto | Seccomp syscall filter (derived from tier) |

Sandbox Tiers

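| Tier | Memory | Timeout | Network | CPUs | Seccomp Profile |
| --- | --- | --- | --- | --- | --- |
| restricted | 256 MB | 60s | Off | 1 | Minimal (~10 syscalls) |
| standard | 1 GB | 300s | Off | 2 | Default (~140 syscalls) |
| elevated | 8 GB | 600s | On | 4 | GPU or Network profile |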
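
One way to express the tier table in code is a lookup from tier to limits. This is a sketch, not the contents of docker.rs: the type and method names are hypothetical, binary units (1 MB = 1024 * 1024 bytes) are assumed, and the rule that an elevated job gets the GPU profile when it requests a GPU and the Network profile otherwise is a guess at what "GPU or Network profile" means.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SandboxTier {
    Restricted,
    Standard,
    Elevated,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SeccompProfile {
    Minimal,
    Default,
    Gpu,
    Network,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TierLimits {
    pub memory_bytes: u64,
    pub timeout_seconds: u64,
    pub network: bool,
    pub cpus: u32,
}

const MB: u64 = 1024 * 1024; // assumed binary units
const GB: u64 = 1024 * MB;

impl SandboxTier {
    /// Resource limits per tier, copied from the table above.
    pub fn limits(self) -> TierLimits {
        match self {
            SandboxTier::Restricted => TierLimits { memory_bytes: 256 * MB, timeout_seconds: 60, network: false, cpus: 1 },
            SandboxTier::Standard => TierLimits { memory_bytes: GB, timeout_seconds: 300, network: false, cpus: 2 },
            SandboxTier::Elevated => TierLimits { memory_bytes: 8 * GB, timeout_seconds: 600, network: true, cpus: 4 },
        }
    }

    /// Derives the seccomp profile from the tier.
    /// Assumption: elevated jobs use Gpu when a GPU is requested, Network otherwise.
    pub fn seccomp_profile(self, wants_gpu: bool) -> SeccompProfile {
        match self {
            SandboxTier::Restricted => SeccompProfile::Minimal,
            SandboxTier::Standard => SeccompProfile::Default,
            SandboxTier::Elevated if wants_gpu => SeccompProfile::Gpu,
            SandboxTier::Elevated => SeccompProfile::Network,
        }
    }
}
```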

Seccomp Profiles

The Island software ships four built-in seccomp profiles:

  • Minimal — Only read, write, close, exit, brk, mmap, munmap, rt_sigreturn, futex, exit_group
  • Default — ~140 syscalls across 9 categories: file ops, process ops, memory, signals, time, scheduler, I/O multiplexing, threading, and safe miscellaneous calls
  • GPU — Default profile plus ioctl and mmap rules for CUDA
  • Network — Default profile plus 18 network syscalls (socket, bind, listen, connect, etc.)

WASM Execution

WASM Cargos run in a Wasmtime sandbox with strict resource limits:

| Limit | Default | Purpose |
| --- | --- | --- |
| Max memory | 256 MB | Prevents memory exhaustion |
| Max fuel | 100 billion | Instruction count limit (prevents infinite loops) |
| Timeout | 60 seconds | Wall-clock execution limit |

The executor verifies SHA256 hashes before running modules and caches them in ~/.archipelag/wasm-cache/. WASI is used for stdio — input is provided via a memory pipe, output is captured and streamed.

Output Streaming Format

Cargos communicate with the agent via JSON Lines on stdout. Each line is a WorkloadOutput message:

{"type": "status", "message": "Loading model..."}
{"type": "token", "content": "Hello"}
{"type": "token", "content": " world"}
{"type": "progress", "step": 3, "total": 20}
{"type": "image", "data": "<base64>", "format": "png", "width": 512, "height": 512}
{"type": "done", "usage": {"prompt_tokens": 42, "completion_tokens": 128}}
{"type": "error", "message": "Out of memory"}

| Type | Purpose |
| --- | --- |
| status | Informational messages (model loading, warmup) |
| token | Individual text tokens for LLM streaming |
| progress | Step progress for multi-step Cargos (e.g., image diffusion) |
| image | Completed image data (base64-encoded) |
| done | Job complete with optional usage statistics |
| error | Error message (job will be marked failed) |
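
From the Cargo side, producing this format only requires writing one JSON object per line to stdout. The sketch below hand-rolls the serialization with the standard library so it stays dependency-free; a real Cargo would more likely use a JSON library. The enum is a hypothetical mirror of WorkloadOutput, not the definition in messages.rs, and it omits the image variant for brevity.

```rust
/// Hypothetical subset of the WorkloadOutput message types.
pub enum WorkloadOutput {
    Status { message: String },
    Token { content: String },
    Progress { step: u32, total: u32 },
    /// Usage is (prompt_tokens, completion_tokens) and is optional.
    Done { usage: Option<(u64, u64)> },
    Error { message: String },
}

/// Escapes a string for use inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl WorkloadOutput {
    /// Renders the message as a single JSON line (no trailing newline).
    pub fn to_json_line(&self) -> String {
        match self {
            WorkloadOutput::Status { message } => {
                format!(r#"{{"type": "status", "message": "{}"}}"#, escape_json(message))
            }
            WorkloadOutput::Token { content } => {
                format!(r#"{{"type": "token", "content": "{}"}}"#, escape_json(content))
            }
            WorkloadOutput::Progress { step, total } => {
                format!(r#"{{"type": "progress", "step": {}, "total": {}}}"#, step, total)
            }
            WorkloadOutput::Done { usage: Some((prompt, completion)) } => format!(
                r#"{{"type": "done", "usage": {{"prompt_tokens": {}, "completion_tokens": {}}}}}"#,
                prompt, completion
            ),
            WorkloadOutput::Done { usage: None } => r#"{"type": "done"}"#.to_string(),
            WorkloadOutput::Error { message } => {
                format!(r#"{{"type": "error", "message": "{}"}}"#, escape_json(message))
            }
        }
    }
}
```

A Cargo would print each rendered line with println! and flush stdout after every message so the agent can forward tokens as they are produced.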

Image Caching

The CacheManager tracks Docker images and warm Cargos to optimize job startup:

  • Image cache — Scans Docker images on startup, tracks last_used and use_count
  • Warm Cargos — Tracks recently-run Cargo IDs for affinity-based placement
  • LRU eviction — Removes tracking entries when max_cached_images (default 20) is exceeded
  • Stale cleanup — Warm Cargo entries expire after warm_ttl_seconds (default 3600)
  • Warmth score — 0-100 composite score reported in heartbeats (warm Cargo +50, cached image +30, use count up to +20)
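
The scoring and eviction rules above could look roughly like the following. The +50, +30, and +20 weights come from the list; mapping use count to points one-for-one up to 20 is an assumption, as are the function names and the use of a plain integer timestamp for last_used.

```rust
/// Composite warmth score in the range 0..=100.
pub fn warmth_score(is_warm_cargo: bool, image_cached: bool, use_count: u32) -> u8 {
    let warm: u8 = if is_warm_cargo { 50 } else { 0 };
    let cached: u8 = if image_cached { 30 } else { 0 };
    // Assumption: one point per recorded use, saturating at 20.
    let usage = use_count.min(20) as u8;
    warm + cached + usage
}

/// Returns the names of the least recently used entries that exceed `max_cached`.
/// Each entry is (image name, last_used timestamp); smaller timestamps are older.
pub fn lru_evictions(mut entries: Vec<(String, u64)>, max_cached: usize) -> Vec<String> {
    entries.sort_by_key(|(_, last_used)| *last_used); // oldest first
    let excess = entries.len().saturating_sub(max_cached);
    entries
        .into_iter()
        .take(excess)
        .map(|(name, _)| name)
        .collect()
}
```

For example, an Island that recently ran a Cargo and still has its image cached after 7 uses would report a score of 87 under this mapping.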

The cache manager reports stats in enhanced heartbeats so the coordinator can prefer Islands with warm containers for a given Cargo.

Security Features

Registry Allowlist

Before pulling any container image, the Island software checks the image reference against a configurable allowlist. Default allowed registries:

  • ghcr.io/archipelag-io
  • docker.io/archipelag
  • docker.io/library

Images from unrecognized registries are rejected. The allowlist can also require digest pinning (@sha256:...) for additional integrity.
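
A prefix-based allowlist check might be sketched as follows. The three default entries come from the list above; the function name, the require_digest flag, and the assumption that image references arrive fully qualified (a short name such as alpine would need normalizing to docker.io/library/alpine first) are illustrative and not taken from registry.rs.

```rust
pub const DEFAULT_ALLOWLIST: [&str; 3] = [
    "ghcr.io/archipelag-io",
    "docker.io/archipelag",
    "docker.io/library",
];

/// Returns true if `image` lives under an allowed registry prefix and,
/// when `require_digest` is set, is pinned with an @sha256: digest.
pub fn is_allowed(image: &str, allowlist: &[&str], require_digest: bool) -> bool {
    let in_registry = allowlist.iter().any(|prefix| {
        // Require a path separator after the prefix so that
        // "docker.io/archipelag-evil/..." does not match "docker.io/archipelag".
        image
            .strip_prefix(*prefix)
            .map_or(false, |rest| rest.starts_with('/'))
    });
    let pinned = image.contains("@sha256:");
    in_registry && (pinned || !require_digest)
}
```

The separator check matters because a bare starts_with comparison would accept look-alike namespaces that merely share a leading string with an allowed entry.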

Signature Verification

For trust level 2+ Cargos, the Island software verifies cosign signatures using trusted public keys fetched from the coordinator. Keys are cached locally for offline resilience. If verification fails and require_signature is set, the job is rejected.
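
The decision described above can be written as a small policy function. This sketch covers only the decision, not cosign verification itself. The trust-level threshold and the require_signature behaviour come from the paragraph; letting a job proceed unverified when verification fails and require_signature is off is an assumption, and the type and function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SignatureOutcome {
    /// Trust level below 2: no signature check is performed.
    NotRequired,
    /// Signature checked and valid.
    Verified,
    /// Signature invalid or missing, but the config does not enforce it (assumed behaviour).
    ProceedUnverified,
    /// Signature invalid or missing and require_signature is set.
    Rejected,
}

pub fn signature_policy(
    trust_level: u8,
    signature_valid: bool,
    require_signature: bool,
) -> SignatureOutcome {
    if trust_level < 2 {
        return SignatureOutcome::NotRequired;
    }
    match (signature_valid, require_signature) {
        (true, _) => SignatureOutcome::Verified,
        (false, true) => SignatureOutcome::Rejected,
        (false, false) => SignatureOutcome::ProceedUnverified,
    }
}
```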

Self-Update

The Island software checks for updates every 30 minutes. The update flow:

  1. Query coordinator for latest version
  2. Download new binary with progress tracking
  3. Verify Ed25519 signature and SHA256 checksum
  4. Wait for active jobs to complete (5 min normal, 30s for critical updates)
  5. Back up current binary to ~/.archipelag/backups/
  6. Replace binary and exec into new version

Next Steps

  • Coordinator (/architecture/coordinator/): Learn how the coordinator orchestrates Islands and dispatches jobs.
  • Cargos: Understand the Cargo model, including runtime types, trust levels, and I/O protocol.
  • For Islands: Get started running the Island software on your machine.