Coordinator
Architecture of the Archipelag.io coordinator — the control plane that orchestrates Islands, jobs, and billing
The coordinator is the control plane for Archipelag.io. Built with Elixir and Phoenix, it handles Island registration, job dispatch, billing, karma tracking, and real-time streaming — all supervised by OTP for fault tolerance.
Supervision Tree
The coordinator starts as a standard OTP application under a :one_for_one supervisor:
Coordinator.Supervisor (:one_for_one)
│
├── CoordinatorWeb.Telemetry # Phoenix telemetry handlers
├── Coordinator.Repo # Ecto/PostgreSQL connection pool
├── Coordinator.PromEx # Prometheus metrics collection
├── DNSCluster # Fly.io DNS-based clustering
├── Phoenix.PubSub # In-memory pub/sub for real-time events
├── Oban # Background job queue (PostgreSQL-backed)
├── Jobs.OutputAccumulator # Aggregates streaming output chunks
├── CoordinatorWeb.Endpoint # Phoenix HTTP/WebSocket server
│
└── Nats.Supervisor (conditional) # NATS message bus connection
    ├── Nats.HostHandler # Host registration + heartbeats
    ├── Nats.JobHandler # Job output + progress streaming
    └── Nats.StatusConsumer # JetStream pull consumer for job status
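The tree above could be assembled roughly as follows. This is a minimal sketch, not the coordinator's actual `Application` module: every child is replaced by a placeholder `Agent` so the example runs without Phoenix, Ecto, or Oban, the `Nats.Supervisor` subtree is flattened to a single child, and the `nats_enabled?` flag is an assumed stand-in for whatever condition the real application checks.

```elixir
defmodule SupervisionSketch do
  # Hypothetical stand-in child: an Agent registered under the given name.
  defmodule Placeholder do
    def child_spec(name) do
      %{id: name, start: {Agent, :start_link, [fn -> name end, [name: name]]}}
    end
  end

  # Children that always start, in the order shown in the tree.
  @core [
    CoordinatorWeb.Telemetry,
    Coordinator.Repo,
    Coordinator.PromEx,
    DNSCluster,
    Phoenix.PubSub,
    Oban,
    Jobs.OutputAccumulator,
    CoordinatorWeb.Endpoint
  ]

  # Appends the NATS subtree only when it is enabled (assumed boolean flag).
  def children(nats_enabled?) do
    names = if nats_enabled?, do: @core ++ [Nats.Supervisor], else: @core
    Enum.map(names, &{Placeholder, &1})
  end

  # With :one_for_one, a crashed child is restarted without touching its siblings.
  def start(nats_enabled?) do
    Supervisor.start_link(children(nats_enabled?), strategy: :one_for_one)
  end
end
```

Calling `SupervisionSketch.start(false)` starts the eight core children only; passing `true` adds the NATS placeholder as a ninth.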
Context Modules
The coordinator follows Phoenix’s context pattern — each domain area has a dedicated module that encapsulates its business logic and database access.
| Context | Module | Responsibility |
|---|---|---|
| Jobs | Coordinator.Jobs | Job lifecycle: submit, dispatch, assign, status transitions, lease management, retry, cancellation |
| Hosts | Coordinator.Hosts | Island registration, heartbeats, capability matching, warm container tracking, pairing, suspension |
| Karma | Coordinator.Karma | Reputation scoring, compute-time earnings, penalties, monetization threshold, metro settings |
| Billing | Coordinator.Billing | Credit charges (atomic), refunds, dynamic pricing, Island payouts, quota checks |
| Accounts | Coordinator.Accounts | User accounts, authentication, API keys, magic links |
| Workloads | Coordinator.Workloads | Cargo catalog, requirements matching, runtime types |
| Cargo Registry | Coordinator.Marketplace | Cargo submissions, reviews, security scanning, publishing |
| Verification | Coordinator.Verification | KYC via Stripe Identity, admin verification dashboards |
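As an illustration of the kind of logic a context encapsulates, the sketch below guards job status transitions. The state names come from elsewhere on this page (queued, assigned, started, succeeded, failed, plus cancellation), but the transition table itself is an assumption, and `JobStateSketch` is a hypothetical module rather than part of `Coordinator.Jobs`.

```elixir
defmodule JobStateSketch do
  # Assumed transition table; states absent from the map are treated as terminal.
  @transitions %{
    queued: [:assigned, :cancelled],
    # An assigned job may go back to :queued when its lease expires.
    assigned: [:started, :queued, :failed, :cancelled],
    started: [:succeeded, :failed, :cancelled]
  }

  # Returns {:ok, updated_job} for an allowed move, otherwise a tagged error.
  def transition(%{state: from} = job, to) do
    if to in Map.get(@transitions, from, []) do
      {:ok, %{job | state: to}}
    else
      {:error, {:invalid_transition, from, to}}
    end
  end
end
```

Keeping the table in one place means callers such as dispatch, retry, and cancellation cannot move a job into a state the lifecycle does not allow.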
NATS Message Handlers
The coordinator communicates with Islands through NATS. Three GenServer/consumer processes handle different message types:
HostHandler
Subscribes to core NATS subjects for Island lifecycle:
| Subject | Purpose |
|---|---|
| coordinator.hosts.register | Island registration — upserts host record, broadcasts :host_online |
| coordinator.hosts.pairing | Pairing code generation (request/reply) |
| host.*.heartbeat | Heartbeat updates — refreshes last_heartbeat_at and active_jobs |
| host.*.lease | Lease renewal requests from Islands executing jobs |
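In NATS, `*` matches exactly one dot-separated token, so `host.*.heartbeat` receives heartbeats from every Island while the middle token identifies the sender. The sketch below shows that matching rule in plain Elixir. It is illustrative only: the NATS server does the matching in practice, and `SubjectSketch` and the sample Island ID are hypothetical.

```elixir
defmodule SubjectSketch do
  # Returns {:ok, tokens_bound_to_wildcards} when the subject matches, else :nomatch.
  def match(pattern, subject) do
    do_match(String.split(pattern, "."), String.split(subject, "."), [])
  end

  # Both lists exhausted together: the subject matched.
  defp do_match([], [], acc), do: {:ok, Enum.reverse(acc)}

  # "*" consumes exactly one non-empty token and records it.
  defp do_match(["*" | rest_p], [token | rest_s], acc) when token != "" do
    do_match(rest_p, rest_s, [token | acc])
  end

  # Literal tokens must be identical.
  defp do_match([same | rest_p], [same | rest_s], acc), do: do_match(rest_p, rest_s, acc)

  # Anything else (length mismatch, differing literal) does not match.
  defp do_match(_, _, _), do: :nomatch
end
```

For example, `SubjectSketch.match("host.*.heartbeat", "host.island-42.heartbeat")` returns `{:ok, ["island-42"]}`, which is how a handler could recover the Island ID from the subject.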
JobHandler
Subscribes to Island output subjects for real-time streaming:
| Subject | Purpose |
|---|---|
| host.*.output | Text token chunks — accumulated and broadcast to WebSocket via PubSub |
| host.*.progress | Progress steps (e.g., image generation step 3/20) |
| host.*.image | Completed image data with format, dimensions, and seed |
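Accumulating token chunks per job can be sketched with a plain map of iodata. This is an assumption about the general approach, not the implementation of `Jobs.OutputAccumulator`; the module and function names are hypothetical, and ordering is taken to be arrival order.

```elixir
defmodule AccumulatorSketch do
  # State is a map of job_id => iodata built from the chunks received so far.
  def new, do: %{}

  # Appends a chunk without copying earlier output (nested lists are valid iodata).
  def append(acc, job_id, chunk) do
    Map.update(acc, job_id, [chunk], &[&1, chunk])
  end

  # Flattens the accumulated chunks into a single binary.
  def text(acc, job_id) do
    acc |> Map.get(job_id, []) |> IO.iodata_to_binary()
  end
end
```

Chunks for different jobs interleave freely: appending `"Hel"` and `"lo"` for one job, with another job's chunk in between, still yields `"Hello"` for the first.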
StatusConsumer
A JetStream pull consumer on the JOB_STATUS stream. Processes durable job status messages (started, succeeded, failed) with at-least-once delivery guarantees. Malformed messages are acknowledged to prevent infinite redelivery.
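At-least-once delivery means the same status message can arrive more than once, so handling has to be idempotent. The sketch below shows one way to combine that with acknowledging malformed messages. The payload keys (`"job_id"`, `"status"`) and the `seen` set are assumptions for illustration; the real consumer presumably relies on job state in the database rather than an in-memory set.

```elixir
defmodule StatusSketch do
  @known ~w(started succeeded failed)

  # `seen` is a MapSet of {job_id, status} pairs that were already applied.
  # Every clause acknowledges; only the second element says what happened.
  def handle(%{"job_id" => job_id, "status" => status}, seen)
      when is_binary(job_id) and status in @known do
    key = {job_id, status}

    if MapSet.member?(seen, key) do
      # Redelivery of a message that was already processed: ack, change nothing.
      {:ack, :duplicate, seen}
    else
      {:ack, :applied, MapSet.put(seen, key)}
    end
  end

  # Malformed payloads would fail on every redelivery, so ack and drop them.
  def handle(_malformed, seen), do: {:ack, :discarded, seen}
end
```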
Background Workers
Oban provides reliable background job processing backed by PostgreSQL. Workers are scheduled via cron or enqueued on demand.
| Worker | Queue | Schedule | Purpose |
|---|---|---|---|
| HostHealthWorker | maintenance | Every 1 min | Marks Islands offline if heartbeat exceeds 30s timeout |
| JobLeaseWorker | maintenance | Every 1 min | Expires leases on assigned jobs, penalizes karma, requeues |
| DeadLetterWorker | maintenance | Every 5 min | Fails jobs stuck in assigned state past threshold |
| JobRetryWorker | maintenance | Every 5 min | Retries jobs waiting in queued state |
| AuthCleanupWorker | maintenance | Every 1 hour | Deletes expired magic link tokens and sessions |
| ReputationWorker | maintenance | Every 1 hour | Auto-suspends low-reputation Cargos, flags for review |
| RescanScheduler | maintenance | Weekly (Sun 3 AM) | Triggers security rescans for Cargos not scanned in 7+ days |
| PayoutBatchWorker | payouts | Weekly (Sun 2 AM) | Processes Stripe Connect payouts for eligible Islands |
| WorkloadScanWorker | security_scans | On demand | Runs trivy + grype container image scans |
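The check performed by HostHealthWorker can be sketched as a pure function over in-memory host maps. The 30-second timeout and the `last_heartbeat_at` field come from this page; the `:online` and `:offline` status atoms and the module name are assumptions, and the real worker would run the equivalent as a database query.

```elixir
defmodule HealthSketch do
  @timeout_seconds 30

  # A host is stale when its last heartbeat is older than the timeout.
  def stale?(%{last_heartbeat_at: last}, now) do
    DateTime.diff(now, last, :second) > @timeout_seconds
  end

  # Flips stale online hosts to :offline and leaves all others untouched.
  def mark_offline(hosts, now) do
    Enum.map(hosts, fn host ->
      if host.status == :online and stale?(host, now) do
        %{host | status: :offline}
      else
        host
      end
    end)
  end
end
```

Because the worker runs once per minute, an Island whose heartbeats stop can stay marked online for up to about 90 seconds in the worst case (30 seconds of timeout plus the wait for the next run).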
Database Schema
The coordinator uses PostgreSQL with Ecto. Key tables and their primary key types:
| Table | PK Type | Key Fields |
|---|---|---|
| users | integer | email, credits, kyc_status, banned |
| hosts | binary_id (UUID) | name, status, region, karma, capabilities, last_heartbeat_at |
| jobs | binary_id (UUID) | state, user_id, host_id, workload_id, metadata, lease_expires_at |
| workloads | integer | name, slug, runtime_type, trust_level, sandbox_tier, pricing fields |
| karma_events | integer | host_id, job_id, event_type, delta, balance_after |
| transactions | integer | user_id, job_id, amount, type (charge/refund) |
| workload_submissions | integer | developer_id, status, scan results, signature fields |
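The `delta` and `balance_after` columns on karma_events suggest an append-only ledger in which each row stores the running balance. The sketch below illustrates that idea with an in-memory list. It is an assumption about how the columns relate, and the event type names are hypothetical.

```elixir
defmodule KarmaLedgerSketch do
  # `events` holds one host's events, newest first; an empty ledger starts at 0.
  def record(events, event_type, delta) do
    previous =
      case events do
        [] -> 0
        [latest | _] -> latest.balance_after
      end

    # Each event carries the balance after it is applied, so the current
    # karma can be read from the newest row without summing the history.
    [%{event_type: event_type, delta: delta, balance_after: previous + delta} | events]
  end
end
```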
Telemetry and Metrics
The coordinator exposes Prometheus metrics via PromEx at GET /metrics (auth-protected in production). Metrics are organized into categories:
| Category | Polling Interval | Examples |
|---|---|---|
| Island fleet | 15s | Islands online, by region, by status |
| Jobs | 10s | Jobs by state, dispatch latency, completion rate |
| Billing | 30s | Revenue, charges, refunds, payout volume |
| Ecto | Automatic | Query duration, queue time, pool stats |
| Phoenix | Automatic | Request duration, WebSocket connections |
Structured logging with correlation IDs (job_id, host_id, workload_id) is available via Coordinator.StructuredLogger. In production, logs are emitted as JSON via Coordinator.JsonLogger.
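Correlation IDs of this kind are commonly attached through Elixir's standard `Logger.metadata/1`, which stores key-value pairs per process. The sketch below uses only that standard API; it does not reflect the interface of `Coordinator.StructuredLogger`, which is not documented here, and `LogSketch` is a hypothetical helper.

```elixir
defmodule LogSketch do
  require Logger

  # Runs `fun` with job_id and host_id attached to every log line it emits,
  # then removes the keys again (setting a metadata key to nil deletes it).
  def with_job_context(job_id, host_id, fun) do
    Logger.metadata(job_id: job_id, host_id: host_id)

    try do
      fun.()
    after
      Logger.metadata(job_id: nil, host_id: nil)
    end
  end

  # Example caller: the log line inherits the metadata set above.
  def dispatch(job_id, host_id) do
    with_job_context(job_id, host_id, fn ->
      Logger.info("job dispatched")
      :ok
    end)
  end
end
```

A JSON formatter can then emit the metadata as top-level fields, which makes it possible to filter production logs by a single job or Island.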
LiveView and Channels
Real-time features are delivered through Phoenix LiveView and Channels:
| Interface | Purpose |
|---|---|
| JobChannel | WebSocket channel for streaming job output to users (ownership-checked) |
| MobileAgentChannel | WebSocket channel for iOS/Android agent pairing and communication |
| Chat LiveView | Real-time LLM chat with token-by-token streaming |
| Admin dashboards | Island management, job monitoring, KYC verification |
Next Steps
Cargos
Understand runtime types, trust levels, and the Cargo execution model.
System Overview
See how all components fit together in the distributed architecture.
