Coordinator

Architecture of the Archipelag.io coordinator — the control plane that orchestrates Islands, jobs, and billing

The coordinator is the control plane for Archipelag.io. Built with Elixir and Phoenix, it handles Island registration, job dispatch, billing, karma tracking, and real-time streaming — all supervised by OTP for fault tolerance.

Supervision Tree

The coordinator starts as a standard OTP application under a :one_for_one supervisor:

Coordinator.Supervisor (:one_for_one)
│
├── CoordinatorWeb.Telemetry          # Phoenix telemetry handlers
├── Coordinator.Repo                  # Ecto/PostgreSQL connection pool
├── Coordinator.PromEx                # Prometheus metrics collection
├── DNSCluster                        # Fly.io DNS-based clustering
├── Phoenix.PubSub                    # In-memory pub/sub for real-time events
├── Oban                              # Background job queue (PostgreSQL-backed)
├── Jobs.OutputAccumulator            # Aggregates streaming output chunks
├── CoordinatorWeb.Endpoint           # Phoenix HTTP/WebSocket server
│
└── Nats.Supervisor (conditional)     # NATS message bus connection
    ├── Nats.HostHandler              # Host registration + heartbeats
    ├── Nats.JobHandler               # Job output + progress streaming
    └── Nats.StatusConsumer           # JetStream pull consumer for job status

NATS is conditional

The NATS supervisor starts only when NATS is configured and not explicitly disabled. It is always skipped in the test environment; in other environments, set `NATS_DISABLED=true` to disable it.
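
The conditional start described above can be sketched as a plain function that builds the child list. This is an illustrative sketch, not the coordinator's actual application module: `ChildrenSketch`, `nats_enabled?/3`, and the use of a non-empty NATS URL as the signal that NATS is configured are assumptions. Real child specs would also carry options that are omitted here.

```elixir
defmodule ChildrenSketch do
  # Hypothetical helper: true when the NATS supervisor should be started.
  # Assumption: a non-empty URL stands in for "NATS is configured".
  def nats_enabled?(env, nats_url, nats_disabled) do
    env != :test and is_binary(nats_url) and nats_url != "" and nats_disabled != "true"
  end

  # Builds the child list in the order shown in the supervision tree.
  # Child options are omitted; only the module names come from this page.
  def children(env, nats_url, nats_disabled) do
    base = [
      CoordinatorWeb.Telemetry,
      Coordinator.Repo,
      Coordinator.PromEx,
      DNSCluster,
      Phoenix.PubSub,
      Oban,
      Jobs.OutputAccumulator,
      CoordinatorWeb.Endpoint
    ]

    if nats_enabled?(env, nats_url, nats_disabled) do
      base ++ [Nats.Supervisor]
    else
      base
    end
  end
end

# Example: read the flag from the environment and inspect the last child.
children =
  ChildrenSketch.children(:prod, "nats://localhost:4222", System.get_env("NATS_DISABLED"))

IO.inspect(List.last(children))
```

Keeping the decision in one small predicate makes the three conditions (configured, not disabled, not the test environment) easy to check in isolation.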

Context Modules

The coordinator follows Phoenix’s context pattern — each domain area has a dedicated module that encapsulates its business logic and database access.

| Context | Module | Responsibility |
| --- | --- | --- |
| Jobs | `Coordinator.Jobs` | Job lifecycle: submit, dispatch, assign, status transitions, lease management, retry, cancellation |
| Hosts | `Coordinator.Hosts` | Island registration, heartbeats, capability matching, warm container tracking, pairing, suspension |
| Karma | `Coordinator.Karma` | Reputation scoring, compute-time earnings, penalties, monetization threshold, metro settings |
| Billing | `Coordinator.Billing` | Credit charges (atomic), refunds, dynamic pricing, Island payouts, quota checks |
| Accounts | `Coordinator.Accounts` | User accounts, authentication, API keys, magic links |
| Workloads | `Coordinator.Workloads` | Cargo catalog, requirements matching, runtime types |
| Cargo Registry | `Coordinator.Marketplace` | Cargo submissions, reviews, security scanning, publishing |
| Verification | `Coordinator.Verification` | KYC via Stripe Identity, admin verification dashboards |

NATS Message Handlers

The coordinator communicates with Islands through NATS. Three GenServer/consumer processes handle different message types:

HostHandler

Subscribes to core NATS subjects for Island lifecycle:

| Subject | Purpose |
| --- | --- |
| `coordinator.hosts.register` | Island registration: upserts the host record and broadcasts `:host_online` |
| `coordinator.hosts.pairing` | Pairing code generation (request/reply) |
| `host.*.heartbeat` | Heartbeat updates: refreshes `last_heartbeat_at` and `active_jobs` |
| `host.*.lease` | Lease renewal requests from Islands executing jobs |
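
In NATS subjects, `*` matches exactly one dot-separated token, so `host.*.heartbeat` covers every Island's heartbeat subject. The sketch below shows that matching rule and one way incoming subjects could be routed. `SubjectSketch` and its function names are hypothetical, and treating the middle token as the host ID is an assumption that this page does not state.

```elixir
defmodule SubjectSketch do
  # True when `subject` matches `pattern`; "*" matches exactly one token.
  def matches?(pattern, subject) do
    do_match(String.split(pattern, "."), String.split(subject, "."))
  end

  defp do_match([], []), do: true
  defp do_match(["*" | p], [_token | s]), do: do_match(p, s)
  defp do_match([token | p], [token | s]), do: do_match(p, s)
  defp do_match(_pattern, _subject), do: false

  # Routes a concrete subject to a tag the handler could dispatch on.
  # Assumption: the middle token of host.<id>.<kind> is the host ID.
  def route("coordinator.hosts.register"), do: :register
  def route("coordinator.hosts.pairing"), do: :pairing

  def route(subject) do
    case String.split(subject, ".") do
      ["host", host_id, "heartbeat"] -> {:heartbeat, host_id}
      ["host", host_id, "lease"] -> {:lease, host_id}
      _other -> :ignore
    end
  end
end

IO.inspect(SubjectSketch.matches?("host.*.heartbeat", "host.abc123.heartbeat"))
IO.inspect(SubjectSketch.route("host.abc123.lease"))
```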

JobHandler

Subscribes to Island output subjects for real-time streaming:

| Subject | Purpose |
| --- | --- |
| `host.*.output` | Text token chunks: accumulated and broadcast to WebSocket via PubSub |
| `host.*.progress` | Progress steps (e.g., image generation step 3/20) |
| `host.*.image` | Completed image data with format, dimensions, and seed |
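
The accumulation of text chunks can be illustrated with a small pure data structure. This is a sketch under stated assumptions, not the implementation of `Jobs.OutputAccumulator`: the module name `AccumulatorSketch`, keying by job ID, and joining chunks in arrival order are all assumptions.

```elixir
defmodule AccumulatorSketch do
  # State is a map of job_id => list of chunks, newest first.
  def new, do: %{}

  # Prepends the chunk; prepending to a list is O(1).
  def append(acc, job_id, chunk) when is_binary(chunk) do
    Map.update(acc, job_id, [chunk], fn chunks -> [chunk | chunks] end)
  end

  # Returns the joined output for a job and the state without that job.
  def flush(acc, job_id) do
    {chunks, rest} = Map.pop(acc, job_id, [])
    {chunks |> Enum.reverse() |> IO.iodata_to_binary(), rest}
  end
end

acc =
  AccumulatorSketch.new()
  |> AccumulatorSketch.append("job-1", "Hel")
  |> AccumulatorSketch.append("job-1", "lo")

{text, _rest} = AccumulatorSketch.flush(acc, "job-1")
IO.puts(text)
```

Storing chunks newest-first and reversing once at flush time avoids repeated string concatenation while tokens are streaming.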

StatusConsumer

A JetStream pull consumer on the JOB_STATUS stream. Processes durable job status messages (started, succeeded, failed) with at-least-once delivery guarantees. Malformed messages are acknowledged to prevent infinite redelivery.
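
The acknowledgement policy can be sketched as a function from a decoded message to a decision. The payload keys `"job_id"` and `"status"`, the module name `StatusSketch`, and the return shapes are hypothetical; only the three status values and the rule that malformed messages are acknowledged come from the text above.

```elixir
defmodule StatusSketch do
  # The three durable statuses named in this section.
  @statuses %{"started" => :started, "succeeded" => :succeeded, "failed" => :failed}

  # Input is the result of decoding a message body: {:ok, map} or {:error, reason}.
  def handle({:ok, %{"job_id" => job_id, "status" => status}}) when is_binary(job_id) do
    case Map.fetch(@statuses, status) do
      {:ok, state} -> {:ack, {:transition, job_id, state}}
      # Unknown status: acknowledge so the message is not redelivered forever.
      :error -> {:ack, :discarded}
    end
  end

  # Undecodable or incomplete payloads are acknowledged and dropped as well.
  def handle(_malformed), do: {:ack, :discarded}
end

IO.inspect(StatusSketch.handle({:ok, %{"job_id" => "j1", "status" => "succeeded"}}))
IO.inspect(StatusSketch.handle({:error, :invalid_json}))
```

Because delivery is at-least-once, the same status message may arrive more than once, so whatever applies the transition needs to tolerate duplicates.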

Background Workers

Oban provides reliable background job processing backed by PostgreSQL. Workers are scheduled via cron or enqueued on demand.

| Worker | Queue | Schedule | Purpose |
| --- | --- | --- | --- |
| `HostHealthWorker` | `maintenance` | Every 1 min | Marks Islands offline if heartbeat exceeds 30s timeout |
| `JobLeaseWorker` | `maintenance` | Every 1 min | Expires leases on assigned jobs, penalizes karma, requeues |
| `DeadLetterWorker` | `maintenance` | Every 5 min | Fails jobs stuck in assigned state past threshold |
| `JobRetryWorker` | `maintenance` | Every 5 min | Retries jobs waiting in queued state |
| `AuthCleanupWorker` | `maintenance` | Every 1 hour | Deletes expired magic link tokens and sessions |
| `ReputationWorker` | `maintenance` | Every 1 hour | Auto-suspends low-reputation Cargos, flags for review |
| `RescanScheduler` | `maintenance` | Weekly (Sun 3 AM) | Triggers security rescans for Cargos not scanned in 7+ days |
| `PayoutBatchWorker` | `payouts` | Weekly (Sun 2 AM) | Processes Stripe Connect payouts for eligible Islands |
| `WorkloadScanWorker` | `security_scans` | On demand | Runs trivy + grype container image scans |
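
As an illustration of the `HostHealthWorker` rule, the staleness check can be written as a pure function over timestamps. This is a sketch, not the worker itself: `HostHealthSketch`, the `:online` status value, and the map shape of a host are assumptions. Only the 30s timeout and the `last_heartbeat_at` field come from this page.

```elixir
defmodule HostHealthSketch do
  @timeout_seconds 30

  # True when the last heartbeat is strictly older than the timeout.
  def stale?(last_heartbeat_at, now, timeout \\ @timeout_seconds) do
    DateTime.diff(now, last_heartbeat_at, :second) > timeout
  end

  # IDs of hosts currently marked online whose heartbeat has gone stale.
  # Hosts that do not match the pattern (e.g. already offline) are skipped.
  def offline_ids(hosts, now) do
    for %{id: id, status: :online, last_heartbeat_at: hb} <- hosts,
        stale?(hb, now),
        do: id
  end
end

now = ~U[2026-01-01 12:00:00Z]

hosts = [
  %{id: "a", status: :online, last_heartbeat_at: ~U[2026-01-01 11:59:50Z]},
  %{id: "b", status: :online, last_heartbeat_at: ~U[2026-01-01 11:59:20Z]}
]

IO.inspect(HostHealthSketch.offline_ids(hosts, now))
```

Since the worker runs once a minute while the timeout is 30s, under this sketch an Island whose heartbeats stop would be marked offline somewhere between 30 and roughly 90 seconds later.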

Database Schema

The coordinator uses PostgreSQL with Ecto. Key tables and their primary key types:

| Table | PK Type | Key Fields |
| --- | --- | --- |
| `users` | integer | `email`, `credits`, `kyc_status`, `banned` |
| `hosts` | `binary_id` (UUID) | `name`, `status`, `region`, `karma`, `capabilities`, `last_heartbeat_at` |
| `jobs` | `binary_id` (UUID) | `state`, `user_id`, `host_id`, `workload_id`, `metadata`, `lease_expires_at` |
| `workloads` | integer | `name`, `slug`, `runtime_type`, `trust_level`, `sandbox_tier`, pricing fields |
| `karma_events` | integer | `host_id`, `job_id`, `event_type`, `delta`, `balance_after` |
| `transactions` | integer | `user_id`, `job_id`, `amount`, `type` (charge/refund) |
| `workload_submissions` | integer | `developer_id`, `status`, scan results, signature fields |

Primary key types matter

Jobs and hosts use UUID (`binary_id`) primary keys. Any foreign key referencing them must specify `type: :binary_id` in migrations and schemas. Users and workloads use standard integer auto-increment IDs.
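
The fragment below illustrates the mixed key types in a migration and a schema. It is not standalone: it needs Ecto and is meant to be read in the context of a Phoenix project. The `example_events` table and the module names `Coordinator.ExampleEvent`, `Coordinator.Accounts.User`, `Coordinator.Jobs.Job`, and `Coordinator.Hosts.Host` are hypothetical and chosen only for illustration.

```elixir
defmodule Coordinator.Repo.Migrations.CreateExampleEvents do
  use Ecto.Migration

  def change do
    # Hypothetical table; it gets the default integer primary key.
    create table(:example_events) do
      # users has an integer primary key, so the default reference type fits.
      add :user_id, references(:users), null: false
      # jobs and hosts have UUID primary keys, so the type must be stated.
      add :job_id, references(:jobs, type: :binary_id), null: false
      add :host_id, references(:hosts, type: :binary_id)

      timestamps()
    end
  end
end

defmodule Coordinator.ExampleEvent do
  use Ecto.Schema

  schema "example_events" do
    # Integer foreign key: no type option needed.
    belongs_to :user, Coordinator.Accounts.User
    # UUID foreign keys: the type must match the referenced primary key.
    belongs_to :job, Coordinator.Jobs.Job, type: :binary_id
    belongs_to :host, Coordinator.Hosts.Host, type: :binary_id

    timestamps()
  end
end
```

Leaving out `type: :binary_id` on either side would make Ecto treat the column as an integer ID, which does not match the UUID primary key it points at.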

Telemetry and Metrics

The coordinator exposes Prometheus metrics via PromEx at GET /metrics (auth-protected in production). Metrics are organized into categories:

| Category | Polling Interval | Examples |
| --- | --- | --- |
| Island fleet | 15s | Islands online, by region, by status |
| Jobs | 10s | Jobs by state, dispatch latency, completion rate |
| Billing | 30s | Revenue, charges, refunds, payout volume |
| Ecto | Automatic | Query duration, queue time, pool stats |
| Phoenix | Automatic | Request duration, WebSocket connections |

Structured logging with correlation IDs (job_id, host_id, workload_id) is available via Coordinator.StructuredLogger. In production, logs are emitted as JSON via Coordinator.JsonLogger.
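
Correlation IDs of this kind are commonly carried in Logger metadata. The sketch below uses only the standard `Logger` metadata functions and does not reflect the actual API of `Coordinator.StructuredLogger`, which this page does not describe. `CorrelationSketch` and `with_ids/2` are hypothetical names.

```elixir
defmodule CorrelationSketch do
  # Runs `fun` with the given correlation IDs attached to this process's
  # Logger metadata, then restores whatever metadata was set before.
  def with_ids(ids, fun) when is_list(ids) and is_function(fun, 0) do
    previous = Logger.metadata()
    Logger.metadata(ids)

    try do
      fun.()
    after
      Logger.reset_metadata(previous)
    end
  end
end

seen =
  CorrelationSketch.with_ids([job_id: "job-123", host_id: "host-9"], fn ->
    # Any Logger call made here would carry job_id and host_id.
    Keyword.take(Logger.metadata(), [:job_id, :host_id])
  end)

IO.inspect(seen)
```

Restoring the previous metadata in an `after` block keeps IDs from one job from leaking into log lines for the next job handled by the same process.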

LiveView and Channels

Real-time features are delivered through Phoenix LiveView and Channels:

| Interface | Purpose |
| --- | --- |
| `JobChannel` | WebSocket channel for streaming job output to users (ownership-checked) |
| `MobileAgentChannel` | WebSocket channel for iOS/Android agent pairing and communication |
| Chat LiveView | Real-time LLM chat with token-by-token streaming |
| Admin dashboards | Island management, job monitoring, KYC verification |
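
The ownership check on `JobChannel` can be pictured as a pure authorization function that a channel join would call. This is a sketch under assumptions: the `"job:<id>"` topic format, the module name `JoinSketch`, and the `owner_of` lookup function are all hypothetical, and a real channel would do the lookup through the `Coordinator.Jobs` context.

```elixir
defmodule JoinSketch do
  # `owner_of` is a hypothetical lookup returning {:ok, user_id} or :error.
  # The join is allowed only when the job exists and belongs to `user_id`.
  def authorize("job:" <> job_id, user_id, owner_of) when job_id != "" do
    case owner_of.(job_id) do
      {:ok, ^user_id} -> {:ok, job_id}
      _other -> {:error, :unauthorized}
    end
  end

  # Any topic that is not "job:<id>" is rejected.
  def authorize(_topic, _user_id, _owner_of), do: {:error, :unauthorized}
end

owners = %{"j1" => 42}
lookup = fn job_id -> Map.fetch(owners, job_id) end

IO.inspect(JoinSketch.authorize("job:j1", 42, lookup))
IO.inspect(JoinSketch.authorize("job:j1", 7, lookup))
```

Returning the same error for a missing job and for someone else's job avoids revealing which job IDs exist.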

Next Steps

[Island](/architecture/node-agent/)

Learn how the Island software executes jobs and communicates with the coordinator.

Cargos

Understand runtime types, trust levels, and the Cargo execution model.

System Overview

See how all components fit together in the distributed architecture.