Archipelag.io is in open beta until June 2026. All credits and earnings are virtual.

A/B Testing

Deploy multiple model versions, split traffic between them, and measure which performs best

Experimental
A/B testing code exists and the API endpoints are functional, but this feature has not been verified in production with real traffic. Expect breaking changes during beta.

Test different models or configurations against each other with real traffic. The coordinator routes a percentage of jobs to each variant and tracks latency, success rate, and throughput per variant.

How It Works

You create an A/B test:
  Base Cargo: llm-chat (the one consumers request)
  Variant A (control): Mistral 7B → 80% traffic
  Variant B (challenger): Llama 8B → 20% traffic

Consumers submit jobs for "llm-chat" as usual.
The coordinator randomly routes:
  80% → Mistral 7B
  20% → Llama 8B

Per-variant metrics tracked:
  avg latency, success rate, tokens generated

When you have enough data → promote the winner.
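The routing step above can be sketched as a weighted random draw. This is an illustrative sketch, not the coordinator's actual code; the variant names and splits mirror the example:

```python
import random

# Variants as defined in the example test (control 80% / challenger 20%).
VARIANTS = [
    {"name": "control", "workload_id": 42, "traffic_percent": 80},
    {"name": "challenger", "workload_id": 57, "traffic_percent": 20},
]

def route_job(variants, rng=random):
    """Pick a variant with probability proportional to traffic_percent."""
    roll = rng.uniform(0, 100)
    cumulative = 0
    for v in variants:
        cumulative += v["traffic_percent"]
        if roll < cumulative:
            return v
    return variants[-1]  # guard against floating-point edge cases

# Simulate 10,000 jobs; the split converges on 80/20.
rng = random.Random(0)  # seeded for reproducibility
counts = {"control": 0, "challenger": 0}
for _ in range(10_000):
    counts[route_job(VARIANTS, rng)["name"]] += 1
print(counts)
```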

Creating a Test

POST /api/v1/ab-tests
{
  "name": "Mistral 7B vs Llama 8B",
  "base_workload_id": 42,
  "variants": [
    {"name": "control", "workload_id": 42, "traffic_percent": 80},
    {"name": "challenger", "workload_id": 57, "traffic_percent": 20}
  ]
}

Then activate it:

POST /api/v1/ab-tests/{id}/activate
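A minimal client sketch for both calls, using only Python's standard library. The base URL is an assumption for illustration, and "test-uuid" is a placeholder — the real create response returns the id to use in the activate call:

```python
import json
from urllib import request

API = "https://api.archipelag.io/api/v1"  # assumed base URL; adjust for your deployment

payload = {
    "name": "Mistral 7B vs Llama 8B",
    "base_workload_id": 42,
    "variants": [
        {"name": "control", "workload_id": 42, "traffic_percent": 80},
        {"name": "challenger", "workload_id": 57, "traffic_percent": 20},
    ],
}

# Sanity check before submitting: the splits must cover all traffic.
total = sum(v["traffic_percent"] for v in payload["variants"])
assert total == 100, f"traffic_percent must sum to 100, got {total}"

create = request.Request(
    f"{API}/ab-tests",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# The create response returns the test id; "test-uuid" is a placeholder here.
test_id = "test-uuid"
activate = request.Request(f"{API}/ab-tests/{test_id}/activate", method="POST")
# request.urlopen(create); request.urlopen(activate)  # needs credentials + a live API
```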

Viewing Results

GET /api/v1/ab-tests/{id}/results
{
  "variants": [
    {
      "name": "control",
      "traffic_percent": 80,
      "job_count": 800,
      "avg_latency_ms": 450,
      "success_rate": 99.2,
      "total_tokens": 384000
    },
    {
      "name": "challenger",
      "traffic_percent": 20,
      "job_count": 200,
      "avg_latency_ms": 320,
      "success_rate": 98.5,
      "total_tokens": 96000
    }
  ]
}

In this example, the challenger (Llama 8B) is faster (320ms vs 450ms) but slightly less reliable (98.5% vs 99.2%). Whether to promote depends on your priorities.
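A quick client-side way to work with these numbers, computed directly from the example response above:

```python
# Per-variant figures copied from the example results payload.
variants = [
    {"name": "control", "job_count": 800, "avg_latency_ms": 450,
     "success_rate": 99.2, "total_tokens": 384000},
    {"name": "challenger", "job_count": 200, "avg_latency_ms": 320,
     "success_rate": 98.5, "total_tokens": 96000},
]

# Traffic-weighted mean latency across the whole test.
total_jobs = sum(v["job_count"] for v in variants)
overall_latency = sum(v["avg_latency_ms"] * v["job_count"] for v in variants) / total_jobs

# Output volume per job, for comparing how much each variant generates.
tokens_per_job = {v["name"]: v["total_tokens"] / v["job_count"] for v in variants}

print(overall_latency)  # 424.0 ms across both variants
print(tokens_per_job)   # both average 480 tokens per job
```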

Consumer Transparency

A/B testing is invisible to Consumers. They submit jobs for the base Cargo as usual. The only difference is a metadata field on the response showing which variant was used:

{
  "metadata": {
    "ab_variant_name": "challenger",
    "ab_test_id": "test-uuid"
  }
}

Test Lifecycle

Action      Effect
Create      Define test + variants (status: draft)
Activate    Start routing traffic (status: active)
Pause       Stop routing, keep metrics (status: paused)
Complete    End permanently (status: completed)
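The lifecycle can be modeled as a small transition table. Two assumptions in this sketch: a paused test can be re-activated, and completed is terminal — check the API for the exact rules:

```python
# Allowed actions per status (assumed transitions; see caveats above).
TRANSITIONS = {
    "draft": {"activate"},
    "active": {"pause", "complete"},
    "paused": {"activate", "complete"},
    "completed": set(),  # terminal
}
NEXT_STATUS = {"activate": "active", "pause": "paused", "complete": "completed"}

def apply(status, action):
    """Return the new status, or raise if the action is not allowed."""
    if action not in TRANSITIONS[status]:
        raise ValueError(f"cannot {action} a {status} test")
    return NEXT_STATUS[action]

status = "draft"
for action in ("activate", "pause", "activate", "complete"):
    status = apply(status, action)
print(status)  # completed
```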

Metrics Tracked

Metric         Per Variant   How Measured
Job count      Yes           Incremented on each job assignment
Avg latency    Yes           Total latency ÷ job count
Success rate   Yes           Succeeded ÷ total × 100%
Total tokens   Yes           Sum of tokens_generated
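These counters are easy to mirror client-side. A sketch of how such a per-variant accumulator might work (not the platform's internal code):

```python
from dataclasses import dataclass

@dataclass
class VariantMetrics:
    """Per-variant counters matching the table above."""
    job_count: int = 0
    succeeded: int = 0
    total_latency_ms: float = 0.0
    total_tokens: int = 0

    def record(self, latency_ms, tokens, ok):
        self.job_count += 1
        self.succeeded += ok          # bool counts as 0/1
        self.total_latency_ms += latency_ms
        self.total_tokens += tokens

    @property
    def avg_latency_ms(self):
        return self.total_latency_ms / self.job_count

    @property
    def success_rate(self):
        return self.succeeded / self.job_count * 100

m = VariantMetrics()
m.record(400, 500, True)
m.record(500, 460, False)
print(m.avg_latency_ms, m.success_rate)  # 450.0 50.0
```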

Billing

Each variant is billed at its own Cargo’s rate. If variant A costs 1 credit and variant B costs 2 credits, the Consumer pays based on whichever variant their job was routed to.
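For example, with the assumed rates of 1 and 2 credits per job and the 80/20 split from earlier (workload ids from the create example; the rates are hypothetical):

```python
# workload_id -> credits per job (hypothetical rates for illustration)
RATES = {42: 1, 57: 2}

def charge(workload_id, jobs):
    """Credits owed for `jobs` completions on a given Cargo."""
    return RATES[workload_id] * jobs

# Of 1000 jobs: 800 routed to control (workload 42), 200 to challenger (57).
total_credits = charge(42, 800) + charge(57, 200)
print(total_credits)  # 1200 credits across all Consumers
```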

Statistical Significance

The platform automatically computes statistical significance for each comparison:

  • p-values for success rate (z-test) and latency (t-test)
  • Confidence intervals showing the likely range of difference
  • Winner determination: control, challenger, or inconclusive
  • Sample size estimate: how many more jobs are needed for a conclusive result

View these on the dashboard at /ab-tests/{id} or via the results API.
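The success-rate comparison can be reproduced with a standard two-proportion z-test. This is a textbook sketch, not necessarily the platform's exact formula; the whole-number success counts approximate the example's rates (794/800 vs 197/200):

```python
import math

def success_rate_z_test(s1, n1, s2, n2):
    """Two-sided two-proportion z-test on success counts."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the normal tail
    return z, p_value

z, p = success_rate_z_test(794, 800, 197, 200)
print(round(p, 3))  # > 0.05 -> inconclusive at this sample size
```

With only 200 challenger jobs, the 0.7-point gap in success rate is nowhere near significant — which is exactly why the platform also reports a sample size estimate.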

Auto-Promote

When a variant wins with statistical significance (p < 0.05) and has been tested on 100+ jobs, you can auto-promote it — the platform shifts 100% traffic to the winner and completes the test. Available as a button on the dashboard or via the API.
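The promotion gate reduces to a simple predicate. The thresholds come from the text above; the `winner` argument stands in for the significance analysis's verdict:

```python
MIN_JOBS = 100  # minimum jobs on the winning variant, per the text above
ALPHA = 0.05    # significance threshold, per the text above

def can_auto_promote(p_value, winner_jobs, winner):
    """True when a significant winner has seen enough traffic."""
    return winner is not None and p_value < ALPHA and winner_jobs >= MIN_JOBS

print(can_auto_promote(0.03, 200, "challenger"))  # True
print(can_auto_promote(0.32, 200, "challenger"))  # False: not significant
print(can_auto_promote(0.03, 40, "challenger"))   # False: too few jobs
```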

Consistent User Experience

The same user always sees the same variant throughout a test (sticky sessions). This prevents confusion from seeing different model responses on repeat visits.
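Sticky assignment is commonly implemented by hashing a stable user key into a bucket, so no per-user state needs to be stored. A sketch of one such scheme — the platform's actual mechanism is not specified here:

```python
import hashlib

def sticky_variant(user_id, test_id, variants):
    """Deterministically map (user, test) to a bucket in 0-99, then walk the splits."""
    digest = hashlib.sha256(f"{test_id}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    cumulative = 0
    for v in variants:
        cumulative += v["traffic_percent"]
        if bucket < cumulative:
            return v["name"]
    return variants[-1]["name"]

variants = [{"name": "control", "traffic_percent": 80},
            {"name": "challenger", "traffic_percent": 20}]

# The same user + test always maps to the same variant:
a = sticky_variant("user-123", "test-uuid", variants)
b = sticky_variant("user-123", "test-uuid", variants)
print(a == b)  # True
```

Including the test id in the hash means the same user can land in different variants across different tests, while staying consistent within any one test.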

Dashboard

Monitor your A/B tests in real-time at app.archipelag.io/ab-tests:

  • Per-variant metric cards (latency, success rate, tokens)
  • Statistical significance analysis with p-values and confidence intervals
  • Lifecycle controls (activate, pause, promote, complete)
  • Auto-refreshing every 5 seconds

Use Cases

Scenario              Example
Model comparison      Mistral 7B vs Llama 8B — which is faster/better?
Prompt engineering    Same model, different system prompts — which produces better output?
Quantization testing  Q4 vs Q8 of the same model — quality vs speed tradeoff
New model rollout     Route 5% to the new model, 95% to the proven one. Gradually increase.