Archipelag.io is in open beta until June 2026. All credits and earnings are virtual.

A/B Testing

Deploy multiple model versions, split traffic between them, and measure which performs best

Experimental
A/B testing code exists and the API endpoints are functional, but this feature has not been verified in production with real traffic. Expect breaking changes during beta.

Test different models or configurations against each other with real traffic. The coordinator routes a percentage of jobs to each variant and tracks latency, success rate, and throughput per variant.

How It Works

You create an A/B test:
  Base Cargo: llm-chat (the one consumers request)
  Variant A (control): Mistral 7B → 80% traffic
  Variant B (challenger): Llama 8B → 20% traffic

Consumers submit jobs for "llm-chat" as usual.
The coordinator randomly routes:
  80% → Mistral 7B
  20% → Llama 8B

Per-variant metrics tracked:
  avg latency, success rate, tokens generated

When you have enough data → promote the winner.
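The routing step above can be sketched as a weighted random draw. This is an illustrative sketch, not the coordinator's actual code; the variant names and splits mirror the example:

```python
import random

# Variants as defined in the example test (control 80% / challenger 20%).
VARIANTS = [
    {"name": "control", "workload_id": 42, "traffic_percent": 80},
    {"name": "challenger", "workload_id": 57, "traffic_percent": 20},
]

def route_job(variants, rng=random):
    """Pick a variant with probability proportional to traffic_percent."""
    roll = rng.uniform(0, 100)
    cumulative = 0
    for v in variants:
        cumulative += v["traffic_percent"]
        if roll < cumulative:
            return v
    return variants[-1]  # guard against floating-point edge cases

# Simulate 10,000 jobs; the split converges on 80/20.
rng = random.Random(0)  # seeded for reproducibility
counts = {"control": 0, "challenger": 0}
for _ in range(10_000):
    counts[route_job(VARIANTS, rng)["name"]] += 1
print(counts)
```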

Creating a Test

POST /api/v1/ab-tests
{
  "name": "Mistral 7B vs Llama 8B",
  "base_workload_id": 42,
  "variants": [
    {"name": "control", "workload_id": 42, "traffic_percent": 80},
    {"name": "challenger", "workload_id": 57, "traffic_percent": 20}
  ]
}

Then activate it:

POST /api/v1/ab-tests/{id}/activate
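A minimal client sketch for both calls, using only Python's standard library. The base URL is an assumption for illustration, and "test-uuid" is a placeholder — the real create response returns the id to use in the activate call:

```python
import json
from urllib import request

API = "https://api.archipelag.io/api/v1"  # assumed base URL; adjust for your deployment

payload = {
    "name": "Mistral 7B vs Llama 8B",
    "base_workload_id": 42,
    "variants": [
        {"name": "control", "workload_id": 42, "traffic_percent": 80},
        {"name": "challenger", "workload_id": 57, "traffic_percent": 20},
    ],
}

# Sanity check before submitting: the splits must cover all traffic.
total = sum(v["traffic_percent"] for v in payload["variants"])
assert total == 100, f"traffic_percent must sum to 100, got {total}"

create = request.Request(
    f"{API}/ab-tests",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# The create response returns the test id; "test-uuid" is a placeholder here.
test_id = "test-uuid"
activate = request.Request(f"{API}/ab-tests/{test_id}/activate", method="POST")
# request.urlopen(create); request.urlopen(activate)  # needs credentials + a live API
```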

Viewing Results

GET /api/v1/ab-tests/{id}/results
{
  "variants": [
    {
      "name": "control",
      "traffic_percent": 80,
      "job_count": 800,
      "avg_latency_ms": 450,
      "success_rate": 99.2,
      "total_tokens": 384000
    },
    {
      "name": "challenger",
      "traffic_percent": 20,
      "job_count": 200,
      "avg_latency_ms": 320,
      "success_rate": 98.5,
      "total_tokens": 96000
    }
  ]
}

In this example, the challenger (Llama 8B) is faster (320ms vs 450ms) but slightly less reliable (98.5% vs 99.2%). Whether to promote depends on your priorities.
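A quick client-side way to work with these numbers, computed directly from the example response above:

```python
# Per-variant figures copied from the example results payload.
variants = [
    {"name": "control", "job_count": 800, "avg_latency_ms": 450,
     "success_rate": 99.2, "total_tokens": 384000},
    {"name": "challenger", "job_count": 200, "avg_latency_ms": 320,
     "success_rate": 98.5, "total_tokens": 96000},
]

# Traffic-weighted mean latency across the whole test.
total_jobs = sum(v["job_count"] for v in variants)
overall_latency = sum(v["avg_latency_ms"] * v["job_count"] for v in variants) / total_jobs

# Output volume per job, for comparing how much each variant generates.
tokens_per_job = {v["name"]: v["total_tokens"] / v["job_count"] for v in variants}

print(overall_latency)  # 424.0 ms across both variants
print(tokens_per_job)   # both average 480 tokens per job
```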

Consumer Transparency

A/B testing is invisible to Consumers. They submit jobs for the base Cargo as usual. The only difference is a metadata field on the response showing which variant was used:

{
  "metadata": {
    "ab_variant_name": "challenger",
    "ab_test_id": "test-uuid"
  }
}

Test Lifecycle

Action      Effect
Create      Define test + variants (status: draft)
Activate    Start routing traffic (status: active)
Pause       Stop routing, keep metrics (status: paused)
Complete    End permanently (status: completed)
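The lifecycle can be modeled as a small transition table. Two assumptions in this sketch: a paused test can be re-activated, and completed is terminal — check the API for the exact rules:

```python
# Allowed actions per status (assumed transitions; see caveats above).
TRANSITIONS = {
    "draft": {"activate"},
    "active": {"pause", "complete"},
    "paused": {"activate", "complete"},
    "completed": set(),  # terminal
}
NEXT_STATUS = {"activate": "active", "pause": "paused", "complete": "completed"}

def apply(status, action):
    """Return the new status, or raise if the action is not allowed."""
    if action not in TRANSITIONS[status]:
        raise ValueError(f"cannot {action} a {status} test")
    return NEXT_STATUS[action]

status = "draft"
for action in ("activate", "pause", "activate", "complete"):
    status = apply(status, action)
print(status)  # completed
```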

Metrics Tracked

Metric         Per Variant   How Measured
Job count      Yes           Incremented on each job assignment
Avg latency    Yes           Total latency ÷ job count
Success rate   Yes           Succeeded ÷ total × 100%
Total tokens   Yes           Sum of tokens_generated
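These counters are easy to mirror client-side. A sketch of how such a per-variant accumulator might work (not the platform's internal code):

```python
from dataclasses import dataclass

@dataclass
class VariantMetrics:
    """Per-variant counters matching the table above."""
    job_count: int = 0
    succeeded: int = 0
    total_latency_ms: float = 0.0
    total_tokens: int = 0

    def record(self, latency_ms, tokens, ok):
        self.job_count += 1
        self.succeeded += ok          # bool counts as 0/1
        self.total_latency_ms += latency_ms
        self.total_tokens += tokens

    @property
    def avg_latency_ms(self):
        return self.total_latency_ms / self.job_count

    @property
    def success_rate(self):
        return self.succeeded / self.job_count * 100

m = VariantMetrics()
m.record(400, 500, True)
m.record(500, 460, False)
print(m.avg_latency_ms, m.success_rate)  # 450.0 50.0
```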

Billing

Each variant is billed at its own Cargo’s rate. If variant A costs 1 credit and variant B costs 2 credits, the Consumer pays based on whichever variant their job was routed to.
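For example, with the assumed rates of 1 and 2 credits per job and the 80/20 split from earlier (workload ids from the create example; the rates are hypothetical):

```python
# workload_id -> credits per job (hypothetical rates for illustration)
RATES = {42: 1, 57: 2}

def charge(workload_id, jobs):
    """Credits owed for `jobs` completions on a given Cargo."""
    return RATES[workload_id] * jobs

# Of 1000 jobs: 800 routed to control (workload 42), 200 to challenger (57).
total_credits = charge(42, 800) + charge(57, 200)
print(total_credits)  # 1200 credits across all Consumers
```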

Statistical Significance

The platform automatically computes statistical significance for each comparison:

  • p-values for success rate (z-test) and latency (t-test)
  • Confidence intervals showing the likely range of difference
  • Winner determination: control, challenger, or inconclusive
  • Sample size estimate: how many more jobs are needed for a conclusive result

View these on the dashboard at /ab-tests/{id} or via the results API.
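The success-rate comparison can be reproduced with a standard two-proportion z-test. This is a textbook sketch, not necessarily the platform's exact formula; the whole-number success counts approximate the example's rates (794/800 vs 197/200):

```python
import math

def success_rate_z_test(s1, n1, s2, n2):
    """Two-sided two-proportion z-test on success counts."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the normal tail
    return z, p_value

z, p = success_rate_z_test(794, 800, 197, 200)
print(round(p, 3))  # > 0.05 -> inconclusive at this sample size
```

With only 200 challenger jobs, the 0.7-point gap in success rate is nowhere near significant — which is exactly why the platform also reports a sample size estimate.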

Auto-Promote

When a variant wins with statistical significance (p < 0.05) and has been tested on 100+ jobs, you can auto-promote it — the platform shifts 100% traffic to the winner and completes the test. Available as a button on the dashboard or via the API.
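The promotion gate reduces to a simple predicate. The thresholds come from the text above; the `winner` argument stands in for the significance analysis's verdict:

```python
MIN_JOBS = 100  # minimum jobs on the winning variant, per the text above
ALPHA = 0.05    # significance threshold, per the text above

def can_auto_promote(p_value, winner_jobs, winner):
    """True when a significant winner has seen enough traffic."""
    return winner is not None and p_value < ALPHA and winner_jobs >= MIN_JOBS

print(can_auto_promote(0.03, 200, "challenger"))  # True
print(can_auto_promote(0.32, 200, "challenger"))  # False: not significant
print(can_auto_promote(0.03, 40, "challenger"))   # False: too few jobs
```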

Consistent User Experience

The same user always sees the same variant throughout a test (sticky sessions). This prevents confusion from seeing different model responses on repeat visits.
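Sticky assignment is commonly implemented by hashing a stable user key into a bucket, so no per-user state needs to be stored. A sketch of one such scheme — the platform's actual mechanism is not specified here:

```python
import hashlib

def sticky_variant(user_id, test_id, variants):
    """Deterministically map (user, test) to a bucket in 0-99, then walk the splits."""
    digest = hashlib.sha256(f"{test_id}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    cumulative = 0
    for v in variants:
        cumulative += v["traffic_percent"]
        if bucket < cumulative:
            return v["name"]
    return variants[-1]["name"]

variants = [{"name": "control", "traffic_percent": 80},
            {"name": "challenger", "traffic_percent": 20}]

# The same user + test always maps to the same variant:
a = sticky_variant("user-123", "test-uuid", variants)
b = sticky_variant("user-123", "test-uuid", variants)
print(a == b)  # True
```

Including the test id in the hash means the same user can land in different variants across different tests, while staying consistent within any one test.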

Dashboard

Monitor your A/B tests in real-time at app.archipelag.io/ab-tests:

  • Per-variant metric cards (latency, success rate, tokens)
  • Statistical significance analysis with p-values and confidence intervals
  • Lifecycle controls (activate, pause, promote, complete)
  • Auto-refreshing every 5 seconds

Use Cases

Scenario              Example
Model comparison      Mistral 7B vs Llama 8B — which is faster/better?
Prompt engineering    Same model, different system prompts — which produces better output?
Quantization testing  Q4 vs Q8 of the same model — quality vs speed tradeoff
New model rollout     Route 5% to the new model, 95% to the proven one. Gradually increase.