# A/B Testing

Deploy multiple model versions, split traffic between them, and measure which performs best.
Test different models or configurations against each other with real traffic. The coordinator routes a percentage of jobs to each variant and tracks latency, success rate, and throughput per variant.
## How It Works
You create an A/B test:

- **Base Cargo:** llm-chat (the one Consumers request)
- **Variant A (control):** Mistral 7B → 80% traffic
- **Variant B (challenger):** Llama 8B → 20% traffic

Consumers submit jobs for "llm-chat" as usual. The coordinator randomly routes 80% of jobs to Mistral 7B and 20% to Llama 8B, tracking per-variant metrics: average latency, success rate, and tokens generated. When you have enough data, promote the winner.
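The routing step amounts to a weighted random choice. A minimal sketch (illustrative only; `pick_variant` and the variant dicts are not part of the platform API):

```python
import random

def pick_variant(variants, rng=random):
    """Weighted random routing: choose a variant in proportion to its traffic share."""
    names = [v["name"] for v in variants]
    weights = [v["traffic_percent"] for v in variants]
    return rng.choices(names, weights=weights, k=1)[0]

variants = [
    {"name": "control", "traffic_percent": 80},
    {"name": "challenger", "traffic_percent": 20},
]

# Over many jobs the split converges to roughly 80/20.
rng = random.Random(42)
counts = {"control": 0, "challenger": 0}
for _ in range(10_000):
    counts[pick_variant(variants, rng)] += 1
```

Any individual job can land on either variant; only the long-run split matches the configured percentages.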
## Creating a Test

```http
POST /api/v1/ab-tests

{
  "name": "Mistral 7B vs Llama 8B",
  "base_workload_id": 42,
  "variants": [
    {"name": "control", "workload_id": 42, "traffic_percent": 80},
    {"name": "challenger", "workload_id": 57, "traffic_percent": 20}
  ]
}
```

Then activate it:

```http
POST /api/v1/ab-tests/{id}/activate
```
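The same calls can be scripted from Python with just the standard library. A sketch — `BASE_URL` is a placeholder, auth headers are omitted, and the commented lines assume the create call returns the new test's `id`:

```python
import json
import urllib.request

BASE_URL = "https://coordinator.example.com"  # placeholder: your coordinator URL

def api_post(path, payload=None):
    """POST an optional JSON payload and return the decoded JSON response."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

test_payload = {
    "name": "Mistral 7B vs Llama 8B",
    "base_workload_id": 42,
    "variants": [
        {"name": "control", "workload_id": 42, "traffic_percent": 80},
        {"name": "challenger", "workload_id": 57, "traffic_percent": 20},
    ],
}
# Traffic percentages must cover all jobs.
assert sum(v["traffic_percent"] for v in test_payload["variants"]) == 100

# created = api_post("/api/v1/ab-tests", test_payload)
# api_post(f"/api/v1/ab-tests/{created['id']}/activate")
```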
## Viewing Results

```http
GET /api/v1/ab-tests/{id}/results
```

```json
{
  "variants": [
    {
      "name": "control",
      "traffic_percent": 80,
      "job_count": 800,
      "avg_latency_ms": 450,
      "success_rate": 99.2,
      "total_tokens": 384000
    },
    {
      "name": "challenger",
      "traffic_percent": 20,
      "job_count": 200,
      "avg_latency_ms": 320,
      "success_rate": 98.5,
      "total_tokens": 96000
    }
  ]
}
```
In this example, the challenger (Llama 8B) is faster (320ms vs 450ms) but slightly less reliable (98.5% vs 99.2%). Whether to promote depends on your priorities.
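One way to make "depends on your priorities" concrete is an explicit threshold on acceptable reliability loss. A sketch — `max_success_drop` is an assumption for illustration, not a platform setting:

```python
results = {
    "variants": [
        {"name": "control", "avg_latency_ms": 450, "success_rate": 99.2, "job_count": 800},
        {"name": "challenger", "avg_latency_ms": 320, "success_rate": 98.5, "job_count": 200},
    ]
}

def summarize(results, max_success_drop=0.5):
    """Favor the faster variant, but only if success rate hasn't dropped too far."""
    control, challenger = results["variants"]
    faster = challenger["avg_latency_ms"] < control["avg_latency_ms"]
    reliable = control["success_rate"] - challenger["success_rate"] <= max_success_drop
    if faster and reliable:
        return "promote challenger"
    return "keep control"
```

With the numbers above, the 0.7-point success-rate drop exceeds a 0.5-point tolerance, so `summarize(results)` keeps the control; a team that tolerates a 1-point drop would promote the challenger.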
## Consumer Transparency

A/B testing is invisible to Consumers. They submit jobs for the base Cargo as usual. The only difference is a metadata field on the response showing which variant was used:

```json
{
  "metadata": {
    "ab_variant_name": "challenger",
    "ab_test_id": "test-uuid"
  }
}
```
## Test Lifecycle
| Action | Effect |
|---|---|
| Create | Define test + variants (status: draft) |
| Activate | Start routing traffic (status: active) |
| Pause | Stop routing, keep metrics (status: paused) |
| Complete | End permanently (status: completed) |
## Metrics Tracked
| Metric | Per Variant | How Measured |
|---|---|---|
| Job count | Yes | Incremented on each job assignment |
| Avg latency | Yes | Total latency ÷ job count |
| Success rate | Yes | Succeeded ÷ total × 100% |
| Total tokens | Yes | Sum of tokens_generated |
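The table's counters can be kept with a few running sums per variant, so each derived metric is a constant-time property. A sketch of the bookkeeping (names are illustrative, not the coordinator's internals):

```python
from dataclasses import dataclass

@dataclass
class VariantStats:
    """Running per-variant counters matching the metrics table."""
    job_count: int = 0
    succeeded: int = 0
    total_latency_ms: float = 0.0
    total_tokens: int = 0

    def record(self, latency_ms, tokens, success):
        """Called on each completed job for this variant."""
        self.job_count += 1
        self.succeeded += int(success)
        self.total_latency_ms += latency_ms
        self.total_tokens += tokens

    @property
    def avg_latency_ms(self):
        # Total latency ÷ job count
        return self.total_latency_ms / self.job_count if self.job_count else 0.0

    @property
    def success_rate(self):
        # Succeeded ÷ total × 100%
        return 100.0 * self.succeeded / self.job_count if self.job_count else 0.0

stats = VariantStats()
stats.record(400, 500, True)
stats.record(500, 460, False)
```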
## Billing
Each variant is billed at its own Cargo’s rate. If variant A costs 1 credit and variant B costs 2 credits, the Consumer pays based on whichever variant their job was routed to.
## Statistical Significance
The platform automatically computes statistical significance for each comparison:
- p-values for success rate (z-test) and latency (t-test)
- Confidence intervals showing the likely range of difference
- Winner determination: control, challenger, or inconclusive
- Sample size estimate: how many more jobs are needed for a conclusive result
View these on the dashboard at /ab-tests/{id} or via the results API.
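The success-rate comparison is a standard pooled two-proportion z-test; a self-contained sketch (the platform's exact computation isn't shown here, and this version assumes large enough samples for the normal approximation):

```python
import math

def success_rate_z_test(s1, n1, s2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = s1 / n1, s2 / n2
    p_pool = (s1 + s2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value via the standard normal survival function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Roughly the example above: 99.25% of 800 vs 98.5% of 200 succeeded.
z, p = success_rate_z_test(794, 800, 197, 200)  # p ≈ 0.32: not yet significant
```

A p-value this large is exactly the "inconclusive" outcome: the observed reliability gap could easily be noise at these sample sizes, which is why the results also report a sample-size estimate.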
## Auto-Promote
When a variant wins with statistical significance (p < 0.05) and has been tested on 100+ jobs, you can auto-promote it — the platform shifts 100% traffic to the winner and completes the test. Available as a button on the dashboard or via the API.
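The promotion gate described above reduces to two conditions; a tiny sketch for clarity (function name is illustrative):

```python
def should_auto_promote(p_value, winner_job_count, alpha=0.05, min_jobs=100):
    """Auto-promote gate: statistically significant win AND enough jobs observed."""
    return p_value < alpha and winner_job_count >= min_jobs
```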
## Consistent User Experience
The same user always sees the same variant throughout a test (sticky sessions). This prevents confusion from seeing different model responses on repeat visits.
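Sticky assignment is commonly implemented by hashing the (user, test) pair into a fixed bucket, so no per-user state is needed. A sketch of that approach — the platform's actual mechanism isn't documented here:

```python
import hashlib

def sticky_variant(user_id, test_id, variants):
    """Deterministic bucketing: the same (user, test) pair always maps to the same variant."""
    digest = hashlib.sha256(f"{test_id}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    cumulative = 0
    for v in variants:
        cumulative += v["traffic_percent"]
        if bucket < cumulative:
            return v["name"]
    return variants[-1]["name"]

variants = [
    {"name": "control", "traffic_percent": 80},
    {"name": "challenger", "traffic_percent": 20},
]
```

Because the hash includes the test id, the same user can land in different variants across different tests while staying consistent within each one.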
## Dashboard
Monitor your A/B tests in real-time at app.archipelag.io/ab-tests:
- Per-variant metric cards (latency, success rate, tokens)
- Statistical significance analysis with p-values and confidence intervals
- Lifecycle controls (activate, pause, promote, complete)
- Auto-refreshing every 5 seconds
## Use Cases
| Scenario | Example |
|---|---|
| Model comparison | Mistral 7B vs Llama 8B — which is faster/better? |
| Prompt engineering | Same model, different system prompts — which produces better output? |
| Quantization testing | Q4 vs Q8 of the same model — quality vs speed tradeoff |
| New model rollout | Route 5% to the new model, 95% to the proven one. Gradually increase. |
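For the gradual-rollout scenario, the ramp can be driven by a simple schedule, with each step applied via however your deployment updates a test's traffic split. The step sizes here are illustrative, not a recommendation:

```python
def ramp_schedule(start=5, step=10, cap=100):
    """Yield increasing challenger traffic percentages for a gradual rollout."""
    pct = start
    while pct < cap:
        yield pct
        pct = min(pct + step, cap)
    yield cap  # finish with full traffic on the new model
```

In practice you would only advance to the next step after the current one has accumulated enough jobs with healthy metrics.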
