Inference Cache
Automatic caching of similar prompts — cache hits are instant and free
The coordinator automatically caches job responses and serves them for similar future prompts. Cache hits bypass Island dispatch entirely — the response is instant (0ms inference) and free (no compute credits charged).
How It Works
Consumer submits prompt: "What is quantum computing?"
                  │
        Coordinator checks cache
                  │
      ┌───────────┴────────────────┐
      │                            │
  Cache HIT:                   Cache MISS
  exact match (SHA-256 hash)       │
  or semantic match (85%+          │
  token similarity)                │
      │                            │
      ▼                            ▼
Return cached response      Dispatch to Island;
(instant, free)             cache the result for
                            future similar prompts
What Gets Cached
Every successful job response is automatically cached, keyed by the combination of:
- Cargo (workload ID) — responses are never matched across different Cargos
- Input — the prompt text, normalized for whitespace and case
Two matching strategies:
- Exact: identical prompt → instant O(1) lookup
- Semantic: 85%+ word overlap → catches near-duplicate prompts like “explain quantum computing” and “explain quantum computing please”
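The two strategies can be sketched as follows. This is an illustration, not the coordinator's actual implementation: the function names are hypothetical, and the overlap measure (shared words relative to the smaller prompt) is an assumption chosen so that the near-duplicate example above clears the 85% threshold.

```python
import hashlib

def normalize(prompt: str) -> str:
    # Normalize whitespace and case, as described for the cache key.
    return " ".join(prompt.lower().split())

def exact_key(cargo_id: str, prompt: str) -> str:
    # Exact strategy: O(1) lookup keyed by Cargo + normalized prompt hash.
    # Including the Cargo ID ensures no cross-Cargo matches.
    return hashlib.sha256(f"{cargo_id}:{normalize(prompt)}".encode()).hexdigest()

def token_similarity(a: str, b: str) -> float:
    # Semantic strategy (assumed measure): word overlap relative to the
    # smaller prompt; 85%+ counts as a cache hit.
    wa, wb = set(normalize(a).split()), set(normalize(b).split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))
```

With this measure, "explain quantum computing" and "explain quantum computing please" share all three words of the shorter prompt, scoring 1.0 and hitting the cache.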
Cache Behavior
| Aspect | Value |
|---|---|
| Default TTL | 1 hour |
| Max entries | 10,000 |
| Similarity threshold | 85% token overlap |
| Eviction | Least-recently-used |
| Scope | Per Cargo (no cross-matching) |
Billing
Cache hits are free — no credits are charged because no Island compute is used. This can reduce costs by 30-60% for workloads with repetitive prompts (support chatbots, FAQ systems, common questions).
Transparency
Cached responses include metadata so you can tell when a response was served from the cache:
{
"id": "job-uuid",
"state": "succeeded",
"metadata": {
"cache_hit": true,
"cache_strategy": "semantic"
},
"output": {"text": "Quantum computing uses qubits..."}
}
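A consumer can read that metadata from the job response, for example with a small helper like this (the helper itself is hypothetical; the field names match the response above):

```python
def cache_info(job: dict) -> tuple:
    # Returns (was_cache_hit, strategy) from a job response's metadata.
    # strategy is "exact", "semantic", or None on a cache miss.
    meta = job.get("metadata", {})
    return bool(meta.get("cache_hit")), meta.get("cache_strategy")
```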
Cache Stats
Monitor cache performance:
GET /api/v1/cache/stats
{
"hits": 1247,
"misses": 3891,
"size": 2456,
"hit_rate_percent": 24.3
}
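The hit rate is simply hits over total lookups, rounded to one decimal place; checking it against the example payload:

```python
def hit_rate_percent(hits: int, misses: int) -> float:
    # hit_rate = hits / (hits + misses), expressed as a percentage.
    total = hits + misses
    return round(100 * hits / total, 1) if total else 0.0
```

For the numbers above, 1247 / (1247 + 3891) ≈ 24.3%.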
When Caching Helps Most
| Use Case | Expected Hit Rate |
|---|---|
| Customer support chatbot | 40-60% (many repeated questions) |
| FAQ / documentation Q&A | 50-70% (finite question set) |
| Code generation (common patterns) | 20-30% (similar boilerplate requests) |
| Creative writing | 5-10% (unique prompts, low overlap) |
| Image generation | <5% (unique descriptions) |
Per-Cargo Configuration
Caching is automatic for all Cargos by default. Cargo publishers can customize caching behavior:
{
"shard_manifest": {
"cache_policy": {
"enabled": true,
"ttl_seconds": 1800,
"similarity_threshold": 0.90
}
}
}
Set enabled: false to disable caching for a Cargo (useful for real-time data or personalized responses). Adjust similarity_threshold to control how similar prompts need to be for a cache hit.
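One way to picture the override: the manifest's cache_policy is layered over the documented defaults. The field names below follow the example manifest; the merge logic itself is an assumption about coordinator behavior, shown for illustration only.

```python
# Documented defaults: caching on, 1-hour TTL, 85% similarity threshold.
DEFAULTS = {"enabled": True, "ttl_seconds": 3600, "similarity_threshold": 0.85}

def effective_cache_policy(shard_manifest: dict) -> dict:
    # Fields present in cache_policy override defaults; missing fields keep them.
    policy = dict(DEFAULTS)
    policy.update(shard_manifest.get("cache_policy", {}))
    return policy
```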
Consumers don’t need to change their API calls — the coordinator handles everything transparently. The only visible difference is faster responses and lower costs.
