Inference Cache
Automatic caching of similar prompts — cache hits are instant and free
The coordinator automatically caches job responses and serves them for similar future prompts. Cache hits bypass Island dispatch entirely — the response is instant (0ms inference) and free (no compute credits charged).
How It Works
Consumer submits prompt: "What is quantum computing?"
                  │
        Coordinator checks cache
                  │
      ┌───────────┴────────────────┐
      │                            │
  Cache HIT:                   Cache MISS
  exact match (SHA-256 hash)       │
  or semantic match (85%+          │
  token similarity)                │
      │                            │
      ▼                            ▼
Return cached response      Dispatch to Island;
(instant, free)             cache the result for
                            future similar prompts
What Gets Cached
Every successful job response is automatically cached, keyed by the combination of:
- Cargo (workload ID) — responses are never matched across different Cargos
- Input — the prompt text, normalized for whitespace and case
Two matching strategies:
- Exact: identical prompt → instant O(1) lookup
- Semantic: 85%+ word overlap → catches near-duplicate prompts like “explain quantum computing” and “explain quantum computing please”
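The two strategies can be sketched as follows. This is an illustration, not the coordinator's actual implementation: the function names are hypothetical, and the overlap measure (shared words relative to the smaller prompt) is an assumption chosen so that the near-duplicate example above clears the 85% threshold.

```python
import hashlib

def normalize(prompt: str) -> str:
    # Normalize whitespace and case, as described for the cache key.
    return " ".join(prompt.lower().split())

def exact_key(cargo_id: str, prompt: str) -> str:
    # Exact strategy: O(1) lookup keyed by Cargo + normalized prompt hash.
    # Including the Cargo ID ensures no cross-Cargo matches.
    return hashlib.sha256(f"{cargo_id}:{normalize(prompt)}".encode()).hexdigest()

def token_similarity(a: str, b: str) -> float:
    # Semantic strategy (assumed measure): word overlap relative to the
    # smaller prompt; 85%+ counts as a cache hit.
    wa, wb = set(normalize(a).split()), set(normalize(b).split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))
```

With this measure, "explain quantum computing" and "explain quantum computing please" share all three words of the shorter prompt, scoring 1.0 and hitting the cache.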
Cache Behavior
| Aspect | Value |
|---|---|
| Default TTL | 1 hour |
| Max entries | 10,000 |
| Similarity threshold | 85% token overlap |
| Eviction | Least-recently-used |
| Scope | Per Cargo (no cross-matching) |
Billing
Cache hits are free — no credits are charged because no Island compute is used. This can reduce costs by 30-60% for workloads with repetitive prompts (support chatbots, FAQ systems, common questions).
Transparency
Cached responses include metadata so you can tell when a response was served from the cache:
{
"id": "job-uuid",
"state": "succeeded",
"metadata": {
"cache_hit": true,
"cache_strategy": "semantic"
},
"output": {"text": "Quantum computing uses qubits..."}
}
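A consumer can read that metadata from the job response, for example with a small helper like this (the helper itself is hypothetical; the field names match the response above):

```python
def cache_info(job: dict) -> tuple:
    # Returns (was_cache_hit, strategy) from a job response's metadata.
    # strategy is "exact", "semantic", or None on a cache miss.
    meta = job.get("metadata", {})
    return bool(meta.get("cache_hit")), meta.get("cache_strategy")
```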
Cache Stats
Monitor cache performance:
GET /api/v1/cache/stats
{
"hits": 1247,
"misses": 3891,
"size": 2456,
"hit_rate_percent": 24.3
}
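The hit rate is simply hits over total lookups, rounded to one decimal place; checking it against the example payload:

```python
def hit_rate_percent(hits: int, misses: int) -> float:
    # hit_rate = hits / (hits + misses), expressed as a percentage.
    total = hits + misses
    return round(100 * hits / total, 1) if total else 0.0
```

For the numbers above, 1247 / (1247 + 3891) ≈ 24.3%.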
When Caching Helps Most
| Use Case | Expected Hit Rate |
|---|---|
| Customer support chatbot | 40-60% (many repeated questions) |
| FAQ / documentation Q&A | 50-70% (finite question set) |
| Code generation (common patterns) | 20-30% (similar boilerplate requests) |
| Creative writing | 5-10% (unique prompts, low overlap) |
| Image generation | <5% (unique descriptions) |
Per-Cargo Configuration
Caching is automatic for all Cargos by default. Cargo publishers can customize caching behavior:
{
"shard_manifest": {
"cache_policy": {
"enabled": true,
"ttl_seconds": 1800,
"similarity_threshold": 0.90
}
}
}
Set enabled: false to disable caching for a Cargo (useful for real-time data or personalized responses). Adjust similarity_threshold to control how similar prompts need to be for a cache hit.
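One way to picture the override: the manifest's cache_policy is layered over the documented defaults. The field names below follow the example manifest; the merge logic itself is an assumption about coordinator behavior, shown for illustration only.

```python
# Documented defaults: caching on, 1-hour TTL, 85% similarity threshold.
DEFAULTS = {"enabled": True, "ttl_seconds": 3600, "similarity_threshold": 0.85}

def effective_cache_policy(shard_manifest: dict) -> dict:
    # Fields present in cache_policy override defaults; missing fields keep them.
    policy = dict(DEFAULTS)
    policy.update(shard_manifest.get("cache_policy", {}))
    return policy
```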
Consumers don’t need to change their API calls — the coordinator handles everything transparently. The only visible difference is faster responses and lower costs.
