
Inference Cache

Automatic caching of similar prompts — cache hits are instant and free

Experimental
Inference caching code exists but has not been verified at scale in production. Cache behavior and configuration may change during beta.

The coordinator automatically caches job responses and serves them for similar future prompts. Cache hits bypass Island dispatch entirely — the response is instant (0ms inference) and free (no compute credits charged).

How It Works

Consumer submits prompt: "What is quantum computing?"
    │
Coordinator checks cache:
    │
┌───┴─── Cache HIT ───────────────┐
│                                  │
│  Exact match (SHA256 hash)       │ Cache MISS
│  or semantic match (85%+ token   │
│  similarity)                     │
│                                  │
▼                                  ▼
Return cached response          Dispatch to Island
(instant, free)                 Cache the result for future
                                similar prompts

What Gets Cached

Every successful job response is automatically cached, keyed by the combination of:

  • Cargo (workload ID) — responses are never matched across different Cargos
  • Input — the prompt text, normalized for whitespace and case

Two matching strategies:

  • Exact: identical prompt → instant O(1) lookup
  • Semantic: 85%+ word overlap → catches near-duplicate prompts like “explain quantum computing” and “explain quantum computing please”
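The two strategies above could look roughly like the sketch below. The hash scheme and the overlap metric are illustrative assumptions, not the coordinator's actual implementation; in particular, the overlap here is measured against the smaller prompt's word set, which is one way the documented example ("explain quantum computing" vs. "explain quantum computing please") clears an 85% threshold.

```python
import hashlib

def normalize(prompt: str) -> str:
    """Normalize for whitespace and case, as described above."""
    return " ".join(prompt.lower().split())

def exact_key(cargo_id: str, prompt: str) -> str:
    """Exact-match key: a SHA256 hash scoped to one Cargo (assumed scheme)."""
    return hashlib.sha256(f"{cargo_id}:{normalize(prompt)}".encode()).hexdigest()

def token_overlap(a: str, b: str) -> float:
    """Word overlap relative to the smaller prompt (overlap coefficient).

    This metric is an assumption; it is chosen so near-duplicates that
    merely append words (e.g. "... please") score close to 1.0.
    """
    wa, wb = set(normalize(a).split()), set(normalize(b).split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))
```

With this metric, a semantic hit is simply `token_overlap(new_prompt, cached_prompt) >= 0.85`, checked only among entries for the same Cargo.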

Cache Behavior

Aspect                  Value
Default TTL             1 hour
Max entries             10,000
Similarity threshold    85% token overlap
Eviction                Least-recently-used
Scope                   Per Cargo (no cross-matching)
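The TTL, size cap, and LRU eviction in the table combine naturally into one structure. The following is a minimal sketch of that behavior (class and method names are our own, not the coordinator's API):

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Sketch of the table above: TTL expiry, a max-entry cap, LRU eviction."""

    def __init__(self, max_entries: int = 10_000, ttl_seconds: int = 3600):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        expires_at, value = item
        if time.time() >= expires_at:
            del self._store[key]      # expired entry counts as a miss
            return None
        self._store.move_to_end(key)  # mark as recently used
        return value

    def put(self, key, value):
        self._store[key] = (time.time() + self.ttl, value)
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least-recently-used entry
```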

Billing

Cache hits are free — no credits are charged because no Island compute is used. This can reduce costs by 30-60% for workloads with repetitive prompts (support chatbots, FAQ systems, common questions).
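Because only misses are billed, expected compute cost scales with the miss rate. A quick back-of-the-envelope check (the numbers are illustrative):

```python
def effective_cost(credits_without_cache: float, hit_rate: float) -> float:
    """Only cache misses dispatch to an Island, so only misses are billed."""
    return credits_without_cache * (1.0 - hit_rate)

# A support chatbot at a 50% hit rate pays half the compute credits
# it would without caching, consistent with the 30-60% range above.
```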

Transparency

Cached responses include metadata so you can tell when a response was served from cache:

{
  "id": "job-uuid",
  "state": "succeeded",
  "metadata": {
    "cache_hit": true,
    "cache_strategy": "semantic"
  },
  "output": {"text": "Quantum computing uses qubits..."}
}
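A client can inspect that metadata after parsing the job response. A minimal sketch, assuming the response has already been decoded into a dict shaped like the JSON above:

```python
# Hypothetical parsed job response, shaped like the example above.
response = {
    "id": "job-uuid",
    "state": "succeeded",
    "metadata": {"cache_hit": True, "cache_strategy": "semantic"},
    "output": {"text": "Quantum computing uses qubits..."},
}

meta = response.get("metadata", {})
if meta.get("cache_hit"):
    # e.g. skip latency accounting, or log the match strategy
    print(f"served from cache ({meta.get('cache_strategy')} match)")
```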

Cache Stats

Monitor cache performance:

GET /api/v1/cache/stats
{
  "hits": 1247,
  "misses": 3891,
  "size": 2456,
  "hit_rate_percent": 24.3
}
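The hit rate in the payload is derived from the counters it ships with, so you can recompute it client-side:

```python
# Counters from the stats payload above.
hits, misses = 1247, 3891

# hit_rate_percent = hits / (hits + misses), as a percentage to one decimal
hit_rate_percent = round(100 * hits / (hits + misses), 1)
# -> 24.3, matching the hit_rate_percent field
```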

When Caching Helps Most

Use Case                            Expected Hit Rate
Customer support chatbot            40-60% (many repeated questions)
FAQ / documentation Q&A             50-70% (finite question set)
Code generation (common patterns)   20-30% (similar boilerplate requests)
Creative writing                    5-10% (unique prompts, low overlap)
Image generation                    <5% (unique descriptions)

Per-Cargo Configuration

Caching is automatic for all Cargos by default. Cargo publishers can customize caching behavior:

{
  "shard_manifest": {
    "cache_policy": {
      "enabled": true,
      "ttl_seconds": 1800,
      "similarity_threshold": 0.90
    }
  }
}

Set enabled: false to disable caching for a Cargo (useful for real-time data or personalized responses). Adjust similarity_threshold to control how similar prompts need to be for a cache hit.

Consumers don’t need to change their API calls — the coordinator handles everything transparently. The only visible difference is faster responses and lower costs.