Cache Saver

A Modular Framework for Efficient, Affordable,
and Reproducible LLM Inference

Nearchos Potamitis1, Lars Klein2, Bardia Mohammadi3, Chongyang Xu3,
Attreyee Mukherjee3, Laurent Bindschaedler3, Niket Tandon4, Akhil Arora1
1Aarhus University · 2EPFL · 3MPI SWS · 4Microsoft Research
~25% cost reduction · ~35% CO2 reduction · up to ~60% savings in practical scenarios

Abstract

Cache Saver is a modular, plug-and-play, and asynchronous framework that facilitates high-level inference optimizations, integrating cleanly into existing systems without requiring changes to the end-user application logic or the underlying LLM. At its heart is a namespace-aware list-valued cache that ensures statistical integrity of LLM responses by generating i.i.d. responses within a namespace while enabling response reuse across namespaces, all while guaranteeing full reproducibility. On average across five reasoning strategies, five benchmark tasks, and three LLMs, Cache Saver reduces cost by ~25% and CO2 emissions by ~35%. In practical scenarios such as benchmarking and ablation analysis, savings reach up to 60%.

Quick Start

$ pip install cachesaver

Just change the import — everything else stays the same.

Before
from openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{
        "role": "user",
        "content": "What is the capital of France?"
    }],
)
print(response.choices[0].message.content)
After
from cachesaver.models.openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{
        "role": "user",
        "content": "What is the capital of France?"
    }],
)
print(response.choices[0].message.content)

Key Features

This Is Not Your Typical KV Cache

Naive caches map each prompt to a single response — reusing that response destroys the statistical properties of the generative model. Cache Saver introduces a fundamentally different paradigm: a namespace-aware list-valued cache managed through stochastic coupling. Within a namespace, every response to a given prompt is guaranteed to be i.i.d. — never reused within the same experiment. Across namespaces, responses are reused, driving the cost savings without sacrificing the statistical integrity of your results.
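To make this concrete, here is a minimal sketch of the idea (illustrative only, not Cache Saver's actual implementation): each prompt maps to a list of responses, and a per-namespace counter records how many of them that namespace has already consumed.

List-Valued Cache (Sketch)
from collections import defaultdict

class ListValuedCache:
    def __init__(self, model_call):
        self.model_call = model_call        # async fn: prompt -> fresh sample
        self.responses = defaultdict(list)  # prompt -> [response, ...]
        self.consumed = defaultdict(int)    # (namespace, prompt) -> count

    async def sample(self, prompt, namespace):
        used = self.consumed[(namespace, prompt)]
        if used < len(self.responses[prompt]):
            # Reuse a cached response this namespace has never seen.
            response = self.responses[prompt][used]
        else:
            # Nothing unseen left for this namespace: draw a fresh i.i.d. sample.
            response = await self.model_call(prompt)
            self.responses[prompt].append(response)
        self.consumed[(namespace, prompt)] += 1
        return response

Re-running with the same namespace replays the cached list in order (reproducibility), while a fresh namespace reuses the entire list before paying for new samples (cost savings).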


Reproducibility

Namespaces track which cached responses have been consumed, so re-running an experiment from scratch replays the exact same results in the exact same order — even for duplicate prompts.

# Run 1 — calls the API
results_run1 = await classify(sentences, namespace="experiment_v1")

# Run 2 — same namespace, identical results from cache
results_run2 = await classify(sentences, namespace="experiment_v1")
assert results_run1 == results_run2  # Always true

Error Recovery

Crash on item 7 of 10? Re-run and items 1–6 are served from cache instantly. Only items 7–10 hit the API.

# Attempt 1 — crashes at item 7
try:
    results = await process(items, namespace="my_exp")
except RuntimeError:
    pass  # Items 1-6 are cached

# Attempt 2 — items 1-6 from cache, only 7-10 call API
results = await process(items, namespace="my_exp")

Async Parallelism

Fully async-native. Use asyncio.gather for concurrent requests — Cache Saver handles batching, deduplication, and caching transparently.

Concurrent Requests
import asyncio

results = await asyncio.gather(*[
    client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": prompt}],
    )
    for prompt in prompts
])

Request Deduplication

Duplicate prompts within a batch are automatically merged, reducing redundant API calls. Responses are redistributed to all original requesters while preserving correctness.
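As an illustration, a deduplicator can group requests by (prompt, namespace), issue one upstream call per group, and fan the responses back out to the original positions. The helper below is a hypothetical sketch, not the library's internals; fetch_n stands in for whatever fetches n i.i.d. responses for a prompt.

Deduplication (Sketch)
import asyncio

async def dedupe_batch(requests, fetch_n):
    # requests: list of (prompt, namespace) pairs
    # fetch_n: async fn (prompt, namespace, n) -> list of n responses
    groups = {}
    for i, key in enumerate(requests):
        groups.setdefault(key, []).append(i)  # positions sharing this prompt

    results = [None] * len(requests)

    async def fill(key, positions):
        prompt, namespace = key
        # One upstream call covers every duplicate of this prompt.
        responses = await fetch_n(prompt, namespace, len(positions))
        for pos, resp in zip(positions, responses):
            results[pos] = resp               # redistribute to each requester

    await asyncio.gather(*(fill(k, p) for k, p in groups.items()))
    return results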


Deterministic Ordering

When multiple async tasks process the same prompt concurrently, Cache Saver caches by request content — not completion order. A built-in reordering module ensures replays are deterministic regardless of which task finishes first.
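The sketch below illustrates the idea (hypothetical code, not the shipped module): requests are handled in a canonical, content-derived order, and results are written back to the caller's original positions, so cache consumption no longer depends on which task happens to finish first. For clarity this version processes sequentially; Cache Saver itself is fully async.

Deterministic Ordering (Sketch)
async def process_in_canonical_order(requests, handle):
    # handle: async fn request -> response
    # Sort by a stable, content-derived key instead of arrival order.
    canonical = sorted(enumerate(requests), key=lambda pair: repr(pair[1]))
    results = [None] * len(requests)
    for original_pos, request in canonical:
        results[original_pos] = await handle(request)
    return results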

Key Results

Multi-step reasoning strategies (Tree-of-Thought, ReAct, RAP, FoA, ReST-MCTS*, etc.) are highly repetitive — ~50% of prompts are duplicates both within a single method execution and across methods on the same task. Cache Saver exploits this redundancy across three practical scenarios:

Figure: Cost, tokens, latency, and throughput across three practical scenarios (hyperparameter tuning, ablation analysis, and benchmarking), using GPT-4.1-Nano on Game of 24, HumanEval, and SciBench.

The figure shows Cache Saver's impact across three practical ML scenarios. A1-Hyperparameter tuning: grid search over Tree-of-Thought configurations (tree width, depth, number of evaluations). A2-Ablation analysis: testing three variations of the FoA algorithm (removing the selection phase, backtracking, or resampling). A3-Benchmarking: comparing entirely different reasoning strategies (ToT, GoT, FoA).

The blue bars show the cost without Cache Saver. The orange bars show the average cost with Cache Saver. Because experiments share prompts, cached responses are reused and the average cost drops significantly. The green bars show the marginal cost, that is, the added cost of incorporating one more configuration, variation, or method into the experiment.

The reuse potential depends on how similar the experiments are: hyperparameter tuning (A1) achieves the highest savings (6x lower cost, tokens, and latency) since different configurations of the same method share most prompts. Ablation analysis (A2) achieves 2.5x savings. Finally, benchmarking across different methods (A3) still achieves 2x savings, a notable finding since even structurally different reasoning strategies share significant prompt overlap. These savings are on top of existing platform-level optimizations (paged attention, KV caching, prefix sharing, etc.).

Examples

Tutorial

Full walkthrough: quickstart, reproducibility, error recovery, parallelism, ReAct agents, Tree-of-Thought, and RAG pipelines.

View Notebook

Provider Examples

Usage examples for all supported providers: OpenAI, Anthropic, Gemini, Together, Groq, OpenRouter, and more.

View Notebook

Architecture

Cache Saver composes four async pipeline components around your model:

Cacher: Namespace-aware list-valued cache with per-key async mutexes. Tracks per-namespace usage counts for i.i.d. sampling.
Deduplicator: Merges duplicate prompts within a batch by (hash, namespace), combines n values, and redistributes responses.
Reorderer: Sorts requests by a stable identifier before processing and restores the original order afterward. Ensures deterministic results.
Batcher: Async producer-consumer queue. Groups requests by batch size with a timeout.
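The Batcher is the only stage not sketched above. One plausible way to build it (illustrative code, not the shipped implementation) is an asyncio queue whose worker flushes either when batch_size requests have accumulated or when the timeout expires, whichever comes first.

Batcher (Sketch)
import asyncio

class Batcher:
    # Construct inside a running event loop.
    def __init__(self, handle_batch, batch_size=8, timeout=0.1):
        self.handle_batch = handle_batch  # async fn: list[request] -> list[response]
        self.batch_size = batch_size
        self.timeout = timeout
        self.queue = asyncio.Queue()
        self.worker = asyncio.create_task(self._run())

    async def submit(self, request):
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((request, future))
        return await future               # resolved once the batch completes

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]          # wait for the first item
            deadline = loop.time() + self.timeout
            while len(batch) < self.batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            responses = await self.handle_batch([r for r, _ in batch])
            for (_, future), response in zip(batch, responses):
                future.set_result(response)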

Supported Providers

All cloud providers share the same interface as their original SDK. Just change the import.

OpenAI
from cachesaver.models.openai import AsyncOpenAI, OpenAI
Anthropic
from cachesaver.models.anthropic import AsyncAnthropic, Anthropic
Google Gemini
from cachesaver.models.gemini import AsyncGemini, Gemini
Together AI
from cachesaver.models.together import AsyncTogether
Groq
from cachesaver.models.groq import AsyncGroq, Groq
OpenRouter
from cachesaver.models.openrouter import AsyncOpenRouter, OpenRouter
HuggingFace (Inference Providers)
from cachesaver.models.huggingface import AsyncHuggingFace, HuggingFace
vLLM
from cachesaver.models.vllm import AsyncVLLM, VLLM
HuggingFace (Transformers)
from cachesaver.models.transformers import AsyncHFTransformers, HFTransformers

BibTeX

@inproceedings{potamitis2025cache,
  title={Cache Saver: A Modular Framework for Efficient, Affordable, and Reproducible {LLM} Inference},
  author={Nearchos Potamitis and Lars Henning Klein and Bardia Mohammadi and Chongyang Xu and Attreyee Mukherjee and Niket Tandon and Laurent Bindschaedler and Akhil Arora},
  booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  url={https://openreview.net/forum?id=2Nxih3ySSi}
}

Cache Saver — Aarhus University · EPFL · MPI SWS · Microsoft Research
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.