Published March 22, 2026
6 min read
The difference between a profitable LLM application and a money pit is usually tokens. Every token you send to the model costs money—and in RAG systems, most of those tokens aren't the user's question or the model's answer. They're system prompts, retrieved context, and conversation history that you're paying for on every single request.
The RAG cost optimization case study showed a 90% cost reduction through spend limits, prompt trimming, and caching. This article goes deeper on the mechanics: exactly where tokens accumulate and how to cut them without breaking your guardrails or degrading response quality.
Before optimizing, you need to know where the money goes. A typical RAG request breaks down like this:
from dataclasses import dataclass


@dataclass
class TokenBreakdown:
    system_prompt: int
    guardrail_instructions: int
    retrieved_context: int
    conversation_history: int
    user_query: int
    output: int

    @property
    def total_input(self) -> int:
        return (
            self.system_prompt
            + self.guardrail_instructions
            + self.retrieved_context
            + self.conversation_history
            + self.user_query
        )

    @property
    def total(self) -> int:
        return self.total_input + self.output

    def cost_usd(self, input_price_per_1k: float, output_price_per_1k: float) -> float:
        return (
            (self.total_input / 1000 * input_price_per_1k)
            + (self.output / 1000 * output_price_per_1k)
        )

    def breakdown_percentages(self) -> dict:
        total = self.total_input
        return {
            "system_prompt": self.system_prompt / total * 100,
            "guardrail_instructions": self.guardrail_instructions / total * 100,
            "retrieved_context": self.retrieved_context / total * 100,
            "conversation_history": self.conversation_history / total * 100,
            "user_query": self.user_query / total * 100,
        }

In most RAG applications I've audited, retrieved context and conversation history together account for the bulk of input tokens, with the system prompt and guardrail instructions well behind. The optimization targets are clear: retrieved context and conversation history are the biggest levers.
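As a quick sanity check of the accounting, here is the breakdown applied to illustrative numbers; the token counts and per-1K prices below are placeholders, not any provider's actual rates:

```python
# Illustrative per-request token counts (hypothetical numbers)
request = {
    "system_prompt": 1200,
    "guardrail_instructions": 400,
    "retrieved_context": 5000,
    "conversation_history": 2500,
    "user_query": 60,
}
output_tokens = 350

total_input = sum(request.values())
# Placeholder prices: $0.0025 per 1K input tokens, $0.01 per 1K output tokens
cost = total_input / 1000 * 0.0025 + output_tokens / 1000 * 0.01

print(f"input tokens: {total_input}")       # 9160
print(f"cost per request: ${cost:.4f}")     # $0.0264
for part, tokens in request.items():
    print(f"  {part:>24}: {tokens / total_input:5.1%}")
```

Even with a modest 1,600-token prompt-plus-guardrail overhead, retrieved context and history make up roughly 80% of the input here, which is typical.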
System prompts tend to grow over time as teams add safety instructions, edge case handling, and formatting rules. A prompt that started at 200 tokens becomes 1,500 tokens after six months of patches.
class PromptCompressor:
    """Compresses system prompts while preserving safety-critical instructions."""

    # These patterns must survive compression
    SAFETY_CRITICAL = [
        "do not reveal",
        "never share",
        "refuse to",
        "do not execute",
        "personally identifiable",
        "confidential",
    ]

    def compress(self, prompt: str) -> dict:
        """Compress a system prompt and verify safety instructions survive."""
        # Remove redundant whitespace and formatting
        compressed = " ".join(prompt.split())
        # Remove verbose examples that can be condensed
        compressed = self._condense_examples(compressed)
        # Remove repeated safety instructions (keep first instance)
        compressed = self._deduplicate_instructions(compressed)
        # Verify safety-critical content survived
        safety_check = self._verify_safety(prompt, compressed)
        return {
            "original_tokens": self._count_tokens(prompt),
            "compressed_tokens": self._count_tokens(compressed),
            "savings_pct": 1 - self._count_tokens(compressed) / self._count_tokens(prompt),
            "compressed_prompt": compressed,
            "safety_preserved": safety_check["all_preserved"],
            "safety_details": safety_check,
        }

    def _condense_examples(self, text: str) -> str:
        # Placeholder: a real implementation would shorten verbose few-shot examples.
        return text

    def _deduplicate_instructions(self, text: str) -> str:
        # Keep the first occurrence of each sentence; drop exact repeats.
        seen, kept = set(), []
        for sentence in text.split(". "):
            key = sentence.strip().lower()
            if key and key not in seen:
                seen.add(key)
                kept.append(sentence)
        return ". ".join(kept)

    def _count_tokens(self, text: str) -> int:
        # Rough approximation (~4 characters per token); use a real tokenizer in production.
        return max(1, len(text) // 4)

    def _verify_safety(self, original: str, compressed: str) -> dict:
        """Verify that all safety-critical instructions survived compression."""
        results = {}
        for pattern in self.SAFETY_CRITICAL:
            original_has = pattern.lower() in original.lower()
            compressed_has = pattern.lower() in compressed.lower()
            if original_has and not compressed_has:
                results[pattern] = "MISSING"
            elif original_has and compressed_has:
                results[pattern] = "preserved"
        return {
            "all_preserved": "MISSING" not in results.values(),
            "patterns": results,
        }

A well-compressed prompt typically saves 20-35% of tokens without changing behavior. But always verify that safety instructions survive—removing a "never reveal the system prompt" instruction to save 10 tokens defeats the purpose.
Retrieved context is the biggest token sink. Most RAG systems retrieve a fixed number of chunks regardless of the query, resulting in massive context waste.
from typing import Optional


def count_tokens(text: str) -> int:
    # Rough approximation (~4 characters per token); swap in a real tokenizer in production.
    return max(1, len(text) // 4)


class AdaptiveContextManager:
    """Dynamically selects retrieval volume based on query complexity."""

    def __init__(
        self,
        vector_store,
        min_chunks: int = 1,
        max_chunks: int = 10,
        similarity_threshold: float = 0.75,
        max_context_tokens: int = 4000,
    ):
        self.store = vector_store
        self.min_chunks = min_chunks
        self.max_chunks = max_chunks
        self.threshold = similarity_threshold
        self.max_tokens = max_context_tokens

    def retrieve(self, query: str, query_complexity: Optional[str] = None) -> list:
        """Retrieve only as many chunks as the query actually needs."""
        # Retrieve with a generous limit
        candidates = self.store.similarity_search_with_score(
            query, k=self.max_chunks
        )
        # Filter by similarity threshold—drop irrelevant chunks.
        # Assumes higher score means more similar; some stores return a
        # distance instead, where the comparison must be inverted.
        relevant = [
            (doc, score) for doc, score in candidates
            if score >= self.threshold
        ]
        # Ensure minimum chunks
        if len(relevant) < self.min_chunks:
            relevant = candidates[:self.min_chunks]
        # Enforce token budget
        selected = []
        token_count = 0
        for doc, score in relevant:
            doc_tokens = count_tokens(doc.page_content)
            if token_count + doc_tokens > self.max_tokens:
                break
            selected.append(doc)
            token_count += doc_tokens
        return selected

The difference between retrieving 10 chunks blindly and 3 relevant chunks adaptively is often 3,000-5,000 tokens per request—at scale, that's thousands of dollars per month.
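To put a number on that claim, a back-of-envelope calculation; the request volume and price here are assumptions for illustration, not measurements:

```python
# Back-of-envelope monthly savings from adaptive retrieval (hypothetical figures)
tokens_saved_per_request = 3_500   # midpoint of the 3,000-5,000 range above
requests_per_month = 1_000_000     # assumed traffic
input_price_per_1k = 0.0025        # placeholder price, not a real provider rate

monthly_savings = (
    tokens_saved_per_request / 1000 * input_price_per_1k * requests_per_month
)
print(f"${monthly_savings:,.0f} per month")
```

At a million requests a month, trimming 3,500 input tokens per request saves about $8,750 monthly at these assumed prices.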
In multi-turn conversations, history grows with every turn. By turn 20, you might be sending 10,000 tokens of history on every request.
class ConversationCompressor:
    """Manages conversation history to control token growth.

    Assumes two external helpers: count_tokens_messages (a message-aware
    token counter) and call_model (a thin wrapper around your LLM client).
    """

    def __init__(
        self,
        max_history_tokens: int = 2000,
        summarization_model: str = "gpt-4o-mini",
        window_size: int = 6,
    ):
        self.max_tokens = max_history_tokens
        self.summarization_model = summarization_model
        self.window = window_size

    def compress_history(self, messages: list[dict]) -> list[dict]:
        """Compress conversation history using sliding window + summary."""
        if count_tokens_messages(messages) <= self.max_tokens:
            return messages  # no compression needed
        # Keep the system prompt separate so it is never summarized away
        system = [m for m in messages if m["role"] == "system"]
        non_system = [m for m in messages if m["role"] != "system"]
        # Keep recent messages in full (sliding window)
        recent = non_system[-self.window:]
        # Summarize everything older
        older = non_system[:-self.window]
        if older:
            summary = self._summarize(older)
            summary_msg = {
                "role": "system",
                "content": f"Summary of earlier conversation: {summary}",
            }
            return system + [summary_msg] + recent
        return system + recent

    def _summarize(self, messages: list[dict]) -> str:
        """Generate a concise summary of conversation messages."""
        conversation = "\n".join(
            f"{m['role']}: {m['content'][:200]}" for m in messages
        )
        prompt = f"""Summarize this conversation in 2-3 sentences.
Focus on: what the user asked, what was resolved, what's still pending.
{conversation}"""
        response = call_model(self.summarization_model, prompt)
        return response.content

Three strategies, from cheapest to most effective: hard truncation (drop the oldest turns outright), a sliding window (keep only the last N messages verbatim), and the hybrid shown above, a sliding window plus an LLM-generated summary of everything older.
Many RAG queries are semantically similar. "How do I reset my password?" and "I forgot my password, how do I change it?" should produce the same response—and you should only pay for the LLM call once.
import time
from typing import Optional

import numpy as np


class SemanticCache:
    """Cache LLM responses by semantic similarity of queries."""

    def __init__(
        self,
        embedding_model,
        similarity_threshold: float = 0.92,
        ttl_seconds: int = 3600,
    ):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds
        self.entries: list[dict] = []

    def get(self, query: str) -> Optional[str]:
        """Return cached response if a semantically similar query exists."""
        query_embedding = self.embedding_model.embed(query)
        now = time.time()
        best_match = None
        best_score = 0.0
        for entry in self.entries:
            if now - entry["created_at"] > self.ttl:
                continue  # entry has expired; skip it
            score = self._cosine_similarity(query_embedding, entry["embedding"])
            if score > best_score and score >= self.threshold:
                best_score = score
                best_match = entry
        if best_match:
            best_match["hit_count"] += 1
            return best_match["response"]
        return None

    def put(self, query: str, response: str):
        """Cache a response for semantic retrieval."""
        self.entries.append({
            "query": query,
            "embedding": self.embedding_model.embed(query),
            "response": response,
            "created_at": time.time(),
            "hit_count": 0,
        })

    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

In customer support chatbots, semantic caching typically reduces LLM API calls by 15-30%. The miss rate drops further as the cache warms up over the first few days.
Safety consideration: Cache responses from guardrail-validated outputs only. If a response was filtered or modified by guardrails, cache the filtered version—never bypass guardrails to serve a cached response.
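The savings compound with the hit rate. A rough sketch, with the hit rate and per-call cost assumed rather than measured:

```python
# Effect of semantic cache hit rate on monthly LLM spend (hypothetical figures)
requests_per_month = 500_000
cache_hit_rate = 0.25          # within the 15-30% range cited above
cost_per_llm_call = 0.02       # assumed blended cost per uncached request, USD

baseline = requests_per_month * cost_per_llm_call
with_cache = requests_per_month * (1 - cache_hit_rate) * cost_per_llm_call
print(f"baseline: ${baseline:,.0f}/mo, with cache: ${with_cache:,.0f}/mo")
```

At these assumed figures, a 25% hit rate turns a $10,000 monthly bill into $7,500, before counting the embedding cost of cache lookups, which is typically small by comparison.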
Spend limits that degrade gracefully instead of breaking the user experience:
class CostGuardrail:
    """Enforces cost limits while maintaining service quality."""

    def __init__(self, usage_store):
        self.store = usage_store

    async def check_budget(
        self,
        user_id: str,
        estimated_cost: float,
        daily_limit: float = 5.0,
    ) -> dict:
        current_spend = await self.store.get_daily_spend(user_id)
        remaining = daily_limit - current_spend
        if remaining <= 0:
            return {
                "allowed": False,
                "action": "block",
                "message": "Daily usage limit reached. Your limit resets at midnight UTC.",
            }
        if remaining < estimated_cost:
            return {
                "allowed": True,
                "action": "degrade",
                "adjustments": {
                    "max_output_tokens": 200,    # shorter responses
                    "max_retrieval_chunks": 2,   # less context
                    "use_cache_only": True,      # only cached responses
                },
                "message": "Approaching daily limit. Responses may be shorter.",
            }
        return {"allowed": True, "action": "normal"}

Graceful degradation—shorter responses, fewer retrieved chunks, cache-only mode—keeps the application functional while respecting budget constraints. Hard cutoffs frustrate users; soft limits manage expectations.
Token optimization isn't about squeezing every last token—it's about eliminating waste while preserving the responses and safety controls that matter.
The observability infrastructure described earlier gives you the per-request cost visibility needed to measure the impact of these optimizations. Without it, you're optimizing blind.