Tags: llm, cost-optimization, rag

Token Optimization: Cutting LLM Costs Without Sacrificing Safety

Practical techniques for reducing LLM API costs in RAG and chatbot systems—prompt compression, retrieval slimming, caching—without weakening guardrails.

Published March 22, 2026

6 min read

The difference between a profitable LLM application and a money pit is usually tokens. Every token you send to the model costs money—and in RAG systems, most of those tokens aren't the user's question or the model's answer. They're system prompts, retrieved context, and conversation history that you're paying for on every single request.

The RAG cost optimization case study showed a 90% cost reduction through spend limits, prompt trimming, and caching. This article goes deeper on the mechanics: exactly where tokens accumulate and how to cut them without breaking your guardrails or degrading response quality.

Where Tokens Go in RAG Systems

Before optimizing, you need to know where the money goes. A typical RAG request breaks down like this:

from dataclasses import dataclass
 
@dataclass
class TokenBreakdown:
    system_prompt: int
    guardrail_instructions: int
    retrieved_context: int
    conversation_history: int
    user_query: int
    output: int
 
    @property
    def total_input(self) -> int:
        return (
            self.system_prompt
            + self.guardrail_instructions
            + self.retrieved_context
            + self.conversation_history
            + self.user_query
        )
 
    @property
    def total(self) -> int:
        return self.total_input + self.output
 
    def cost_usd(self, input_price_per_1k: float, output_price_per_1k: float) -> float:
        return (
            (self.total_input / 1000 * input_price_per_1k)
            + (self.output / 1000 * output_price_per_1k)
        )
 
    def breakdown_percentages(self) -> dict:
        total = self.total_input
        return {
            "system_prompt": self.system_prompt / total * 100,
            "guardrail_instructions": self.guardrail_instructions / total * 100,
            "retrieved_context": self.retrieved_context / total * 100,
            "conversation_history": self.conversation_history / total * 100,
            "user_query": self.user_query / total * 100,
        }

In most RAG applications I've audited, the breakdown looks like this:

  • System prompt + guardrails: 15-25% (500-2,000 tokens)
  • Retrieved context: 40-60% (2,000-8,000 tokens)
  • Conversation history: 10-30% (500-4,000 tokens, grows per turn)
  • User query: 2-5% (50-200 tokens)
  • Output: 10-20% (200-1,500 tokens)

The optimization targets are clear: retrieved context and conversation history are the biggest levers.
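To make these percentages concrete, here is a back-of-envelope cost calculation using the mid-range figures above and hypothetical per-token prices (real pricing varies by model and provider):

```python
# Mid-range figures from the breakdown above (tokens per request)
input_tokens = 800 + 400 + 4000 + 1500 + 100  # prompt + guardrails + context + history + query
output_tokens = 600

# Hypothetical pricing: $2.50 per 1M input tokens, $10 per 1M output tokens
cost_per_request = input_tokens / 1000 * 0.0025 + output_tokens / 1000 * 0.01

print(f"${cost_per_request:.4f} per request")                   # $0.0230 per request
print(f"${cost_per_request * 100_000:,.0f} per 100k requests")  # $2,300 per 100k requests
```

At that volume, shaving 3,000 context tokens per request is worth roughly $750 per 100k requests on its own.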

System Prompt Compression

System prompts tend to grow over time as teams add safety instructions, edge case handling, and formatting rules. A prompt that started at 200 tokens becomes 1,500 tokens after six months of patches.

class PromptCompressor:
    """Compresses system prompts while preserving safety-critical instructions."""
 
    # These patterns must survive compression
    SAFETY_CRITICAL = [
        "do not reveal",
        "never share",
        "refuse to",
        "do not execute",
        "personally identifiable",
        "confidential",
    ]
 
    def compress(self, prompt: str) -> dict:
        """Compress a system prompt and verify safety instructions survive."""
        # Remove redundant whitespace and formatting
        compressed = " ".join(prompt.split())
 
        # Remove verbose examples that can be condensed
        compressed = self._condense_examples(compressed)
 
        # Remove repeated safety instructions (keep first instance)
        compressed = self._deduplicate_instructions(compressed)
 
        # Verify safety-critical content survived
        safety_check = self._verify_safety(prompt, compressed)
 
        return {
            "original_tokens": self._count_tokens(prompt),
            "compressed_tokens": self._count_tokens(compressed),
            "savings_pct": 1 - self._count_tokens(compressed) / self._count_tokens(prompt),
            "compressed_prompt": compressed,
            "safety_preserved": safety_check["all_preserved"],
            "safety_details": safety_check,
        }
 
    def _verify_safety(self, original: str, compressed: str) -> dict:
        """Verify that all safety-critical instructions survived compression."""
        results = {}
        for pattern in self.SAFETY_CRITICAL:
            original_has = pattern.lower() in original.lower()
            compressed_has = pattern.lower() in compressed.lower()
            if original_has and not compressed_has:
                results[pattern] = "MISSING"
            elif original_has and compressed_has:
                results[pattern] = "preserved"
 
        return {
            "all_preserved": "MISSING" not in results.values(),
            "patterns": results,
        }

    def _condense_examples(self, text: str) -> str:
        # Placeholder: domain-specific logic, e.g. replacing multi-paragraph
        # few-shot examples with one-line versions.
        return text

    def _deduplicate_instructions(self, text: str) -> str:
        # Drop sentences repeated verbatim, keeping the first occurrence.
        seen, kept = set(), []
        for sentence in text.split(". "):
            if sentence not in seen:
                seen.add(sentence)
                kept.append(sentence)
        return ". ".join(kept)

    def _count_tokens(self, text: str) -> int:
        # Rough approximation (~4 characters per token); swap in a real tokenizer.
        return max(1, len(text) // 4)

A well-compressed prompt typically saves 20-35% of tokens without changing behavior. But always verify safety instructions survive—removing a "never reveal the system prompt" instruction to save 10 tokens defeats the purpose.

Context Window Management

Retrieved context is the biggest token sink. Most RAG systems retrieve a fixed number of chunks regardless of the query, resulting in massive context waste.

from typing import Optional
 
class AdaptiveContextManager:
    """Dynamically selects retrieval volume based on query complexity."""
 
    def __init__(
        self,
        vector_store,
        min_chunks: int = 1,
        max_chunks: int = 10,
        similarity_threshold: float = 0.75,
        max_context_tokens: int = 4000,
    ):
        self.store = vector_store
        self.min_chunks = min_chunks
        self.max_chunks = max_chunks
        self.threshold = similarity_threshold
        self.max_tokens = max_context_tokens
 
    def retrieve(self, query: str, query_complexity: Optional[str] = None) -> list[dict]:
        """Retrieve only as many chunks as the query actually needs.

        `query_complexity` is a hook for callers that pre-classify queries;
        it is unused in this minimal version.
        """
        # Retrieve with a generous limit
        candidates = self.store.similarity_search_with_score(
            query, k=self.max_chunks
        )
 
        # Filter by similarity threshold—drop irrelevant chunks.
        # NOTE: some vector stores return a *distance* (lower is better)
        # from similarity_search_with_score; flip the comparison if so.
        relevant = [
            (doc, score) for doc, score in candidates
            if score >= self.threshold
        ]
 
        # Ensure minimum chunks
        if len(relevant) < self.min_chunks:
            relevant = candidates[:self.min_chunks]
 
        # Enforce token budget
        selected = []
        token_count = 0
        for doc, score in relevant:
            doc_tokens = count_tokens(doc.page_content)
            if token_count + doc_tokens > self.max_tokens:
                break
            selected.append(doc)
            token_count += doc_tokens
 
        return selected

The difference between retrieving 10 chunks blindly vs. 3 relevant chunks adaptively is often 3,000-5,000 tokens per request—at scale, that's thousands of dollars per month.
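The retriever above calls a `count_tokens` helper that isn't defined in the snippet. A minimal stand-in, assuming `tiktoken` as an optional dependency and falling back to a rough characters-per-token heuristic:

```python
def count_tokens(text: str) -> int:
    """Approximate token count. Uses tiktoken when installed,
    otherwise a rough ~4-characters-per-token heuristic."""
    try:
        import tiktoken
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except ImportError:
        return max(1, len(text) // 4)
```

The heuristic overcounts on code and undercounts on non-English text, but it is good enough for enforcing a coarse token budget; use the model's real tokenizer when the budget is tight.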

Conversation History Strategies

In multi-turn conversations, history grows linearly per turn. By turn 20, you might be sending 10,000 tokens of history on every request.

class ConversationCompressor:
    """Manages conversation history to control token growth."""
 
    def __init__(
        self,
        max_history_tokens: int = 2000,
        summarization_model: str = "gpt-4o-mini",
        window_size: int = 6,
    ):
        self.max_tokens = max_history_tokens
        self.summarization_model = summarization_model
        self.window = window_size
 
    def compress_history(self, messages: list[dict]) -> list[dict]:
        """Compress conversation history using sliding window + summary."""
        if count_tokens_messages(messages) <= self.max_tokens:
            return messages  # no compression needed
 
        # Keep the system prompt separate so it is never summarized or duplicated
        system = [m for m in messages if m["role"] == "system"]
        non_system = [m for m in messages if m["role"] != "system"]

        # Keep recent messages in full (sliding window)
        recent = non_system[-self.window:]

        # Summarize everything older than the window
        older = non_system[:-self.window]
 
        if older:
            summary = self._summarize(older)
            summary_msg = {
                "role": "system",
                "content": f"Summary of earlier conversation: {summary}",
            }
            return system + [summary_msg] + recent
        else:
            return system + recent
 
    def _summarize(self, messages: list[dict]) -> str:
        """Generate a concise summary of conversation messages."""
        conversation = "\n".join(
            f"{m['role']}: {m['content'][:200]}" for m in messages
        )
        prompt = f"""Summarize this conversation in 2-3 sentences.
Focus on: what the user asked, what was resolved, what's still pending.
 
{conversation}"""
 
        response = call_model(self.summarization_model, prompt)
        return response.content

Three strategies, from cheapest to most effective:

  1. Sliding window (no LLM call): Keep last N messages, drop the rest. Simple but loses context.
  2. Summarization (one LLM call with a cheap model): Summarize older messages, keep recent ones verbatim. Best balance of cost and quality.
  3. Hybrid: Sliding window for most conversations, summarization only when context is critical (e.g., support tickets with history).
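The first strategy needs no LLM call at all. A minimal sketch (the window size of 6 mirrors the compressor above):

```python
def sliding_window(messages: list[dict], window: int = 6) -> list[dict]:
    """Strategy 1: keep the system prompt plus the last `window` non-system
    messages, dropping everything older."""
    system = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    return system + non_system[-window:]
```

It is worth shipping this first and measuring quality before paying for summarization: for many chatbots, the last few turns carry almost all the useful context.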

Semantic Caching

Many RAG queries are semantically similar. "How do I reset my password?" and "I forgot my password, how do I change it?" should produce the same response—and you should only pay for the LLM call once.

import time

import numpy as np
from typing import Optional
 
class SemanticCache:
    """Cache LLM responses by semantic similarity of queries."""
 
    def __init__(
        self,
        embedding_model,
        similarity_threshold: float = 0.92,
        ttl_seconds: int = 3600,
    ):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds
        self.entries: list[dict] = []
 
    def get(self, query: str) -> Optional[str]:
        """Return cached response if a semantically similar query exists."""
        query_embedding = self.embedding_model.embed(query)
 
        best_match = None
        best_score = 0.0
 
        now = time.time()
        for entry in self.entries:
            # Honor the TTL: treat entries older than self.ttl as expired
            if entry["expired"] or now - entry["created_at"] > self.ttl:
                continue
 
            score = self._cosine_similarity(query_embedding, entry["embedding"])
            if score > best_score and score >= self.threshold:
                best_score = score
                best_match = entry
 
        if best_match:
            best_match["hit_count"] += 1
            return best_match["response"]
 
        return None
 
    def put(self, query: str, response: str):
        """Cache a response for semantic retrieval."""
        self.entries.append({
            "query": query,
            "embedding": self.embedding_model.embed(query),
            "response": response,
            "created_at": time.time(),
            "expired": False,
            "hit_count": 0,
        })
 
    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

In customer support chatbots, semantic caching typically reduces LLM API calls by 15-30%. The cache miss rate drops further as the cache warms up over the first few days.

Safety consideration: Cache responses from guardrail-validated outputs only. If a response was filtered or modified by guardrails, cache the filtered version—never bypass guardrails to serve a cached response.
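One way to wire the cache into that rule, sketched with injected callables so guardrails always run before anything enters the cache (the function and parameter names here are illustrative, not from any particular library):

```python
from typing import Callable

def answer_with_cache(
    query: str,
    cache,  # anything with .get(query) -> str | None and .put(query, response)
    call_llm: Callable[[str], str],
    apply_guardrails: Callable[[str], str],
) -> str:
    """Serve a cached response when one exists; otherwise call the model,
    run guardrails on the raw output, and cache only the validated result."""
    cached = cache.get(query)
    if cached is not None:
        return cached  # was validated before it entered the cache
    raw = call_llm(query)
    safe = apply_guardrails(raw)  # filter/redact BEFORE caching
    cache.put(query, safe)
    return safe
```

Because `put` only ever sees guardrail-validated text, a cache hit can never leak something the guardrails would have blocked.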

Cost Guardrails

Spend limits that degrade gracefully instead of breaking the user experience:

class CostGuardrail:
    """Enforces cost limits while maintaining service quality."""
 
    def __init__(self, usage_store):
        self.store = usage_store
 
    async def check_budget(
        self,
        user_id: str,
        estimated_cost: float,
        daily_limit: float = 5.0,
    ) -> dict:
        current_spend = await self.store.get_daily_spend(user_id)
        remaining = daily_limit - current_spend
 
        if remaining <= 0:
            return {
                "allowed": False,
                "action": "block",
                "message": "Daily usage limit reached. Your limit resets at midnight UTC.",
            }
 
        if remaining < estimated_cost:
            return {
                "allowed": True,
                "action": "degrade",
                "adjustments": {
                    "max_output_tokens": 200,    # shorter responses
                    "max_retrieval_chunks": 2,   # less context
                    "use_cache_only": True,       # only cached responses
                },
                "message": "Approaching daily limit. Responses may be shorter.",
            }
 
        return {"allowed": True, "action": "normal"}

Graceful degradation—shorter responses, fewer retrieved chunks, cache-only mode—keeps the application functional while respecting budget constraints. Hard cutoffs frustrate users; soft limits manage expectations.
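Downstream, a "degrade" decision has to actually change the outgoing request. A small helper, assuming the request parameters use the same keys as the `adjustments` dict above:

```python
def apply_budget_decision(request_params: dict, decision: dict) -> dict:
    """Fold a CostGuardrail decision into the outgoing request parameters.
    Only the "degrade" action changes anything; "normal" passes through."""
    if decision.get("action") != "degrade":
        return request_params
    adjusted = dict(request_params)  # don't mutate the caller's dict
    adjusted.update(decision.get("adjustments", {}))
    return adjusted
```

Keeping the merge in one place means new degradation knobs (a cheaper model, a lower temperature) only need a new key in `adjustments`, not changes at every call site.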

Conclusion

Token optimization isn't about squeezing every last token—it's about eliminating waste while preserving the responses and safety controls that matter.

  1. Retrieved context is the biggest lever. Adaptive retrieval with similarity thresholds cuts 40-60% of context tokens without degrading answer quality.
  2. Compress prompts, but never remove safety instructions. A 30% shorter system prompt saves thousands of dollars over a year, but losing a guardrail instruction costs more than money.
  3. Cache semantically, not literally. Semantic caching captures the long tail of similar queries that exact-match caching misses—and always serve cached responses through guardrails.

The observability infrastructure described earlier gives you the per-request cost visibility needed to measure the impact of these optimizations. Without it, you're optimizing blind.
