
RAG Pipeline Optimization — From 8s to 400ms

April 27, 2026 · ScaledByDesign
rag · ai · llm · vector-database · performance

The 8-Second RAG Response

A client built a customer support chatbot powered by RAG. It worked — answers were accurate, grounded in their documentation. But each response took 8 seconds. Customers abandoned the chat after 4 seconds. The bot was smart but useless because it was slow.

We cut time to first token to about 400ms, with full responses completing in 1.8s on average. Same answer quality. Here's every optimization we applied.

The Original Pipeline (8.2s)

Step 1: Receive user query                         ~10ms
Step 2: Embed query with OpenAI ada-002             ~350ms
Step 3: Search Pinecone for top-20 chunks           ~120ms
Step 4: Re-rank all 20 chunks with a cross-encoder  ~800ms
Step 5: Format prompt with all 20 chunks            ~5ms
Step 6: Call GPT-4o with 8K token context           ~6,900ms
Total:                                              ~8,185ms

The bottleneck was obvious: Step 6 (LLM generation) was 84% of latency. But every step had room for improvement.
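Before optimizing anything, it's worth confirming per-step numbers like the table above with instrumentation. A minimal sketch of how to collect them; the `timed` helper and the fake step functions are illustrative, not part of the client's pipeline:

```typescript
// A minimal timing wrapper: run an async step and record its duration.
async function timed<T>(
  label: string,
  step: () => Promise<T>,
  timings: Record<string, number>,
): Promise<T> {
  const start = performance.now();
  const result = await step();
  timings[label] = performance.now() - start;
  return result;
}

// Stand-ins for real pipeline steps, just for the sketch.
async function fakeEmbed(query: string): Promise<number[]> {
  return [query.length, 0.5];
}
async function fakeSearch(embedding: number[]): Promise<string[]> {
  return ["chunk-a", "chunk-b"];
}

// Collect per-step timings for one request.
async function instrumentedPipeline(query: string) {
  const timings: Record<string, number> = {};
  const embedding = await timed("embed", () => fakeEmbed(query), timings);
  const chunks = await timed("search", () => fakeSearch(embedding), timings);
  return { chunks, timings };
}
```

Logging these per-request makes regressions visible the day they ship, not when customers start abandoning the chat.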

Optimization 1: Streaming (Perceived Latency → 0)

The single biggest UX improvement — start showing the response immediately:

// Stream the response instead of waiting for completion
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function* streamRAGResponse(query: string) {
  const context = await getRelevantContext(query); // Still takes ~1.2s

  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",  // Optimization 2: cheaper model
    messages: [
      { role: "system", content: buildSystemPrompt(context) },
      { role: "user", content: query },
    ],
    stream: true,
    max_tokens: 500,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) yield content;
  }
}

// Time to first token: ~1.4s (context retrieval + LLM warmup)
// Perceived latency: user sees text appearing after 1.4s
// Total completion: ~3s (but user is reading, not waiting)

Streaming doesn't reduce total time, but it transforms the experience. Users start reading at 1.4s instead of staring at a spinner for 8s.
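That perceived-latency win is easy to verify on the consumer side by recording when the first token arrives. A sketch using a mock stream in place of `streamRAGResponse`; the mock and the `consumeStream` helper are illustrative:

```typescript
// Consume a token stream, recording when the first token arrives.
async function consumeStream(
  stream: AsyncGenerator<string>,
): Promise<{ text: string; ttftMs: number }> {
  const start = Date.now();
  let ttftMs = -1;
  let text = "";
  for await (const token of stream) {
    if (ttftMs < 0) ttftMs = Date.now() - start; // first token lands here
    text += token;
  }
  return { text, ttftMs };
}

// Mock stream standing in for streamRAGResponse.
async function* mockStream(): AsyncGenerator<string> {
  for (const token of ["Hello", ", ", "world"]) {
    yield token;
  }
}
```

Tracking time to first token as its own metric, separate from total completion time, is what makes the streaming improvement measurable at all.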

Optimization 2: Model Selection

GPT-4o was overkill for 80% of customer support queries:

// Route queries to appropriate models
interface ModelConfig {
  model: string;
  maxTokens: number;
}

function selectModel(query: string, complexity: number): ModelConfig {
  // Simple FAQ-type queries → fastest, cheapest model
  if (complexity < 0.3) {
    return { model: "gpt-4o-mini", maxTokens: 300 };
    // Generation time: ~800ms vs ~4,500ms for gpt-4o
  }

  // Multi-step reasoning, comparison queries → full model
  if (complexity > 0.7) {
    return { model: "gpt-4o", maxTokens: 800 };
  }

  // Default: mid-tier
  return { model: "gpt-4o-mini", maxTokens: 500 };
}

// Complexity scoring (cheap, fast classification)
async function scoreComplexity(query: string): Promise<number> {
  // Simple heuristics first (no API call needed)
  const wordCount = query.split(" ").length;
  const hasComparison = /compare|versus|difference|better/i.test(query);
  const hasMultiStep = /and also|then|after that|steps/i.test(query);

  let score = 0;
  if (wordCount > 30) score += 0.3;
  if (hasComparison) score += 0.3;
  if (hasMultiStep) score += 0.3;

  return Math.min(score, 1);
}

Impact: 70% of queries now use gpt-4o-mini. Average generation time: 1.2s → 600ms.

Optimization 3: Smarter Retrieval

Retrieving 20 chunks and re-ranking all of them was wasteful:

// Before: retrieve 20, re-rank all 20, use top 5
// After: retrieve 8, re-rank top 8, use top 3
 
async function getRelevantContext(query: string): Promise<string> {
  // 1. Embed query (use a faster, local model)
  const embedding = await localEmbed(query); // ~15ms vs 350ms for OpenAI
  
  // 2. Vector search — only retrieve 8 (not 20)
  const candidates = await vectorDB.search(embedding, {
    topK: 8,           // Reduced from 20
    minScore: 0.75,    // Skip low-relevance results entirely
  });
 
  // 3. Re-rank only if we have > 3 candidates
  let topChunks;
  if (candidates.length > 3) {
    topChunks = await rerank(query, candidates, { topK: 3 });
  } else {
    topChunks = candidates;
  }
 
  // 4. Return concatenated context (smaller = faster LLM generation)
  return topChunks.map(c => c.text).join("\n\n---\n\n");
}

Impact: Retrieval step: 920ms → 180ms. Context size: 6K tokens → 2K tokens (smaller context = faster generation).
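The same "smaller context" idea can also be enforced with an explicit token budget instead of a fixed top-k. A sketch, assuming a rough 4-characters-per-token estimate; the `Chunk` shape and `packContext` helper are hypothetical, not from the client's code:

```typescript
interface Chunk {
  text: string;
  score: number;
}

// Rough token estimate: ~4 characters per token for English text.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

// Greedily pack the highest-scoring chunks until the budget is spent.
function packContext(chunks: Chunk[], budgetTokens: number): string {
  const sorted = [...chunks].sort((a, b) => b.score - a.score);
  const picked: string[] = [];
  let used = 0;
  for (const chunk of sorted) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > budgetTokens) continue; // skip chunks that overflow
    picked.push(chunk.text);
    used += cost;
  }
  return picked.join("\n\n---\n\n");
}
```

A budget caps worst-case prompt size directly, so a handful of unusually long chunks can't silently blow up generation time.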

Optimization 4: Embedding Cache

Many queries are similar. Cache the embedding to skip the embedding step entirely:

// Semantic embedding cache
import { createHash } from "node:crypto";
import { LRUCache } from "lru-cache";

const embeddingCache = new LRUCache<string, number[]>({ max: 10000 });

// Stable cache key from the normalized query text
const hash = (s: string): string =>
  createHash("sha256").update(s).digest("hex");

async function getEmbedding(text: string): Promise<number[]> {
  // Normalize query for cache hits
  const normalized = text.toLowerCase().trim().replace(/\s+/g, " ");
  const cacheKey = hash(normalized);

  const cached = embeddingCache.get(cacheKey);
  if (cached) return cached; // Cache hit: 0ms vs 15-350ms

  const embedding = await localEmbed(normalized);
  embeddingCache.set(cacheKey, embedding);
  return embedding;
}

Impact: 30% cache hit rate. Average embedding time: 15ms → 10ms.
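Those numbers follow directly from expected-value arithmetic: with hit rate h, a near-zero hit cost, and miss cost m, average latency is (1 - h) * m. A quick sketch:

```typescript
// Expected embedding latency given a cache hit rate.
// Hits are effectively free; misses pay the full embedding cost.
function expectedLatencyMs(hitRate: number, missMs: number, hitMs = 0): number {
  return hitRate * hitMs + (1 - hitRate) * missMs;
}

// 30% hit rate against the ~15ms local embed:
// expectedLatencyMs(0.3, 15) ≈ 10.5ms, matching the ~10ms average above.
```

The same formula also shows why the cache mattered far more against the original 350ms OpenAI embed than against the 15ms local model.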

Optimization 5: Parallel Execution

Don't run steps sequentially when they can overlap:

async function ragPipeline(query: string) {
  // Run embedding and complexity scoring in parallel
  const [embedding, complexity] = await Promise.all([
    getEmbedding(query),
    scoreComplexity(query),
  ]);

  // Model selection is a cheap synchronous call; no need to await it
  const modelConfig = selectModel(query, complexity);

  // Vector search with the tighter top-8 / min-score settings
  const candidates = await vectorDB.search(embedding, {
    topK: 8,
    minScore: 0.75,
  });

  // Re-rank only when there are enough candidates to matter
  const topChunks = candidates.length > 3
    ? await rerank(query, candidates, { topK: 3 })
    : candidates;
  const context = topChunks.map(c => c.text).join("\n\n---\n\n");

  // Stream response
  return streamResponse(query, context, modelConfig);
}

The Optimized Pipeline (400ms to first token)

Step 1: Receive query                                ~10ms
Step 2: Embed query (local model + cache)            ~10ms  (was 350ms)
        Score complexity (parallel)                   ~5ms
Step 3: Vector search (top-8, min score 0.75)        ~80ms  (was 120ms)
Step 4: Re-rank top 3 (if needed)                    ~100ms (was 800ms)
Step 5: Format prompt (2K tokens, not 8K)            ~2ms
Step 6: Stream first token (gpt-4o-mini)             ~200ms (was 6,900ms wait)
Time to first token:                                 ~402ms
Full response:                                       ~1.8s  (was 8.2s)

Quality Validation

Speed means nothing if answers get worse. We validated with a test suite:

Answer quality metrics (500 test queries):
  Before optimization:
    Accuracy: 92%
    Relevance: 88%
    Completeness: 85%
    Avg response time: 8.2s

  After optimization:
    Accuracy: 91% (-1%, within margin)
    Relevance: 90% (+2%, better retrieval filtering)
    Completeness: 82% (-3%, shorter context)
    Avg response time: 1.8s (-78%)
    Time to first token: 0.4s (-95%)
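A regression suite behind numbers like these can be a plain aggregation over graded answers. A minimal sketch; the `GradedAnswer` shape is an assumption about how each test query might be scored, not the client's actual harness:

```typescript
interface GradedAnswer {
  accurate: boolean;
  relevant: boolean;
  complete: boolean;
  responseMs: number;
}

// Aggregate pass rates and average latency over a graded test run.
function summarize(results: GradedAnswer[]) {
  const n = results.length;
  const rate = (pick: (r: GradedAnswer) => boolean) =>
    results.filter(pick).length / n;
  return {
    accuracy: rate(r => r.accurate),
    relevance: rate(r => r.relevant),
    completeness: rate(r => r.complete),
    avgResponseMs: results.reduce((sum, r) => sum + r.responseMs, 0) / n,
  };
}
```

Running a summary like this before and after each change is what lets you accept a small completeness drop deliberately instead of discovering it in production.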

The 3% drop in completeness was a trade-off we accepted — answers were slightly less detailed but still accurate. And users actually read them now because they didn't abandon the chat waiting.

RAG performance isn't about picking the fastest LLM. It's about optimizing the entire pipeline — embedding, retrieval, re-ranking, generation — and running as much as possible in parallel. The LLM is often not even the biggest bottleneck once you look at the full picture.
