
We Built an AI Code Review Bot — Here's What It Actually Catches (And What It Misses)

February 25, 2026 · ScaledByDesign
Tags: ai, code-review, automation, developer-tools, llm

The Experiment

Six months ago, we built an AI code review bot for a client's engineering team. Not a wrapper around ChatGPT — a purpose-built system that integrates with their GitHub workflow, understands their codebase, and provides structured feedback on every pull request.

The team was skeptical. "AI can't understand our code" was the polite version of the pushback. After six months and 3,400 pull requests, we have real data.

What We Built

The bot runs on every PR and returns structured findings in five categories:

interface CodeReviewResult {
  bugs: Finding[];          // Potential bugs and logic errors
  style: Finding[];         // Style and convention violations
  security: Finding[];      // Security vulnerabilities
  performance: Finding[];   // Performance concerns
  suggestions: Finding[];   // Improvement suggestions
  confidence: number;       // 0-1, how confident the bot is
}
 
// Only comments with confidence > 0.7 are posted to the PR
// Lower confidence findings go to a dashboard for human review
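The confidence gate can be sketched as a simple partition. This is an illustrative sketch, not the system's real code — it assumes a per-finding `confidence` field, and the helper names are hypothetical:

```typescript
interface Finding {
  message: string;
  file: string;
  line: number;
  confidence: number; // 0-1, per-finding confidence score (assumed for illustration)
}

const POST_THRESHOLD = 0.7;

// Partition findings: high-confidence ones are posted to the PR,
// the rest are routed to a dashboard for human triage.
function routeFindings(findings: Finding[]): {
  postToPr: Finding[];
  dashboard: Finding[];
} {
  const postToPr = findings.filter((f) => f.confidence > POST_THRESHOLD);
  const dashboard = findings.filter((f) => f.confidence <= POST_THRESHOLD);
  return { postToPr, dashboard };
}
```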

The architecture uses a RAG pipeline with the codebase as context:

PR diff → Chunk into logical changes
       → Retrieve relevant codebase context (similar files, tests, types)
       → Generate structured review with citations
       → Filter by confidence threshold
       → Post to GitHub PR as inline comments
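The stages above compose naturally as an async pipeline. A minimal skeleton, assuming injected helpers — every function name here is a hypothetical stand-in, since the post doesn't show the actual implementation:

```typescript
type Chunk = { diff: string; files: string[] };
type Review = { findings: { confidence: number; body: string }[] };

// Hypothetical pipeline skeleton mirroring the stages above.
async function reviewPullRequest(
  prDiff: string,
  deps: {
    chunk: (diff: string) => Chunk[];
    retrieveContext: (c: Chunk) => Promise<string[]>;
    generateReview: (c: Chunk, ctx: string[]) => Promise<Review>;
    postComment: (body: string) => Promise<void>;
  },
  threshold = 0.7
): Promise<void> {
  for (const chunk of deps.chunk(prDiff)) {
    // RAG step: pull similar files, tests, and type definitions as context
    const ctx = await deps.retrieveContext(chunk);
    const review = await deps.generateReview(chunk, ctx);
    for (const f of review.findings) {
      // Confidence filter before anything reaches the PR
      if (f.confidence > threshold) await deps.postComment(f.body);
    }
  }
}
```

Injecting the dependencies keeps the orchestration testable without a live LLM or GitHub connection.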

The Data: What It Catches

After 3,400 PRs, here's the breakdown of findings that humans confirmed as valid:

Category                  Findings   True Positive Rate   Examples
Missing null checks       312        89%                  Unhandled optional chaining, missing undefined guards
Type mismatches           187        94%                  Wrong argument types, missing type assertions
Unused imports/vars       456        98%                  Dead code that linters sometimes miss in complex cases
SQL injection risks       23         78%                  String concatenation in queries, missing parameterization
Race conditions           41         62%                  Async operations without proper locks or ordering
Error handling gaps       198        85%                  Missing try/catch, swallowed errors, missing error types
API contract violations   89         71%                  Response shapes that don't match API specs
Test coverage gaps        267        82%                  Missing edge case tests, untested error paths

Overall true positive rate: 83%. That means 17% of the bot's comments were false positives — noise that developers had to read and dismiss.
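The highest-volume categories are exactly the kind of mechanical slip that's easy to write and easy for a machine to spot. A contrived example of the missing-guard pattern such a bot flags (not taken from the client's codebase):

```typescript
interface User {
  profile?: { email: string };
}

// Bug pattern: the non-null assertion silences the compiler,
// but `user.profile` may genuinely be undefined at runtime.
function emailDomainUnsafe(user: User): string {
  return user.profile!.email.split("@")[1]; // a bot would flag this assertion
}

// Guarded version: the missing profile is handled explicitly.
function emailDomain(user: User): string | undefined {
  return user.profile?.email.split("@")[1];
}
```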

What It Misses

The bot consistently fails at:

1. Business Logic Errors: The bot can't understand that a discount shouldn't apply to already-discounted items because that's a business rule, not a code pattern.

2. Architectural Concerns: "This service is doing too much" or "this should be a separate module" requires understanding system design intent that the bot doesn't have.

3. Performance at Scale: The bot catches obvious N+1 queries but misses subtler issues like "this works fine at 100 records but will time out at 100K."

4. UX Implications: Code that's technically correct but creates a poor user experience (loading states, error messages, accessibility) is invisible to the bot.

5. Context-Dependent Decisions: "We chose this approach because of X constraint" — the bot often suggests refactors that ignore historical context or business constraints.
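To make the first point concrete, here is the shape of a bug the bot sails past: code that type-checks and follows every convention but violates a business rule. The discount example from above, as a hypothetical sketch:

```typescript
interface CartItem {
  price: number;
  alreadyDiscounted: boolean;
}

// Looks fine to a pattern-matching reviewer: no type errors, no
// missing guards. But the business rule says promo discounts must
// not stack on already-discounted items — and nothing in the code
// pattern signals that.
function applyPromoBuggy(items: CartItem[], promoRate: number): number {
  return items.reduce(
    (total, item) => total + item.price * (1 - promoRate), // stacks on everything
    0
  );
}

// Correct per the business rule: skip already-discounted items.
function applyPromo(items: CartItem[], promoRate: number): number {
  return items.reduce(
    (total, item) =>
      total +
      (item.alreadyDiscounted ? item.price : item.price * (1 - promoRate)),
    0
  );
}
```

Structurally, the two functions are nearly identical — which is precisely why this class of error needs a human who knows the rule.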

The Impact on Team Velocity

Before AI Code Review:
  Average PR review time:     4.2 hours (time to first human review)
  Average review cycles:      2.3 rounds
  PRs merged per dev per week: 3.1
  Bug escape rate:            8.2% (bugs found in staging/production)

After AI Code Review (6 months):
  Average PR review time:     1.8 hours (-57%)
  Average review cycles:      1.6 rounds (-30%)
  PRs merged per dev per week: 4.7 (+52%)
  Bug escape rate:            5.1% (-38%)

The biggest win wasn't catching bugs — it was reducing the first review cycle. The bot catches the mechanical issues (null checks, types, error handling) so human reviewers can focus on architecture, logic, and design.

The Cost

Monthly Cost Breakdown:
  LLM API calls (Claude/GPT-4):  $1,200
  Embedding/RAG infrastructure:  $300
  GitHub Actions compute:        $150
  Engineering maintenance:       ~8 hours/month
  Total:                         ~$2,400/month

ROI Calculation:
  Developer time saved:          ~40 hours/month (across 12-person team)
  At $100/hour fully loaded:     $4,000/month saved
  Bug escape reduction:          ~$2,000/month (estimated incident cost savings)
  Net value:                     ~$3,600/month positive
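Those figures tie out as straightforward arithmetic, assuming the ~8 maintenance hours are costed at the same $100/hour fully-loaded rate used for developer time (the post rounds to ~$2,400 and ~$3,600):

```typescript
// Recomputing the post's cost and ROI figures.
const HOURLY_RATE = 100; // fully loaded, per the post

const monthlyCost =
  1_200 + // LLM API calls (Claude/GPT-4)
  300 + // embedding/RAG infrastructure
  150 + // GitHub Actions compute
  8 * HOURLY_RATE; // ~8 hours/month of engineering maintenance

const monthlyValue =
  40 * HOURLY_RATE + // ~40 developer hours saved across the team
  2_000; // estimated incident cost savings from fewer escaped bugs

const netValue = monthlyValue - monthlyCost; // $3,550/month before rounding
```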

The Honest Assessment

Worth building? Yes, but only if your team is large enough (8+ engineers) to justify the maintenance overhead.

Replace human reviewers? No. The bot handles ~40% of review feedback (the mechanical stuff). Humans are still essential for the 60% that requires judgment.

Build or buy? For most teams: buy. Tools like CodeRabbit, Sourcery, and GitHub Copilot code review have gotten good. We built custom because the client needed deep codebase context and custom rules. Unless you have specific requirements, start with an off-the-shelf solution.

The 17% false positive problem: This is the biggest risk. If the false positive rate creeps above 20%, developers start ignoring all bot comments. You need active tuning and a feedback loop where developers can thumbs-down bad suggestions.
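A feedback loop like that can be as simple as tracking per-category acceptance and demoting categories that drift below a floor. A sketch — the names, thresholds, and minimum sample size are all illustrative choices, not the system's actual tuning logic:

```typescript
interface CategoryStats {
  posted: number;
  dismissed: number;
}

const MUTE_BELOW = 0.8; // demote a category when acceptance drops below 80%
const MIN_SAMPLE = 20; // don't judge a category on too few data points

// Record a developer thumbs-up/thumbs-down on a bot comment.
function recordFeedback(
  stats: Map<string, CategoryStats>,
  category: string,
  accepted: boolean
): void {
  const s = stats.get(category) ?? { posted: 0, dismissed: 0 };
  s.posted += 1;
  if (!accepted) s.dismissed += 1;
  stats.set(category, s);
}

// Should this category keep posting to PRs, or fall back to the
// human-review dashboard until it's retuned?
function shouldPost(stats: Map<string, CategoryStats>, category: string): boolean {
  const s = stats.get(category);
  if (!s || s.posted < MIN_SAMPLE) return true; // not enough data yet
  return (s.posted - s.dismissed) / s.posted >= MUTE_BELOW;
}
```

The point is less the mechanism than the habit: if nobody closes the loop on dismissed comments, the false positive rate only ever drifts upward.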

The AI code review bot isn't a replacement for engineering culture. It's a force multiplier for teams that already have good review practices. If your team doesn't review code at all, a bot won't fix that. If your team reviews code well but slowly, a bot can make them faster.

Build the culture first. Then automate the mechanical parts.
