Terraform State Management Lessons We Learned the Hard Way
The $50K State File Incident
A client's engineer ran terraform apply on a Friday afternoon. The state file had drifted from reality because someone had made manual changes in the AWS console. Terraform's plan showed "47 resources to destroy and recreate." The engineer, in a hurry, approved it.
Forty-seven resources — including three production RDS instances — were destroyed and recreated. The databases came back empty. Four hours of downtime. $50K in lost revenue. The backups worked (thankfully), but the recovery took until 2am.
All because of a state file that nobody was actively managing.
State Management Rule 1: Remote State, Always
If your Terraform state file lives on someone's laptop, it's not a matter of if you'll lose it — it's when.
```hcl
# backend.tf — this is non-negotiable
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}
```

The S3 backend with DynamoDB locking gives you three critical things:
- Shared access: Everyone works with the same state
- Locking: Two people can't modify state simultaneously
- Encryption: State contains secrets (RDS passwords, API keys)
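The backend bucket and lock table have to exist before any configuration can point at them, so they're usually bootstrapped once in a tiny separate configuration (a chicken-and-egg problem: the state bucket can't store its own state before it exists). A minimal sketch, with illustrative names matching the backend above — note that the S3 backend requires the DynamoDB table's hash key to be named exactly `LockID`:

```hcl
# Bootstrap configuration for the backend itself. Names are illustrative.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-state"
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend requires exactly this attribute name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```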
State Management Rule 2: State Isolation
One giant state file for your entire infrastructure is a recipe for disaster. Split state by environment and component:
```
terraform/
├── modules/                  # Reusable modules
│   ├── networking/
│   ├── database/
│   └── compute/
├── environments/
│   ├── prod/
│   │   ├── networking/       # State: prod/networking/terraform.tfstate
│   │   ├── database/         # State: prod/database/terraform.tfstate
│   │   ├── compute/          # State: prod/compute/terraform.tfstate
│   │   └── monitoring/       # State: prod/monitoring/terraform.tfstate
│   ├── staging/
│   │   ├── networking/
│   │   ├── database/
│   │   └── compute/
│   └── dev/
│       └── ...
```
Each component has its own state file. Benefits:
- Blast radius: A bad `apply` in `compute` can't destroy your `database`
- Speed: Small state files mean fast plan/apply cycles
- Team parallelism: Different teams can work on different components simultaneously
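Split states can still share data. One common pattern is the `terraform_remote_state` data source, which lets `compute` read outputs published by `networking` without merging their state files. A sketch, assuming the bucket layout above (`private_subnet_id` is a hypothetical output name):

```hcl
# compute reads networking's published outputs from its remote state.
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "company-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_id
}
```

This keeps the dependency explicit and one-directional: `compute` depends on `networking`, never the reverse.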
State Management Rule 3: Never Modify State Manually
When state drifts from reality, the temptation is to edit the state file directly. Don't. Use Terraform's built-in state commands:
```shell
# Import an existing resource into state
terraform import aws_instance.web i-1234567890abcdef0

# Move a resource to a new address (after refactoring)
terraform state mv aws_instance.old aws_instance.new

# Remove a resource from state (without destroying it)
terraform state rm aws_instance.legacy

# Show current state for a resource
terraform state show aws_instance.web
```

State Management Rule 4: Prevent Manual Changes
The $50K incident happened because someone made changes in the AWS console. Prevent this:
```hcl
# Prevent accidental destruction of critical resources
resource "aws_db_instance" "production" {
  # ... configuration ...

  lifecycle {
    prevent_destroy = true # Terraform will refuse to destroy this
  }
}

# Tag everything managed by Terraform
resource "aws_instance" "web" {
  # ... configuration ...

  tags = {
    ManagedBy   = "terraform"
    Environment = var.environment
    Component   = "compute"
    StateFile   = "prod/compute/terraform.tfstate"
  }
}
```

Set up AWS Config rules or SCPs to alert when someone modifies Terraform-managed resources manually.
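As one sketch of that idea, the AWS Config managed rule `REQUIRED_TAGS` can flag EC2 instances missing the `ManagedBy` tag, so console-created (unmanaged) resources surface quickly. The rule name and scope here are illustrative assumptions:

```hcl
# Illustrative: flag EC2 instances that lack ManagedBy = "terraform".
resource "aws_config_config_rule" "require_managedby_tag" {
  name = "require-managedby-tag"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS" # AWS-managed rule
  }

  input_parameters = jsonencode({
    tag1Key   = "ManagedBy"
    tag1Value = "terraform"
  })

  scope {
    compliance_resource_types = ["AWS::EC2::Instance"]
  }
}
```

Pair the rule with an EventBridge notification on noncompliance so the alert reaches a human, not just a dashboard.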
State Management Rule 5: Drift Detection
Don't wait for terraform plan to discover drift. Run automated drift detection:
```yaml
# .github/workflows/drift-detection.yml
name: Terraform Drift Detection
on:
  schedule:
    - cron: '0 8 * * 1-5' # Every weekday at 8am
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        component: [networking, database, compute, monitoring]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Plan (Drift Check)
        id: plan # referenced below; -detailed-exitcode exits 2 when drift exists
        run: |
          cd environments/prod/${{ matrix.component }}
          terraform init
          terraform plan -detailed-exitcode -out=drift.plan
        continue-on-error: true
      - name: Alert on Drift
        if: steps.plan.outcome == 'failure'
        run: |
          # Send Slack alert with drift details
          curl -X POST "$SLACK_WEBHOOK" -d "{
            \"text\": \"⚠️ Terraform drift detected in prod/${{ matrix.component }}\"
          }"
```

State Management Rule 6: State File Backups
S3 versioning gives you state file history, but also set up explicit backups:
```hcl
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    id     = "state-versions"
    status = "Enabled"

    noncurrent_version_expiration {
      noncurrent_days = 90 # Keep 90 days of state history
    }
  }
}
```

When something goes wrong (and it will), you can roll back to a previous state version.
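A rollback might look like this with the AWS CLI, assuming the bucket layout from Rule 1 (the version ID placeholder is yours to fill in from the listing):

```shell
# List available versions of a state file
aws s3api list-object-versions \
  --bucket company-terraform-state \
  --prefix prod/networking/terraform.tfstate

# Download a known-good version for inspection
aws s3api get-object \
  --bucket company-terraform-state \
  --key prod/networking/terraform.tfstate \
  --version-id <VERSION_ID> \
  recovered.tfstate
```

Inspect `recovered.tfstate` before pushing it back with `terraform state push` — restoring blindly can reintroduce the very drift you're recovering from.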
The Checklist
Before any terraform apply in production, verify:
- State is remote with locking enabled
- You're targeting the correct workspace/environment
- The plan output matches your expectations (read every line)
- Critical resources have `prevent_destroy` lifecycle rules
- Drift detection is running on a schedule
- State backups are enabled with versioning
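The plan-review step lends itself to a small wrapper, since `-detailed-exitcode` makes the outcome machine-readable (0 = no changes, 1 = error, 2 = changes pending). This guard is a sketch, not a substitute for reading the plan:

```shell
#!/bin/sh
# Pre-apply guard built on terraform's -detailed-exitcode semantics.
terraform plan -detailed-exitcode -out=review.plan
case $? in
  0) echo "No changes to apply." ;;
  1) echo "Plan failed; fix errors before applying." >&2; exit 1 ;;
  2) terraform show review.plan ;; # read every line, then: terraform apply review.plan
esac
```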
Terraform is a powerful tool. State management is what separates teams that use it successfully from teams that have $50K incidents on Friday afternoons. Treat your state files with the same care you treat your production databases — because they control them.