Terraform State Management Lessons We Learned the Hard Way
The $50K State File Incident
A client's engineer ran terraform apply on a Friday afternoon. The state file had drifted from reality because someone had made manual changes in the AWS console. Terraform's plan showed "47 resources to destroy and recreate." The engineer, in a hurry, approved it.
Forty-seven resources — including three production RDS instances — were destroyed and recreated. The databases came back empty. Four hours of downtime. $50K in lost revenue. The backups worked (thankfully), but the recovery took until 2am.
All because of a state file that nobody was actively managing.
State Management Rule 1: Remote State, Always
If your Terraform state file lives on someone's laptop, it's not a matter of if you'll lose it — it's when.
```hcl
# backend.tf — this is non-negotiable
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}
```

The S3 backend with DynamoDB locking gives you three critical things:
- Shared access: Everyone works with the same state
- Locking: Two people can't modify state simultaneously
- Encryption: State contains secrets (RDS passwords, API keys)
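The backend bucket and lock table have to exist before any configuration can point at them, so they're usually bootstrapped once in a tiny separate configuration (a chicken-and-egg problem: the state bucket can't store its own state before it exists). A minimal sketch, with illustrative names matching the backend above — note that the S3 backend requires the DynamoDB table's hash key to be named exactly `LockID`:

```hcl
# Bootstrap configuration for the backend itself. Names are illustrative.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-state"
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend requires exactly this attribute name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```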
State Management Rule 2: State Isolation
One giant state file for your entire infrastructure is a recipe for disaster. Split state by environment and component:
```
terraform/
├── modules/                  # Reusable modules
│   ├── networking/
│   ├── database/
│   └── compute/
├── environments/
│   ├── prod/
│   │   ├── networking/       # State: prod/networking/terraform.tfstate
│   │   ├── database/         # State: prod/database/terraform.tfstate
│   │   ├── compute/          # State: prod/compute/terraform.tfstate
│   │   └── monitoring/       # State: prod/monitoring/terraform.tfstate
│   ├── staging/
│   │   ├── networking/
│   │   ├── database/
│   │   └── compute/
│   └── dev/
│       └── ...
```
Each component has its own state file. Benefits:
- Blast radius: A bad `apply` in `compute` can't destroy your `database`
- Speed: Small state files mean fast plan/apply cycles
- Team parallelism: Different teams can work on different components simultaneously
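Split states can still share data. One common pattern is the `terraform_remote_state` data source, which lets `compute` read outputs published by `networking` without merging their state files. A sketch, assuming the bucket layout above (`private_subnet_id` is a hypothetical output name):

```hcl
# compute reads networking's published outputs from its remote state.
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "company-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_id
}
```

This keeps the dependency explicit and one-directional: `compute` depends on `networking`, never the reverse.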
State Management Rule 3: Never Modify State Manually
When state drifts from reality, the temptation is to edit the state file directly. Don't. Use Terraform's built-in state commands:
```shell
# Import an existing resource into state
terraform import aws_instance.web i-1234567890abcdef0

# Move a resource to a new address (after refactoring)
terraform state mv aws_instance.old aws_instance.new

# Remove a resource from state (without destroying it)
terraform state rm aws_instance.legacy

# Show current state for a resource
terraform state show aws_instance.web
```

State Management Rule 4: Prevent Manual Changes
The $50K incident happened because someone made changes in the AWS console. Prevent this:
```hcl
# Prevent accidental destruction of critical resources
resource "aws_db_instance" "production" {
  # ... configuration ...

  lifecycle {
    prevent_destroy = true # Terraform will refuse to destroy this
  }
}

# Tag everything managed by Terraform
resource "aws_instance" "web" {
  # ... configuration ...

  tags = {
    ManagedBy   = "terraform"
    Environment = var.environment
    Component   = "compute"
    StateFile   = "prod/compute/terraform.tfstate"
  }
}
```

Set up AWS Config rules or SCPs to alert when someone modifies Terraform-managed resources manually.
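As one sketch of that idea, the AWS Config managed rule `REQUIRED_TAGS` can flag EC2 instances missing the `ManagedBy` tag, so console-created (unmanaged) resources surface quickly. The rule name and scope here are illustrative assumptions:

```hcl
# Illustrative: flag EC2 instances that lack ManagedBy = "terraform".
resource "aws_config_config_rule" "require_managedby_tag" {
  name = "require-managedby-tag"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS" # AWS-managed rule
  }

  input_parameters = jsonencode({
    tag1Key   = "ManagedBy"
    tag1Value = "terraform"
  })

  scope {
    compliance_resource_types = ["AWS::EC2::Instance"]
  }
}
```

Pair the rule with an EventBridge notification on noncompliance so the alert reaches a human, not just a dashboard.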
State Management Rule 5: Drift Detection
Don't wait for terraform plan to discover drift. Run automated drift detection:
```yaml
# .github/workflows/drift-detection.yml
name: Terraform Drift Detection
on:
  schedule:
    - cron: '0 8 * * 1-5' # Every weekday at 8am
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        component: [networking, database, compute, monitoring]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Plan (Drift Check)
        id: plan # referenced below; -detailed-exitcode exits 2 when drift exists
        run: |
          cd environments/prod/${{ matrix.component }}
          terraform init
          terraform plan -detailed-exitcode -out=drift.plan
        continue-on-error: true
      - name: Alert on Drift
        if: steps.plan.outcome == 'failure'
        run: |
          # Send Slack alert with drift details
          curl -X POST "$SLACK_WEBHOOK" -d "{
            \"text\": \"⚠️ Terraform drift detected in prod/${{ matrix.component }}\"
          }"
```

State Management Rule 6: State File Backups
S3 versioning gives you state file history, but also set up explicit backups:
```hcl
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    id     = "state-versions"
    status = "Enabled"

    noncurrent_version_expiration {
      noncurrent_days = 90 # Keep 90 days of state history
    }
  }
}
```

When something goes wrong (and it will), you can roll back to a previous state version.
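A rollback might look like this with the AWS CLI, assuming the bucket layout from Rule 1 (the version ID placeholder is yours to fill in from the listing):

```shell
# List available versions of a state file
aws s3api list-object-versions \
  --bucket company-terraform-state \
  --prefix prod/networking/terraform.tfstate

# Download a known-good version for inspection
aws s3api get-object \
  --bucket company-terraform-state \
  --key prod/networking/terraform.tfstate \
  --version-id <VERSION_ID> \
  recovered.tfstate
```

Inspect `recovered.tfstate` before pushing it back with `terraform state push` — restoring blindly can reintroduce the very drift you're recovering from.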
The Checklist
Before any terraform apply in production, verify:
- State is remote with locking enabled
- You're targeting the correct workspace/environment
- The plan output matches your expectations (read every line)
- Critical resources have `prevent_destroy` lifecycle rules
- Drift detection is running on a schedule
- State backups are enabled with versioning
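The plan-review step lends itself to a small wrapper, since `-detailed-exitcode` makes the outcome machine-readable (0 = no changes, 1 = error, 2 = changes pending). This guard is a sketch, not a substitute for reading the plan:

```shell
#!/bin/sh
# Pre-apply guard built on terraform's -detailed-exitcode semantics.
terraform plan -detailed-exitcode -out=review.plan
case $? in
  0) echo "No changes to apply." ;;
  1) echo "Plan failed; fix errors before applying." >&2; exit 1 ;;
  2) terraform show review.plan ;; # read every line, then: terraform apply review.plan
esac
```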
Terraform is a powerful tool. State management is what separates teams that use it successfully from teams that have $50K incidents on Friday afternoons. Treat your state files with the same care you treat your production databases — because they control them.