Deployment Rollback Strategies for AWS

Published January 9, 2026. Last updated April 15, 2026.

A deployment rollback reverts your application to a previously known-good version after a failed or problematic release. On AWS, the specifics depend on your compute platform (ECS, EKS, Lambda, EC2), but the core principle is the same: get back to working code fast, then figure out what went wrong.

Most teams spend their CI/CD effort on making deployments faster and more reliable. That’s good. But the difference between a 30-second incident and a 2-hour outage usually comes down to how quickly you can undo a bad deploy, not how rarely you ship one.

Why Recovery Time Beats Prevention

You can’t prevent every bad deployment. A config change passes all tests but breaks under production traffic patterns. A dependency update introduces a subtle memory leak that only surfaces after 20 minutes. An IAM policy change blocks access to a downstream service.

What you can control is your mean time to recovery (MTTR). Teams with solid rollback strategies measure incidents in minutes. Teams without them measure incidents in hours, because someone is SSH-ing into a box, manually reverting a task definition, and hoping they picked the right revision.

The goal isn’t zero failed deployments. It’s making failed deployments boring.

ECS Rolling Update Rollbacks

ECS uses rolling updates by default. When you update a service, ECS launches new tasks with the new task definition, waits for them to pass health checks, then drains old tasks. If the new tasks fail health checks, ECS stops the rollout.

Circuit Breaker

ECS has a built-in deployment circuit breaker. When enabled, it tracks how many tasks fail to reach a healthy state. If failures exceed a threshold, ECS automatically rolls back to the previous task definition.

{
  "deploymentConfiguration": {
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "minimumHealthyPercent": 100,
    "maximumPercent": 200
  }
}

Set minimumHealthyPercent to 100 so ECS keeps all existing tasks running until new ones are healthy, and maximumPercent to 200 so it has headroom to launch the replacement tasks alongside them. This means your service never drops below full capacity during a deploy.
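You can also enable the circuit breaker on an existing service from the CLI without touching the JSON definition. A sketch, using the cluster and service names from the examples below:

```shell
# Enable the deployment circuit breaker with automatic rollback
# on an existing ECS service (shorthand syntax for the nested structure).
aws ecs update-service \
  --cluster my-cluster \
  --service api-service \
  --deployment-configuration \
    'deploymentCircuitBreaker={enable=true,rollback=true},minimumHealthyPercent=100,maximumPercent=200'
```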

Manual Task Definition Revert

If you need to roll back manually, you can update the service to point at a previous task definition revision:

aws ecs update-service \
  --cluster my-cluster \
  --service api-service \
  --task-definition api:42  # previous known-good revision

This triggers a new rolling deployment using the old task definition. It works, but it’s slow if you’re doing it by hand at 2am. And it doesn’t address whatever infrastructure changes (Terraform, CloudFormation) accompanied the original deploy.
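Half the battle at 2am is finding the right revision number. A couple of read-only commands help, assuming a task definition family named api:

```shell
# What is the service running right now?
aws ecs describe-services \
  --cluster my-cluster \
  --services api-service \
  --query 'services[0].taskDefinition'

# List the most recent revisions in the family, newest first,
# so you can pick the one before the bad deploy.
aws ecs list-task-definitions \
  --family-prefix api \
  --sort DESC \
  --max-items 5
```

Pinning these commands in a runbook beats reconstructing them under pressure.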

EKS and Helm Rollbacks

Kubernetes tracks revision history for deployments natively. If you deploy a bad image, you can roll back with one command:

kubectl rollout undo deployment/api-server -n production

This reverts to the previous ReplicaSet. Kubernetes handles the rolling update back to the old pod spec.
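A plain undo only goes back one step. If the previous revision is also bad, or you need to skip back further, inspect the history first and target a specific revision (the revision number 3 here is illustrative):

```shell
# Show the recorded revisions for the deployment.
kubectl rollout history deployment/api-server -n production

# Roll back to a specific revision rather than just the previous one.
kubectl rollout undo deployment/api-server -n production --to-revision=3

# Watch the rollback progress until it completes.
kubectl rollout status deployment/api-server -n production
```

Note that Kubernetes keeps only a limited history (revisionHistoryLimit, 10 by default), so very old revisions may no longer be available.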

Helm Atomic Deploys

If you manage Kubernetes releases with Helm, the --atomic flag is worth using. It tells Helm to automatically roll back the entire release if any part of the upgrade fails, including config maps, services, and deployments. (One caveat: CRDs shipped in a chart's crds/ directory are installed once and are not upgraded or rolled back by Helm.)

helm upgrade myapp ./charts/myapp \
  --namespace production \
  --atomic \
  --timeout 10m

The rollback happens at the Helm release level, so you get a consistent revert of everything in the chart. Not just the deployment, but the associated services, ingress rules, and config.
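When --atomic doesn't fire (say the upgrade succeeded but the release is misbehaving), you can roll the release back manually. A sketch, where the revision number 4 is illustrative:

```shell
# List release revisions to find the last known-good one.
helm history myapp -n production

# Roll back to that revision and wait for resources to become ready.
helm rollback myapp 4 -n production --wait --timeout 10m
```

Running `helm history` first matters: a Helm rollback creates a new revision, so "the previous one" shifts every time you touch the release.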

Automated Rollbacks with CloudWatch Alarms

Manual rollbacks require someone to notice the problem, log in, and take action. Automated rollbacks remove that bottleneck by tying your deploy pipeline to health metrics.

The pattern: create a CloudWatch alarm on a metric that reliably indicates your service is broken (5xx error rate, p99 latency, healthy host count). Wire that alarm to your deployment system. When the alarm fires during or after a deploy, trigger an automatic rollback.

Good alarm metrics for rollback triggers:

  • 5xx error rate on your ALB. Catches application crashes and unhandled exceptions introduced by the new code.
  • P99 latency. Detects performance regressions that wouldn’t show up as errors.
  • Healthy host count in your target group. Catches containers that fail to start or fail health checks.
  • Custom application metrics. Business-specific signals like order processing rate or queue depth.

Here’s a Terraform example for a 5xx alarm:

resource "aws_cloudwatch_metric_alarm" "service_health" {
  alarm_name          = "api-health-alarm"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 10

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
    TargetGroup  = aws_lb_target_group.api.arn_suffix
  }
}

Two evaluation periods of 60 seconds means the alarm fires after about two minutes of sustained errors. That’s a reasonable balance between catching real problems and ignoring brief spikes.
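For ECS specifically, you don't need custom glue to wire this alarm to the deploy: ECS rolling deployments can watch CloudWatch alarms directly via the alarms block in deploymentConfiguration. A sketch, referencing the alarm name defined above:

```
{
  "deploymentConfiguration": {
    "alarms": {
      "alarmNames": ["api-health-alarm"],
      "enable": true,
      "rollback": true
    }
  }
}
```

With this in place, ECS keeps monitoring the alarm during the deployment's bake period and rolls back to the previous task definition if it fires, complementing the circuit breaker (which only catches tasks that fail to start or fail health checks).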

Rollback with Dev Ramps

Dev Ramps has two built-in rollback mechanisms that handle the orchestration for you.

Auto-Rollback

Add auto_rollback_alarm_name to a stage in your pipeline.yaml, pointing at a CloudWatch alarm in your target account:

stages:
  - name: production
    account_id: "222222222222"
    region: us-east-1
    auto_rollback_alarm_name: api-health-alarm

When that alarm fires during a deployment, Dev Ramps cancels the in-progress deploy and rolls back to the last successfully deployed revision. It skips Terraform during the rollback to keep it fast and avoid unintended infrastructure changes. Both the current stage and the next stage are blocked from receiving new deploys until you manually unblock them. That blocking is intentional: it prevents the broken code from auto-promoting further down the pipeline.

If the alarm fires after a deployment has already succeeded, Dev Ramps blocks the stage from new deployments until the alarm clears. This gives you time to investigate without a new push accidentally making things worse.

Emergency Rollback

For situations where you need to manually intervene, the dashboard has an Emergency Deploy button. Select “Rollback,” pick the target revision, and optionally skip specific steps. Dev Ramps cancels any in-progress deployment, re-runs all deployment steps with the older revision, and blocks automatic promotions until you explicitly unblock.

You can also roll forward if you’ve already pushed a fix and want to skip ahead, deploying it directly to the affected stage without waiting for it to work through earlier stages.

All of these operations are recorded in the audit trail, including who performed them and why.

For more details, see the auto-rollback and managing deployments docs.

Limiting Blast Radius with Sequential Stages

A rollback strategy isn’t just about reverting. It’s about limiting how many users see a bad deploy before you catch it.

If you deploy to three AWS regions, don’t deploy to all of them simultaneously. Use sequential stages so the first region acts as a canary:

stages:
  - name: prod-us-east-1
    account_id: "222222222222"
    region: us-east-1
    auto_rollback_alarm_name: api-health-us-east-1

  - name: prod-us-west-2
    account_id: "222222222222"
    region: us-west-2
    auto_rollback_alarm_name: api-health-us-west-2

  - name: prod-eu-west-1
    account_id: "222222222222"
    region: eu-west-1
    auto_rollback_alarm_name: api-health-eu-west-1

If the deploy breaks in us-east-1, the auto-rollback fires and blocks promotion to the remaining regions. Only a fraction of your traffic was affected.

Testing Your Rollback

A rollback strategy you’ve never tested is a rollback strategy that doesn’t work. Specifically:

  1. Deploy a known-bad version to staging. Intentionally ship something that fails health checks and confirm that your circuit breaker or auto-rollback kicks in. Check that it reverts to the correct revision, not just any previous one.
  2. Time it. If your rollback takes 15 minutes, that’s 15 minutes of downtime (or degraded service) every time a bad deploy gets through. Know that number.
  3. Test the manual path too. Automated rollback covers the common case. But when automation fails, someone needs to know how to roll back by hand. Run a drill where an engineer does it from the dashboard or CLI.
  4. Verify after rollback. Check that the old version actually works after being re-deployed. Task definitions that reference deleted ECR images or expired secrets will fail on rollback.
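The last point is easy to drill proactively: before relying on a revision as your rollback target, confirm its image still exists in ECR. A sketch, where the repository name api and tag v41 are illustrative:

```shell
# Verify the rollback target's image hasn't been expired
# by an ECR lifecycle policy; a missing image returns
# an ImageNotFoundException.
aws ecr describe-images \
  --repository-name api \
  --image-ids imageTag=v41 \
  --query 'imageDetails[0].imagePushedAt'
```

If you use ECR lifecycle policies to prune old images, make sure the retention window is longer than the oldest revision you'd realistically roll back to.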

Conclusion

The best rollback strategy combines automated health-check-driven rollbacks for the common case with a clear manual process for everything else. On AWS, that means ECS circuit breakers or CloudWatch alarm-triggered rollbacks for automated recovery, sequential stage promotion to limit blast radius, and a tested manual procedure for when things go sideways in ways you didn’t predict.

Start with a single CloudWatch alarm on your 5xx rate and wire it to your deployment pipeline. You can get more sophisticated later, but that one step covers most production incidents.