Autoscale ECS with SQS Queue: Why Step Scaling Leads to Disaster

Scaling Amazon ECS services based on an SQS queue is a common approach for processing asynchronous workloads. AWS even provides a blueprint for it in the Application Auto Scaling User Guide—but there’s a catch. Most engineers instinctively choose step scaling, thinking they can trigger additional ECS tasks when the queue depth increases.

The problem? Step scaling introduces thrashing, where ECS tasks scale up and down unpredictably, leading to poor performance and wasted AWS costs.

A much better approach is target tracking scaling with metric math. Instead of reacting to queue depth alone, this method adjusts the number of ECS tasks based on how much work each task is handling. This post explains why step scaling doesn’t work well for ECS and SQS and how target tracking provides a smarter, more stable solution.

Why Step Scaling Fails for ECS and SQS

Step Scaling Ignores Processing Time and Causes Thrashing

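Step scaling works from fixed queue depth thresholds: a CloudWatch alarm fires when a threshold is crossed, and the policy adds or removes a preset number of tasks. For example, you might configure a policy to: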

Scale out ECS tasks when the queue has more than 500 messages.
Scale in ECS tasks when the queue has fewer than 100 messages.

At first, this seems logical, but it has a serious flaw: queue depth alone does not determine how many tasks are needed.

Some messages take longer to process than others.
If messages arrive in bursts, ECS may over-scale and then rapidly scale back down, causing instability.
When scale-in happens too soon, tasks shut down before completing their work, leading to wasted processing.

This creates an effect called thrashing, where ECS tasks constantly start and stop, making the system unpredictable and inefficient.
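
To make that flaw concrete, here is a minimal Python sketch. The 500 and 100 thresholds come from the example policy above; the step size of 2 tasks, the per-message processing times, and the function names are assumptions made up for illustration, not part of any AWS API.

```python
SCALE_OUT_ABOVE = 500   # threshold from the example policy
SCALE_IN_BELOW = 100    # threshold from the example policy
STEP = 2                # assumed step size, chosen only for illustration


def step_adjustment(queue_depth: int) -> int:
    """Return the task-count change a threshold-only policy would request."""
    if queue_depth > SCALE_OUT_ABOVE:
        return STEP
    if queue_depth < SCALE_IN_BELOW:
        return -STEP
    return 0


def drain_seconds(queue_depth: int, tasks: int, seconds_per_message: float) -> float:
    """Rough time to clear the queue if no new messages arrive."""
    return queue_depth * seconds_per_message / tasks


# Two very different situations that share a queue depth of 600 messages.
slow = drain_seconds(600, tasks=2, seconds_per_message=7.0)
fast = drain_seconds(600, tasks=50, seconds_per_message=0.5)

print(step_adjustment(600), slow)  # same decision, long drain time
print(step_adjustment(600), fast)  # same decision, short drain time
```

Both cases receive the same scale-out decision, even though the first is badly under-provisioned and the second would clear the queue in seconds without any extra tasks.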

Step Scaling Doesn’t Match Real Workload Demands

Another issue with step scaling is that it treats all workloads the same. If a system suddenly receives many short-lived messages, ECS might spin up too many tasks unnecessarily. On the other hand, if processing takes longer than expected, ECS might not scale up enough, causing delays in message processing.

Step scaling is too rigid to handle the dynamic nature of SQS-based workloads.

The Right Way: Target Tracking Scaling with Metric Math

Instead of reacting to queue depth, target tracking scaling focuses on how much work each ECS task is handling.

How Backlog Per Task Works

Rather than blindly scaling based on queue depth, the system calculates a backlog per task by dividing the total number of visible messages in the queue (the SQS metric ApproximateNumberOfMessagesVisible) by the number of currently running ECS tasks.

This tells AWS how much work each task is handling. Instead of saying, “Scale out at 500 messages,” it says, “Each task should handle a reasonable number of messages, and we’ll scale to maintain that balance.”
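
As a rough illustration, the ratio can be written as a small Python function. The function name and the sample numbers are hypothetical; in a real setup the two inputs would come from CloudWatch metrics rather than hard-coded values.

```python
def backlog_per_task(visible_messages: int, running_tasks: int) -> float:
    """Visible messages in the queue divided by running ECS tasks."""
    if running_tasks <= 0:
        # The ratio is undefined with no running tasks; this sketch
        # simply refuses to compute it.
        raise ValueError("running_tasks must be positive")
    return visible_messages / running_tasks


# The same queue depth means very different things at different task counts.
print(backlog_per_task(1700, 10))  # 170.0 messages waiting per task
print(backlog_per_task(1700, 20))  # 85.0 messages waiting per task
```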

How to Set the Right Target

To make this work, you need to define an acceptable backlog per task.

This is calculated based on:

The maximum acceptable processing delay for a message (for example, 10 minutes).
The average time it takes to process a single message (for example, 7 seconds).

To determine the acceptable backlog per task, divide the maximum processing delay by the average processing time.

For example, if messages must be processed within 600 seconds (10 minutes) and each message takes 7 seconds to process, then the acceptable backlog per task is 600 / 7 ≈ 85.7, rounded down to 85 messages. Target tracking then adds or removes tasks to keep each task's share of the queue close to 85 messages.
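
The arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and the task-count helper is only a back-of-the-envelope estimate of where the service should settle, not a description of how Application Auto Scaling computes its adjustments internally.

```python
import math


def acceptable_backlog_per_task(max_delay_s: float, avg_processing_s: float) -> int:
    """Messages one task can work through within the allowed delay."""
    if avg_processing_s <= 0:
        raise ValueError("avg_processing_s must be positive")
    # Round down so the estimate stays inside the delay budget.
    return math.floor(max_delay_s / avg_processing_s)


def tasks_for_backlog(visible_messages: int, target: int) -> int:
    """Smallest task count that keeps the backlog per task at or below target."""
    return -(-visible_messages // target)  # integer ceiling division


target = acceptable_backlog_per_task(600, 7)
print(target)                           # 85
print(tasks_for_backlog(1700, target))  # 20
```

With 1,700 visible messages and a target of 85, roughly 20 tasks are needed to stay within the 10-minute delay budget.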

Implementing Target Tracking Scaling in AWS

With AWS Application Auto Scaling, you can create a target tracking policy that automatically adjusts ECS tasks based on the backlog per task.

AWS CLI Example for Target Tracking Policy

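# Placeholders: queue-name, cluster-name, service-name. RunningTaskCount is only published when Container Insights is enabled on the cluster.
aws application-autoscaling put-scaling-policy \
    --service-namespace ecs \
    --resource-id service/cluster-name/service-name \
    --scalable-dimension ecs:service:DesiredCount \
    --policy-name "SQS-Backlog-Scaling" \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration '{
        "TargetValue": 85,
        "CustomizedMetricSpecification": {
            "Metrics": [
                {"Id": "m1", "ReturnData": false, "MetricStat": {"Stat": "Sum", "Metric": {
                    "Namespace": "AWS/SQS", "MetricName": "ApproximateNumberOfMessagesVisible",
                    "Dimensions": [{"Name": "QueueName", "Value": "queue-name"}]}}},
                {"Id": "m2", "ReturnData": false, "MetricStat": {"Stat": "Average", "Metric": {
                    "Namespace": "ECS/ContainerInsights", "MetricName": "RunningTaskCount",
                    "Dimensions": [{"Name": "ClusterName", "Value": "cluster-name"},
                                   {"Name": "ServiceName", "Value": "service-name"}]}}},
                {"Id": "e1", "ReturnData": true, "Expression": "m1 / m2", "Label": "BacklogPerTask"}
            ]
        },
        "ScaleInCooldown": 60,
        "ScaleOutCooldown": 120
    }'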

This ensures that ECS scales out when the backlog per task exceeds 85 and scales in when it drops below 85.
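
If you manage policies from Python rather than the CLI, a backlog-per-task configuration can be assembled as a plain dictionary. This is a sketch under assumptions: the function name and the queue, cluster, and service names are placeholders, and the RunningTaskCount metric is only available when Container Insights is enabled on the cluster.

```python
import json


def backlog_policy_config(queue_name: str, cluster_name: str, service_name: str,
                          target_value: float, scale_in_cooldown: int = 60,
                          scale_out_cooldown: int = 120) -> dict:
    """Build a target tracking configuration that divides queue depth by task count."""

    def metric(metric_id, namespace, name, dimensions, stat):
        # An input metric for the expression; it is not the tracked value itself.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": name,
                    "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
                },
                "Stat": stat,
            },
            "ReturnData": False,
        }

    return {
        "TargetValue": float(target_value),
        "CustomizedMetricSpecification": {
            "Metrics": [
                metric("m1", "AWS/SQS", "ApproximateNumberOfMessagesVisible",
                       {"QueueName": queue_name}, "Sum"),
                metric("m2", "ECS/ContainerInsights", "RunningTaskCount",
                       {"ClusterName": cluster_name, "ServiceName": service_name},
                       "Average"),
                # Only the expression result is returned and tracked.
                {"Id": "e1", "Expression": "m1 / m2",
                 "Label": "BacklogPerTask", "ReturnData": True},
            ]
        },
        "ScaleInCooldown": scale_in_cooldown,
        "ScaleOutCooldown": scale_out_cooldown,
    }


config = backlog_policy_config("queue-name", "cluster-name", "service-name", 85)
print(json.dumps(config, indent=2))
```

The resulting dictionary has the shape expected by the TargetTrackingScalingPolicyConfiguration parameter of boto3's application-autoscaling put_scaling_policy call, and json.dumps(config) yields a string that can be passed to the CLI flag of the same name.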

Why Target Tracking Is Better Than Step Scaling

No Thrashing: ECS scales smoothly instead of fluctuating wildly.
Better Performance: Messages are processed within the defined service level objective (SLO).
Lower Costs: ECS allocates resources based on actual workload needs, preventing over-provisioning.
Self-Adjusting: As queue depth changes, Application Auto Scaling adds or removes tasks to hold the backlog per task near the target. If the average processing time shifts, recalculate the target value.

Final Thoughts: Stop Using Step Scaling for ECS + SQS

If you’re autoscaling ECS with an SQS queue, step scaling is not the right tool. It’s too rigid, causes thrashing, and ignores actual message processing time.

Instead, use a target tracking scaling policy with metric math to dynamically adjust ECS tasks based on backlog per task. This ensures smooth scaling, optimal performance, and lower AWS costs.

If you’re still relying on step scaling, it’s time to rethink your approach. Target tracking is the thermostat for your workload, keeping it at the right level automatically.
