Automatically scaling AWS Fargate tasks vertically

When you create a task definition for Fargate, you must set the memory and CPU of the task (and optionally of the individual containers) before starting it.

For Fargate tasks I would suggest setting these values on the task level only and not on the container level. The containers are then free to distribute memory and CPU among themselves as they see fit, as long as the total stays within the task limit. On EC2 you can give the containers swap space, but Fargate does not support this.

For some tasks that load variable datasets it can be hard to estimate the required memory usage of a container beforehand. For instance, bigger datasets need more memory. In these cases, I often see customers maxing out the memory on their Fargate task definitions to make sure they can handle such loads.

But as soon as the memory requirement rises above 8GB you must use 2 vCPUs, and this can double the price of the task. This can get expensive quickly.

Your Fargate memory and CPU combinations must be within the supported ranges, otherwise you will get the error "No Fargate configuration exists for given values".
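To make the supported combinations concrete, here is a minimal Python sketch that checks a CPU/memory pair and finds the smallest valid pair for a given memory need. The table reflects the task-level Fargate sizes as I understand them from the AWS documentation at the time of writing, so please verify it against the current docs. The function names are made up for this example and are not part of the module.

```python
# Task-level CPU units mapped to the memory values (MiB) Fargate accepts for them.
# Assumption: values taken from the AWS docs at the time of writing; verify before use.
FARGATE_MEMORY_BY_CPU = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),        # 1-4 GB, 1 GB steps
    1024: list(range(2048, 8192 + 1, 1024)),       # 2-8 GB, 1 GB steps
    2048: list(range(4096, 16384 + 1, 1024)),      # 4-16 GB, 1 GB steps
    4096: list(range(8192, 30720 + 1, 1024)),      # 8-30 GB, 1 GB steps
    8192: list(range(16384, 61440 + 1, 4096)),     # 16-60 GB, 4 GB steps
    16384: list(range(32768, 122880 + 1, 8192)),   # 32-120 GB, 8 GB steps
}


def is_valid_fargate_config(cpu, memory_mib):
    """Return True if Fargate accepts this task-level CPU/memory pair."""
    return memory_mib in FARGATE_MEMORY_BY_CPU.get(cpu, [])


def smallest_config_for(memory_mib, min_cpu=256):
    """Return the smallest (cpu, memory_mib) pair that offers at least
    memory_mib without going below min_cpu, or None if nothing fits."""
    for cpu in sorted(FARGATE_MEMORY_BY_CPU):
        if cpu < min_cpu:
            continue
        for candidate in FARGATE_MEMORY_BY_CPU[cpu]:
            if candidate >= memory_mib:
                return cpu, candidate
    return None
```

For example, smallest_config_for(9216) lands on 2 vCPUs, because 1 vCPU stops at 8GB.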

That is why I created a Terraform module that vertically scales individual tasks automatically. It does not scale tasks that belong to a service.

How it works:

  1. It captures the RunTask commands via CloudTrail (if the RunTask command had the fargatevertiscale tag) and stores them in DynamoDB.
  2. It watches for tasks that exited due to an OOM error.
  3. If a task did, it doubles the task memory (and potentially the CPU to match) and restarts the task with the same task parameters.
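Steps 2 and 3 could be sketched in Python roughly as follows. This is not the module's actual code. It assumes the EventBridge "ECS Task State Change" event lists the containers under detail.containers and that an OOM-killed container has a reason starting with "OutOfMemoryError". The function names are hypothetical, and the sketch assumes the memory limit is expressed in MiB.

```python
DEFAULT_MAX_MEMORY_MIB = 120 * 1024  # the 120GB ceiling mentioned in this post


def stopped_due_to_oom(event):
    """Return True if a stopped-task event reports an out-of-memory kill.

    Assumption: containers killed for memory carry a reason that starts
    with "OutOfMemoryError" in the task state change event.
    """
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "STOPPED":
        return False
    return any(
        (container.get("reason") or "").startswith("OutOfMemoryError")
        for container in detail.get("containers", [])
    )


def doubled_memory(current_mib, max_mib=DEFAULT_MAX_MEMORY_MIB):
    """Return the memory for the retry, or None if the ceiling is reached."""
    if current_mib >= max_mib:
        return None  # already at the limit, so do not retry again
    return min(current_mib * 2, max_mib)
```

The doubled value still has to be matched to a CPU size that Fargate accepts for that amount of memory before the task is started again.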

You can set a maximum amount of memory via the max_memory variable. If you do not set it, the maximum is 120GB.

It uses EventBridge to monitor stopped tasks and a Lambda function that is provisioned for you. The Lambda needs permission to start tasks in all your clusters. It also needs permission to pass the task roles, which you grant through the iam_pass_roles variable. I would advise against setting this to [*] as it allows privilege escalation.

Please note that it is essential that your tasks are idempotent for this to work. If a container is halfway finished when it encounters an OOM error, the module restarts the task from the beginning. This means your tasks must be able to run without errors if they are executed multiple times.

Make sure your tasks fail fast if you use this module. For instance, load the dataset into memory as soon as the container starts. That way the container hits the OOM error early, before it has done any heavy lifting. Otherwise tasks can be delayed considerably, because they may need to be restarted a couple of times before they succeed.

This way you can start your tasks with the baseline memory they need for 90% of the workloads. The module could be improved, for instance by checking the memory profile of the application with a CloudWatch Logs Insights query.

We could analyse how the memory usage changes over time and use that delta to decide whether to double or quadruple the memory (this feature requires Container Insights to be enabled).

It is important that you add the fargatevertiscale tag when running the task itself.

Tag propagation from the task definition does not work. You must set the tag when you launch the Fargate task.
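As an illustration of setting the tag at launch time, the sketch below builds the parameters for an ECS RunTask call in Python. The tag key comes from this post. The tag value "true", the helper name and the example cluster, task definition and subnet are assumptions, so check the module's documentation for the value it expects.

```python
def build_run_task_request(cluster, task_definition, subnets, tag_value="true"):
    """Build RunTask parameters that carry the fargatevertiscale tag.

    Assumption: the tag value is a placeholder; this post only names the key.
    """
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "assignPublicIp": "DISABLED",
            }
        },
        # The tag must be on the RunTask call itself, not only on the task definition.
        "tags": [{"key": "fargatevertiscale", "value": tag_value}],
    }


request = build_run_task_request(
    cluster="my-cluster",                 # hypothetical cluster name
    task_definition="my-batch-job",       # hypothetical task definition family
    subnets=["subnet-0123456789abcdef0"], # hypothetical subnet ID
)
```

With boto3 installed, the request could then be sent with boto3.client("ecs").run_task(**request).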

You can download the module here.

PS: We need to use a CloudTrail event to get the request parameters with which the task was invoked. We store the invocation data with the task ID in DynamoDB so we can re-run the invocation when needed. At first I tried to use tags on the ECS task, but unfortunately AWS purges the tags from the task when it stops. If you already have a CloudTrail trail enabled, you can set the variable setup_cloud_trail to false.
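To give an idea of what is stored, here is a rough Python sketch that turns a CloudTrail RunTask event (as delivered through EventBridge) into records keyed by task ID. It assumes the event exposes detail.requestParameters and detail.responseElements.tasks[].taskArn. The function name and the record attribute names are hypothetical and may differ from what the module uses.

```python
import json

TAG_KEY = "fargatevertiscale"


def invocation_records(event):
    """Return one record per started task, or an empty list if the call
    was not a RunTask call or did not carry the fargatevertiscale tag."""
    detail = event.get("detail", {})
    if detail.get("eventName") != "RunTask":
        return []
    params = detail.get("requestParameters") or {}
    tags = params.get("tags") or []
    if not any(tag.get("key") == TAG_KEY for tag in tags):
        return []
    tasks = (detail.get("responseElements") or {}).get("tasks") or []
    return [
        {
            # The task ID is the last segment of the task ARN.
            "task_id": task["taskArn"].rsplit("/", 1)[-1],
            # Keep the original request so it can be replayed with more memory.
            "request": json.dumps(params, sort_keys=True),
        }
        for task in tasks
        if "taskArn" in task
    ]
```

Each record could then be written to the DynamoDB table and looked up by task ID when the matching stop event arrives.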