Reduce AWS Fargate pull times with SOCI

One of the major drawbacks of AWS Fargate is that image pull times are relatively slow compared to EC2. EC2 nodes can keep a local image cache on the instance. Fargate is serverless compute, so it does not offer this cache.

Slow pull times can be troublesome if you have spiky loads. If it takes 60 seconds for the container to launch and 10 seconds for the load balancer health checks to pass, new capacity only arrives 70 seconds after the scaling event (and with target tracking the scaling event itself can take another 3 minutes to trigger). Faster pull times allow your application to scale quicker to meet demand.

There have been some improvements in this space, notably SOCI (Seekable OCI). This is an open source technology by AWS that stores an image index in the repository, which allows for faster pull times in Fargate.

I did some benchmarking with a PHP container of 413 MB, running a simple artisan command and measuring the time between task creation and task start. There is no load balancer involved; adding one could change the results because of the health checks.
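The durations in the tables below are the difference between the start time and the creation time of each task. A minimal sketch of that calculation, assuming the two values are ISO 8601 timestamps in the format shown in the tables (the function names are hypothetical, chosen for illustration):

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as 2024-04-10T12:33:42.190Z."""
    # Older Python versions reject a trailing "Z" in fromisoformat(),
    # so replace it with an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def startup_seconds(created_at: str, started_at: str) -> float:
    """Seconds between task creation and task start."""
    delta = parse_timestamp(started_at) - parse_timestamp(created_at)
    return delta.total_seconds()


# First row of the "Pulling without SOCI" table.
duration = startup_seconds("2024-04-10T12:33:42.190Z", "2024-04-10T12:34:43.687Z")
print(f"{duration:.1f} seconds")  # prints "61.5 seconds"
```

The tables list whole seconds, so the values there differ from this output by a fraction of a second.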

Pulling without SOCI

Starting the container in a private subnet (with a NAT instance):

Creation time             Start time                Duration
2024-04-10T12:33:42.190Z  2024-04-10T12:34:43.687Z  61 seconds
2024-04-10T12:33:39.054Z  2024-04-10T12:34:38.749Z  59 seconds
2024-04-10T12:33:42.233Z  2024-04-10T12:34:35.717Z  53 seconds
2024-04-10T12:33:40.833Z  2024-04-10T12:34:39.173Z  59 seconds
2024-04-10T12:33:39.402Z  2024-04-10T12:34:41.113Z  62 seconds
2024-04-10T12:33:41.342Z  2024-04-10T12:34:36.574Z  55 seconds
2024-04-10T12:33:41.342Z  2024-04-10T12:34:36.574Z  55 seconds
2024-04-10T12:33:39.964Z  2024-04-10T12:34:35.209Z  56 seconds
2024-04-10T12:33:37.956Z  2024-04-10T12:34:33.268Z  56 seconds
2024-04-10T12:33:38.655Z  2024-04-10T12:34:31.864Z  53 seconds

Pulling without SOCI (using VPC endpoints)

Initially I thought I could get faster pull times by setting up a Gateway S3 endpoint (which you should have by default) and VPC endpoints (ecr.dkr and ecr.api). However, it turns out this is not the case:

Creation time             Start time                Duration
2024-04-10T12:23:47.643Z  2024-04-10T12:24:42.853Z  55 seconds
2024-04-10T12:23:50.045Z  2024-04-10T12:24:41.766Z  51 seconds
2024-04-10T12:23:53.436Z  2024-04-10T12:24:49.033Z  56 seconds
2024-04-10T12:23:51.992Z  2024-04-10T12:24:50.407Z  59 seconds
2024-04-10T12:23:50.430Z  2024-04-10T12:24:49.354Z  59 seconds
2024-04-10T12:23:50.759Z  2024-04-10T12:24:43.958Z  53 seconds
2024-04-10T12:23:52.621Z  2024-04-10T12:24:54.872Z  62 seconds
2024-04-10T12:23:52.580Z  2024-04-10T12:24:48.212Z  56 seconds
2024-04-10T12:23:51.428Z  2024-04-10T12:24:48.406Z  57 seconds
2024-04-10T12:23:49.307Z  2024-04-10T12:24:37.515Z  48 seconds

We can verify that the VPC endpoint is indeed being used by running a traceroute from inside the VPC:

This session is encrypted using AWS KMS.
sh-5.2$ traceroute 533267114484.dkr.ecr.eu-west-1.amazonaws.com
traceroute to 533267114484.dkr.ecr.eu-west-1.amazonaws.com (10.0.43.48), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *

The IP address 10.0.43.48 corresponds with the VPC endpoint IP address:

[Screenshot: VPC endpoint IP address]

Pulling containers using SOCI

Now we set up SOCI by deploying the following CloudFormation template. It monitors your ECR repositories for pushes and automatically creates the SOCI index in the ECR repository.

There are alternative ways to set up SOCI (for example in your CI/CD pipeline), but they are outside the scope of this article. After deploying the CloudFormation template, push an image to ECR. Two new artifact types appear in the repository after the push, a SOCI Index and an Image Index:

[Screenshot: A SOCI Index and an Image Index]

Make sure your Fargate platform version is 1.4.0 or higher. We can now pull the containers again from Fargate (no further changes are needed):

Creation time             Start time                Duration
2024-04-10T12:15:49.682Z  2024-04-10T12:16:10.607Z  21 seconds
2024-04-10T12:15:47.849Z  2024-04-10T12:16:08.313Z  21 seconds
2024-04-10T12:15:48.874Z  2024-04-10T12:16:08.970Z  20 seconds
2024-04-10T12:15:48.874Z  2024-04-10T12:16:10.845Z  22 seconds
2024-04-10T12:15:49.308Z  2024-04-10T12:16:11.618Z  22 seconds
2024-04-10T12:15:52.370Z  2024-04-10T12:16:13.479Z  21 seconds
2024-04-10T12:15:48.558Z  2024-04-10T12:16:10.845Z  22 seconds
2024-04-10T12:15:51.273Z  2024-04-10T12:16:10.480Z  19 seconds
2024-04-10T12:15:52.169Z  2024-04-10T12:16:20.487Z  28 seconds
2024-04-10T12:15:50.144Z  2024-04-10T12:16:15.148Z  25 seconds

Pulls with SOCI enabled take about 40% of the original time (average 22 seconds versus 57 seconds, a reduction of roughly 60%)
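As a sanity check on these averages, here is a small sketch that recomputes them from the whole-second durations in the NAT instance table and the SOCI table. The variable names are just for illustration.

```python
from statistics import mean

# Durations in seconds, copied from the tables above.
without_soci = [61, 59, 53, 59, 62, 55, 55, 56, 56, 53]
with_soci = [21, 21, 20, 22, 22, 21, 22, 19, 28, 25]

avg_without = mean(without_soci)
avg_with = mean(with_soci)

# Share of the original startup time that remains with SOCI enabled,
# and the corresponding reduction.
ratio = avg_with / avg_without
reduction = 1 - ratio

print(f"without SOCI: {avg_without:.1f} s")  # 56.9 s
print(f"with SOCI:    {avg_with:.1f} s")     # 22.1 s
print(f"remaining:    {ratio:.0%}")          # 39%
print(f"reduction:    {reduction:.0%}")      # 61%
```

Ten samples per scenario is a small data set, so treat these percentages as indicative only.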

I expect that with even bigger containers the results will be even more impressive. AWS has a blog post with a container of 1333 MB in which they report 50% faster pull times. Note that SOCI is only worth it if your container image is larger than 250 MB.

This is a notable difference, especially for services scaling behind a load balancer. Assume you configured your load balancer to require two successful health checks with an interval of 5 seconds. This means that:

  • Without using SOCI your container would be accepting requests after 10 seconds + 57 seconds = 67 seconds
  • Using SOCI your container would be accepting requests after 10 seconds + 22 seconds = 32 seconds
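The arithmetic in the two bullets above can be written as a small helper. This is only a sketch: the function and parameter names are hypothetical, and it assumes the first health check fires one full interval after the container starts.

```python
def seconds_until_serving(
    startup_seconds: int,
    health_check_interval: int = 5,
    healthy_threshold: int = 2,
    scaling_delay_seconds: int = 0,
) -> int:
    """Estimate the time before a new container accepts requests.

    scaling_delay_seconds models the time before the scaling event fires,
    for example up to about 180 seconds with target tracking.
    """
    health_check_seconds = health_check_interval * healthy_threshold
    return scaling_delay_seconds + startup_seconds + health_check_seconds


print(seconds_until_serving(57))  # without SOCI: 67
print(seconds_until_serving(22))  # with SOCI: 32

# Including a 3 minute target tracking delay in both cases.
print(seconds_until_serving(57, scaling_delay_seconds=180))  # 247
print(seconds_until_serving(22, scaling_delay_seconds=180))  # 212
```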

This can make all the difference when you have spiky workloads, especially with target tracking, where it can take 3 minutes for the alarm to activate. Faster pull times allow you to respond more quickly to load.

To stop using SOCI indexes, simply delete the SOCI Index and the Image Index from the ECR repository. Note that SOCI currently only works with private repositories.

Some more tips to improve your container pull times:

  1. Never use Docker Hub images directly. They are pulled through the NAT gateway or NAT instance, which causes data transfer costs and very slow pull times, and they are flaky due to Docker Hub rate limits. Set up pull through cache rules with ECR and Docker Hub instead.
  2. To improve scalability behind a load balancer, set the health check interval to the minimum of 5 seconds and the healthy threshold count to 2.
  3. Reduce the size of your image by basing it on alpine, or, when you are using PHP extensions, use this excellent open source library that also cleans up the layers after the build.
  4. Separate your workloads, e.g. for PHP use the php-cli base image for workers and the php-apache base image for web server containers (the web server container will be larger).
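For tip 2, the health check settings could be applied with boto3 along the lines below. This is a sketch rather than a definitive setup: the target group ARN is a placeholder, the helper names are hypothetical, and the timeout of 2 seconds is my own assumption to keep it below the 5 second interval.

```python
def fast_health_check_settings(target_group_arn: str) -> dict:
    """Build keyword arguments for the ELBv2 modify_target_group call."""
    return {
        "TargetGroupArn": target_group_arn,
        "HealthCheckIntervalSeconds": 5,
        "HealthyThresholdCount": 2,
        # Assumption, not from the article: keep the timeout below the interval.
        "HealthCheckTimeoutSeconds": 2,
    }


def apply_settings(target_group_arn: str) -> None:
    """Apply the settings to a real target group (needs AWS credentials)."""
    # Imported here so that building the settings does not require boto3.
    import boto3

    client = boto3.client("elbv2")
    client.modify_target_group(**fast_health_check_settings(target_group_arn))


# Placeholder ARN for illustration only.
settings = fast_health_check_settings(
    "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/example/0123456789abcdef"
)
print(settings["HealthCheckIntervalSeconds"] * settings["HealthyThresholdCount"])  # 10
```

Calling apply_settings changes a live target group, so try it against a test environment first.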

If you need single-digit pull times, you can use EC2-backed ECS instances. They can keep an image cache, so they can start containers almost instantly, but you must then maintain and scale the EC2 instances yourself. Or you can upvote this roadmap item for AWS and hope they introduce an image cache for Fargate!