One of the major drawbacks of AWS Fargate is that image pull times are relatively slow compared to EC2. EC2 container instances can keep a local image cache on the instance; because Fargate is serverless compute, it does not offer such a cache.
Slow pull times are troublesome if you have spiky loads. If it takes 60 seconds for the container to launch and 10 seconds for the load balancer health checks to pass, 70 seconds elapse between the scaling event and the first served request (and with target tracking, the scaling event itself can lag demand by another 3 minutes). Faster pull times let your application scale more quickly to meet demand.
There have been some improvements in this space, most notably SOCI (Seekable OCI): an open source technology from AWS that stores a lazy-loading index alongside the image in the repository, allowing Fargate to start the container before the full image has been downloaded.
I did some benchmarking with a PHP container of 413 MB, running a simple artisan command and measuring the task's creation and start times (no load balancer involved, as health checks would otherwise affect the numbers).
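The durations in the tables below are simply the difference between the task's creation timestamp and the container's start timestamp. A minimal sketch of that calculation (`pull_duration` is a hypothetical helper; the timestamps are copied from the first benchmark row):

```python
from datetime import datetime

def pull_duration(created_at: str, started_at: str) -> int:
    """Seconds between task creation and container start (rounded)."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f%z"
    created = datetime.strptime(created_at.replace("Z", "+0000"), fmt)
    started = datetime.strptime(started_at.replace("Z", "+0000"), fmt)
    return round((started - created).total_seconds())

# First row of the benchmark below:
print(pull_duration("2024-04-10T12:33:42.190Z", "2024-04-10T12:34:43.687Z"))  # 61
```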
Pulling without SOCI
Starting the container in a private subnet (with a NAT instance):
Creation time | Start time | Duration |
---|---|---|
2024-04-10T12:33:42.190Z | 2024-04-10T12:34:43.687Z | 61 seconds |
2024-04-10T12:33:39.054Z | 2024-04-10T12:34:38.749Z | 59 seconds |
2024-04-10T12:33:42.233Z | 2024-04-10T12:34:35.717Z | 53 seconds |
2024-04-10T12:33:40.833Z | 2024-04-10T12:34:39.173Z | 59 seconds |
2024-04-10T12:33:39.402Z | 2024-04-10T12:34:41.113Z | 62 seconds |
2024-04-10T12:33:41.342Z | 2024-04-10T12:34:36.574Z | 55 seconds |
2024-04-10T12:33:41.342Z | 2024-04-10T12:34:36.574Z | 55 seconds |
2024-04-10T12:33:39.964Z | 2024-04-10T12:34:35.209Z | 56 seconds |
2024-04-10T12:33:37.956Z | 2024-04-10T12:34:33.268Z | 56 seconds |
2024-04-10T12:33:38.655Z | 2024-04-10T12:34:31.864Z | 53 seconds |
Pulling without SOCI (using VPC endpoints)
Initially I thought I could get faster pull times by setting up an S3 Gateway endpoint (which you should have by default) and ECR VPC endpoints (ecr.dkr and ecr.api). However, it turns out this is not the case:
Creation time | Start time | Duration |
---|---|---|
2024-04-10T12:23:47.643Z | 2024-04-10T12:24:42.853Z | 55 seconds |
2024-04-10T12:23:50.045Z | 2024-04-10T12:24:41.766Z | 51 seconds |
2024-04-10T12:23:53.436Z | 2024-04-10T12:24:49.033Z | 56 seconds |
2024-04-10T12:23:51.992Z | 2024-04-10T12:24:50.407Z | 59 seconds |
2024-04-10T12:23:50.430Z | 2024-04-10T12:24:49.354Z | 59 seconds |
2024-04-10T12:23:50.759Z | 2024-04-10T12:24:43.958Z | 53 seconds |
2024-04-10T12:23:52.621Z | 2024-04-10T12:24:54.872Z | 62 seconds |
2024-04-10T12:23:52.580Z | 2024-04-10T12:24:48.212Z | 56 seconds |
2024-04-10T12:23:51.428Z | 2024-04-10T12:24:48.406Z | 57 seconds |
2024-04-10T12:23:49.307Z | 2024-04-10T12:24:37.515Z | 48 seconds |
We can verify that the VPC endpoint is indeed being used by running a traceroute from inside the VPC:
```
This session is encrypted using AWS KMS.
sh-5.2$ traceroute 533267114484.dkr.ecr.eu-west-1.amazonaws.com
traceroute to 533267114484.dkr.ecr.eu-west-1.amazonaws.com (10.0.43.48), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *
```
The IP address 10.0.43.48 corresponds with the VPC endpoint IP address:
VPC endpoint IP address
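As an additional check, you can verify from inside the VPC that the repository hostname resolves to a private address (the endpoint ENI) rather than a public ECR address. A small sketch, with the hostname and IP taken from the traceroute above (`resolves_to_private_ip` is a hypothetical helper):

```python
import ipaddress
import socket

def resolves_to_private_ip(hostname: str) -> bool:
    """True if the hostname resolves to a private (RFC 1918 / loopback) address."""
    ip = ipaddress.ip_address(socket.gethostbyname(hostname))
    return ip.is_private

# Inside the VPC this should return True, resolving to the endpoint ENI:
# resolves_to_private_ip("533267114484.dkr.ecr.eu-west-1.amazonaws.com")

# The address seen in the traceroute is indeed a private one:
print(ipaddress.ip_address("10.0.43.48").is_private)  # True
```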
Pulling containers using SOCI
Now we set up SOCI by deploying the following CloudFormation template. It monitors your ECR repositories for pushes and automatically creates the SOCI index in the repository.
There are alternative ways to set up SOCI (for example in your CI/CD pipeline), but they are outside the scope of this article. After deploying the CloudFormation template, push an image to ECR; two new artifact types appear in the repository once the push completes, a SOCI Index and an Image Index:
A SOCI Index and an Image Index
Make sure your Fargate platform version is 1.4.0 or higher. We can now pull the containers again from Fargate (no further changes are needed):
Creation time | Start time | Duration |
---|---|---|
2024-04-10T12:15:49.682Z | 2024-04-10T12:16:10.607Z | 21 seconds |
2024-04-10T12:15:47.849Z | 2024-04-10T12:16:08.313Z | 21 seconds |
2024-04-10T12:15:48.874Z | 2024-04-10T12:16:08.970Z | 20 seconds |
2024-04-10T12:15:48.874Z | 2024-04-10T12:16:10.845Z | 22 seconds |
2024-04-10T12:15:49.308Z | 2024-04-10T12:16:11.618Z | 22 seconds |
2024-04-10T12:15:52.370Z | 2024-04-10T12:16:13.479Z | 21 seconds |
2024-04-10T12:15:48.558Z | 2024-04-10T12:16:10.845Z | 22 seconds |
2024-04-10T12:15:51.273Z | 2024-04-10T12:16:10.480Z | 19 seconds |
2024-04-10T12:15:52.169Z | 2024-04-10T12:16:20.487Z | 28 seconds |
2024-04-10T12:15:50.144Z | 2024-04-10T12:16:15.148Z | 25 seconds |
With SOCI enabled, pulls average 22 seconds versus roughly 57 seconds without: about 60% faster.
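For reference, the averages quoted here can be reproduced from the tables above:

```python
from statistics import mean

# Durations (seconds) copied from the two benchmark tables above.
without_soci = [61, 59, 53, 59, 62, 55, 55, 56, 56, 53]
with_soci = [21, 21, 20, 22, 22, 21, 22, 19, 28, 25]

avg_without = mean(without_soci)  # 56.9
avg_with = mean(with_soci)        # 22.1
reduction = (1 - avg_with / avg_without) * 100

print(f"{avg_without:.0f}s -> {avg_with:.0f}s ({reduction:.0f}% faster)")
```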
I expect the results to be even more impressive with bigger containers: AWS has a blog post with a 1333 MB container reporting roughly 50% faster pull times. Note that SOCI is only worthwhile if your container image is larger than about 250 MB.
This is a notable difference, especially for services scaling behind a load balancer. Assume your load balancer requires two successful health checks at an interval of 5 seconds. This means that:
- Without using SOCI your container would be accepting requests after 10 seconds + 57 seconds = 67 seconds
- Using SOCI your container would be accepting requests after 10 seconds + 22 seconds = 32 seconds
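The same arithmetic as a small sketch, including the optional target-tracking alarm delay mentioned earlier (`time_to_serving` is a hypothetical helper; the constants mirror the health check configuration assumed above):

```python
HEALTH_CHECK_INTERVAL = 5  # seconds between health checks
HEALTHY_THRESHOLD = 2      # consecutive successful checks required

def time_to_serving(pull_seconds: int, alarm_delay: int = 0) -> int:
    """Seconds from the scaling trigger until the task receives traffic."""
    return alarm_delay + pull_seconds + HEALTH_CHECK_INTERVAL * HEALTHY_THRESHOLD

print(time_to_serving(57))  # without SOCI: 67
print(time_to_serving(22))  # with SOCI:    32

# With a 3-minute target-tracking alarm delay on top:
print(time_to_serving(57, alarm_delay=180))  # 247
print(time_to_serving(22, alarm_delay=180))  # 212
```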
This can make all the difference with spiky workloads: with target tracking it can take 3 minutes for the scaling alarm to fire, so faster pull times let you respond to load that much sooner.
To stop using SOCI, simply delete the SOCI Index and the Image Index from the ECR repository. (Note that SOCI currently works only with private repositories.)
Some more tips to improve your container pull times:
- Never pull images directly from Docker Hub: they come in through your NAT gateway or NAT instance, incurring data transfer costs and very slow pulls, and they are flaky due to Docker Hub rate limits. Set up ECR pull through cache rules for Docker Hub instead
- To scale faster behind a load balancer, set the health check interval to the minimum of 5 seconds and the healthy threshold to 2
- Reduce your image size by basing it on Alpine, or, if you use PHP extensions, use this excellent open source library that also cleans up the layers after the build
- Separate your workloads, e.g. for PHP use the php-cli base image for workers and the php-apache base image for web server containers (the web server image will be larger)
If you need single-digit pull times, you can use EC2 ECS instances: with a local image cache they can start containers almost instantly, but you then have to maintain and scale the EC2 instances yourself. Alternatively, upvote this roadmap item for AWS and hope they introduce an image cache for Fargate!