Manually fix your Terraform state file in case of emergencies

The golden rule of infrastructure as code is to never change the infrastructure manually. However, manual changes can happen by accident, leaving the infrastructure in an inconsistent state.

In this example I will go through the steps to manually repair your state file if this happens.

Consider the following example:

resource "cloudflare_zone" "example" {
  account_id = var.cloudflare_account_id
  zone       = "some-random-domain-to-test.com"
}

When setting this infrastructure up through Terragrunt, an S3 bucket for the state and a DynamoDB table to track the state file lock are created automatically.
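
For reference, a minimal sketch of such a Terragrunt remote_state block could look like the following; the bucket name, lock table name and region are hypothetical placeholders, not values from this setup:

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
  config = {
    # Hypothetical names: replace with your own bucket, region and lock table
    bucket         = "my-terraform-state-bucket"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "my-terraform-locks"
  }
}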

By deleting the state file from S3 and the lock row from DynamoDB, we can simulate the problem of a manual change in Cloudflare: the domain already exists in Cloudflare, but Terraform no longer knows about it.
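
The simulation itself comes down to two commands, sketched here with the same hypothetical bucket, state key and lock table names as above:

# Remove the state object from the S3 backend (hypothetical bucket and key)
aws s3 rm s3://my-terraform-state-bucket/cloudflare/terraform.tfstate

# Remove the digest row that the S3 backend keeps in the DynamoDB lock table
aws dynamodb delete-item \
  --table-name my-terraform-locks \
  --key '{"LockID": {"S": "my-terraform-state-bucket/cloudflare/terraform.tfstate-md5"}}'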

I am deliberately using an external provider like Cloudflare: such providers can crash during the plan phase because they rely on APIs that were not built for infrastructure automation in the first place.

We now get the following error after approving the apply:

This is a provider error, meaning that the provider crashed before it could run the apply.

Advised solution

We need to sync Terraform with the current situation to make the plan succeed. We can do this by importing the existing resource into our state file with the terraform import command.

For example to import the zone:

terraform import cloudflare_zone.example d8ad45e367bf09edeb16d55d61d70497
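
The zone ID can be copied from the Cloudflare dashboard or looked up through the Cloudflare API; after the import, a quick check should show the resource in the state and a clean plan. A sketch, assuming an API token is available in the CLOUDFLARE_API_TOKEN environment variable and jq is installed:

# Look up the zone ID by domain name
curl -s -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  "https://api.cloudflare.com/client/v4/zones?name=some-random-domain-to-test.com" \
  | jq -r '.result[0].id'

# Verify the import: the resource should now be in the state and the plan clean
terraform state show cloudflare_zone.example
terraform plan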

In most cases this resolves the issue and the provider crash along with it.

Break glass solution

In some cases, the provider crashes while inquiring about the resource. For instance, in an earlier version of the Cloudflare Terraform provider, changing a record in Cloudflare would result in a new ID for that record.

However, Terraform would still have the old record ID in its state file, and you could not get past a Record not found (1061) error; the provider would crash before you could even correct the issue. In some cases it can also be impossible to run a simple terraform import, for example when you deploy your infrastructure through a managed pipeline. In addition, not all resources are importable.

In this case, the only solution you have left is to manually change the ID of the record in the state file.
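
For orientation, the part of the state file you are after looks roughly like the snippet below; the resource type, name and ID value are illustrative, and real state files contain more attributes than shown here:

{
  "resources": [
    {
      "type": "cloudflare_record",
      "name": "example",
      "instances": [
        {
          "attributes": {
            "id": "old-record-id-goes-here"
          }
        }
      ]
    }
  ]
}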

To do this, execute the following steps (a sketch of the corresponding commands follows the list):

  1. Make a backup of your state file from S3 and store it somewhere safe (you should always have versioning enabled on your S3 state bucket!)
  2. Download the state file to your computer as a working copy
  3. Open the state file in an editor
  4. Replace the stale id in the JSON state file with the correct ID (you can find this through the API of the provider)
  5. Upload the corrected state file back to S3
  6. Create an MD5 hash of the contents of the new file
  7. Update the digest entry in the DynamoDB lock table to reflect this new MD5 hash
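
A sketch of these steps with the AWS CLI, again using the hypothetical bucket, key and table names from earlier; replace the hash placeholder with the output of the md5sum step:

# Steps 1 & 2: back up the state file and fetch a working copy
aws s3 cp s3://my-terraform-state-bucket/cloudflare/terraform.tfstate ./terraform.tfstate.backup
aws s3 cp s3://my-terraform-state-bucket/cloudflare/terraform.tfstate ./terraform.tfstate

# Steps 3 & 4: open ./terraform.tfstate in an editor and replace the stale "id" value

# Step 5: upload the corrected file back to the backend
aws s3 cp ./terraform.tfstate s3://my-terraform-state-bucket/cloudflare/terraform.tfstate

# Step 6: compute the MD5 hash of the corrected file
md5sum ./terraform.tfstate

# Step 7: write that hash into the digest item of the DynamoDB lock table
aws dynamodb put-item \
  --table-name my-terraform-locks \
  --item '{"LockID": {"S": "my-terraform-state-bucket/cloudflare/terraform.tfstate-md5"}, "Digest": {"S": "<md5-hash-from-step-6>"}}'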

After that you can re-run the plan and the apply.