The day we deleted our VM images

On August 9, we had a major incident on our Google Compute Engine setup. This was more than just an outage: our system wasn’t just down for an extended period of time; we also permanently lost the VM images used for builds running on GCE.

We are deeply sorry for the outage and especially all the trouble the aftermath has caused our users.

Now that some time has passed, we would like to take another look at what happened and give you some insight into the aftermath.

What happened?

When you use the hosted version of Travis CI, whether it is travis-ci.com or travis-ci.org, your build jobs run on one of three platforms:

  • A docker container inside an AWS EC2 instance. This is our default, and also where the largest portion of our builds are running. This setup also has the best boot times, thanks to docker.
  • A macOS or OS X VM on VMWare vSphere, running on dedicated hardware. Your builds run on this platform if you set the language to Objective-C or set the operating system to OS X.
  • A Linux VM running on Google Compute Engine. In this setup, the build runs directly on the VM instead of in a docker container. Our default setup does not allow the use of sudo, nor does it support using docker inside your build script. If you enable either of these features in your build configuration, our system will pick the GCE-based setup instead (roughly as sketched below).
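To make that routing a bit more concrete, here is a rough sketch of the decision in Python. The configuration keys mirror what you would put in your .travis.yml (language, os, sudo, services), but the function and platform names are purely illustrative and not our actual scheduler code.

```python
# Illustrative only: a simplified sketch of how a job-routing step could pick
# a platform from the parsed build configuration. Not our actual scheduler.

def pick_platform(config: dict) -> str:
    """Return the platform a job runs on, given its parsed .travis.yml."""
    language = config.get("language", "ruby")
    os_name = config.get("os", "linux")

    # Objective-C builds and builds targeting OS X go to the VMware vSphere pool.
    if language == "objective-c" or os_name == "osx":
        return "macos-vsphere"

    # Builds that need sudo or docker can't run in the default container setup,
    # so they are routed to full Linux VMs on Google Compute Engine.
    needs_sudo = config.get("sudo") in (True, "required")
    needs_docker = "docker" in config.get("services", [])
    if needs_sudo or needs_docker:
        return "linux-gce"

    # Everything else runs in a docker container on an EC2 instance (the default).
    return "linux-docker-ec2"


# Example: enabling the docker service routes the build to GCE.
assert pick_platform({"language": "go", "services": ["docker"]}) == "linux-gce"
```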

On GCE, we have limited storage for our VM images. To avoid running out of space, we have an automated cleanup service in place to delete images that have been removed from our internal image catalog service. You may already see where this is going.

Our last image update had been nine months earlier, which made rolling out a new one all the harder. We had new images in place, and some early adopters were using them, but the images hadn’t gone through full QA testing yet.

We were working on an improved image provisioning process to make continuous updates easier. This process runs on Travis CI itself. During development, we had started publishing many more images than we had in the previous 5-6 months.

In addition, our cleanup service had been briefly disabled to troubleshoot a potential race condition. We then turned the automated cleanup back on. The service had a hard-coded default for how many image names to query from our internal image catalog, and it was set to 100.

When we started the cleanup service, the list of 100 image names, sorted newest first, did not include our stable images, which were the oldest. Our cleanup service then promptly started deleting the older images from GCE: its view of the world told it that those images were no longer in our catalog and therefore no longer in use. All of our stable images were irrevocably deleted.
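Boiled down, the flaw looked roughly like the sketch below (simplified Python for illustration, not the actual cleanup service code):

```python
# Hypothetical reconstruction of the bug, not the real service. The catalog
# query is capped at a hard-coded 100 names, sorted newest first, so the
# oldest (stable) images fall off the list and look unreferenced.

HARD_CODED_LIMIT = 100  # the default baked into the cleanup service

def catalog_view(catalog, limit=HARD_CODED_LIMIT):
    """Return only the newest `limit` image names known to the catalog."""
    newest_first = sorted(catalog, key=lambda img: img["created_at"], reverse=True)
    return {img["name"] for img in newest_first[:limit]}

def select_for_deletion(gce_image_names, catalog):
    """Mark every GCE image the truncated catalog view doesn't mention."""
    referenced = catalog_view(catalog)
    return [name for name in gce_image_names if name not in referenced]

# With 100+ freshly published development images, the stable images are pushed
# out of the truncated result and end up in the deletion list.
```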

This immediately stopped builds from running. After identifying the issue, we realized that recreating the images exactly as they were was not possible, due to the way our provisioning process worked when we built them almost a year ago.

We therefore decided to roll forward to the available development images. This allowed us to have builds up and running approximately 90 minutes after the VM images had been deleted. Less than two hours later, we had made it through the backlog caused by the outage.

What was the impact for our customers?

A 90 minute, unscheduled outage is something we try our best to avoid. From large enterprises to hobby and side projects, proprietary and open source alike, our users rely on Travis CI being available and reliable when it comes to testing and deploying their software.

However, such an outage would probably not warrant a third blog post, almost two months later.

The main impact for our users, and for us, was the move to the new, not yet well-tested images. For some it was an improvement: the preinstalled software on our old VMs was nine months out of date, so many users had to update software as part of their build runs, adding to their overall build time.

But the new images broke the builds for many others. For instance, the first version of our new VMs shipped without docker-compose, even though running docker inside the VM is one of the main reasons our users switch from our EC2 setup.

Spike in emails sent to support@travis-ci.com.

At this point, breaking builds and giving users a bad experience was unavoidable, so we deemed it most important to address any issues as soon as possible. We decided to halt feature development for more than a week and moved to all-hands support: apart from urgent fixes and rolling out VM image updates, every developer was now on customer support.

What are we doing about it?

Countermeasures we already have in place:

  • Reproducible image provisioning process. This allows us to rebuild an exact image easily if it should be deleted. While we’re also putting other mechanisms in place, we found this to be faster than restoring from a backup.
  • More frequent image updates. The main reason this had such a major impact was that we hadn’t updated images in nine months.
  • Checks to make sure we don’t delete production images. Our system now has checks in multiple places to automatically prevent such deletions in the future (see the sketch after this list).
  • Automated testing of our VM images. We have created an automated test suite to run against our images. This will make sure that certain software is installed and set up properly on the images. We will keep improving these tests.
  • Disabled automatic cleanup. For now, no automatic VM image cleanup is running.
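As an illustration of the kind of checks we mean: the sketch below uses hypothetical names and thresholds, and the exact rules in our system differ, but the idea is the same: never delete anything explicitly protected, and refuse to run at all when a cleanup pass looks suspiciously large.

```python
# Illustrative guard around image deletion; the protected-name list and the
# threshold are assumptions for the example, not our production configuration.

class DeletionBlocked(Exception):
    """Raised when a cleanup run must stop and page a human instead."""

def guard_deletions(candidates, all_image_names, protected_names,
                    max_delete_ratio=0.2):
    """Return the candidates only if the run passes basic sanity checks."""
    # Check 1: explicitly protected (production/stable) images are never deleted.
    protected_hits = [name for name in candidates if name in protected_names]
    if protected_hits:
        raise DeletionBlocked(f"refusing to delete protected images: {protected_hits}")

    # Check 2: a run that would remove a large share of all images is almost
    # certainly a bug (for example a truncated catalog query), so abort.
    if len(candidates) > max_delete_ratio * len(all_image_names):
        raise DeletionBlocked(
            f"{len(candidates)} of {len(all_image_names)} images selected; aborting"
        )
    return candidates
```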

Steps planned or currently in progress:

  • Unified image release process. We are working on a new image rollout process, which will hopefully also do a better job of communicating to our users what updates we’re rolling out and when. We also want to make it easier for users to configure whether they want updates early or not. Expect updates on this in the near future.
  • Improved incident handling. Internally, we are putting a new on call schedule in place starting October 24. This includes a thorough explanation of incident handling procedures and responsibilities and has been greatly influenced by this incident.

We know this has not been a great experience for our users, and we know that in hindsight this could have been prevented. Again, we cannot apologize enough for the inconvenience and would like to thank everyone for their patience while we cleaned up the fallout.

If you have any further questions or feedback, send us an email.