Build VM boot failures on the sudo-enabled infrastructure: incident postmortem

On Monday, September 10, our sudo-enabled infrastructure experienced an outage that delayed, and in some cases prevented, the creation of new build VMs. The outage lasted approximately three hours and affected both public and private repositories. It had no impact, however, on jobs running in our container-based infrastructure or on macOS jobs.

We know how frustrating it can be when Travis CI is unavailable, and so we want to take the time to share what we’ve found while investigating this incident, and the steps we’re taking to improve the reliability of our service.

Background

When your build configuration specifies sudo: required (or you are using the Docker service), Travis CI runs your build in an isolated Google Compute Engine (GCE) virtual machine. These builds are managed by our worker agents, which run in the same infrastructure and control the creation and clean-up of build VMs. Because GCE is a public cloud offering, Google enforces quotas on CPUs, network use, and the number of instances.
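
For reference, a minimal build configuration that is routed to this infrastructure looks something like the sketch below; the language and the use of Docker here are illustrative placeholders, not a recommendation.

```yaml
# Minimal example of a build configuration that is routed to the
# sudo-enabled GCE infrastructure. The language and service choices
# below are illustrative placeholders.
language: ruby

# Requesting sudo opts the build into a full virtual machine:
sudo: required

# Using the Docker service has the same effect:
services:
  - docker
```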

What happened?

At around 17:13 UTC we were alerted that the backlog of jobs on GCE was growing. We immediately noticed that this correlated with a high number of jobs being requeued and a drop in our actual capacity to process them.

Our logs showed that workers were crashing due to an unhandled error in one of the functions responsible for instance creation. When a worker crashes, its running jobs are orphaned and their VMs are not cleaned up immediately. These stray VMs are eventually cleaned up, but that can take up to three hours. The cleanup was slow enough that we reached the quotas enforced by Google Cloud Platform, which prevented us from creating new instances.
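
To illustrate the failure mode, here is a hedged sketch in Go (not the actual travis-worker code): if the error returned by the instance-creation call goes unchecked, a brief provider outage becomes a nil dereference that panics and takes down the whole worker, along with every job it was running.

```go
// A hedged illustration, not travis-worker itself: how an unchecked error
// during instance creation can crash an entire worker process.
package main

import (
	"errors"
	"fmt"
	"log"
)

type Instance struct {
	Name string
}

// createInstance stands in for a call to the cloud provider's API; during a
// brief provider outage it returns an error and no instance.
func createInstance(name string) (*Instance, error) {
	return nil, errors.New("googleapi: Error 503: backend unavailable")
}

// startJobUnsafe ignores the error, so a provider hiccup becomes a nil
// dereference that panics and takes the worker down with all of its jobs.
func startJobUnsafe(name string) {
	inst, _ := createInstance(name)
	fmt.Println("booted", inst.Name) // panics when inst is nil
}

// startJobSafe handles the error, so the job can be retried or requeued
// without crashing the worker.
func startJobSafe(name string) error {
	inst, err := createInstance(name)
	if err != nil {
		return fmt.Errorf("instance creation failed, will retry: %w", err)
	}
	fmt.Println("booted", inst.Name)
	return nil
}

func main() {
	if err := startJobSafe("travis-job-1"); err != nil {
		log.Println(err)
	}
	// startJobUnsafe("travis-job-2") // would panic: nil pointer dereference
}
```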

By 17:38 UTC, we had a patch ready to prevent the worker crashes and began rolling it out. In addition to upgrading the workers, we cleaned up all of the non-terminated job VMs that had been leaked by the crashes. Patching and restarting the workers took until 20:58 UTC, at which point the systems were back to normal and the incident was closed.
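
For illustration, one way to sweep up leaked VMs is a small janitor that deletes build instances older than a cutoff. The sketch below uses the GCE Go client; the project, zone, instance-naming convention, and age threshold are assumptions for the example rather than our actual values.

```go
// A minimal sketch of a "janitor" that deletes build VMs older than a
// cutoff via the GCE API. Not our production cleanup code.
package main

import (
	"context"
	"fmt"
	"log"
	"strings"
	"time"

	compute "google.golang.org/api/compute/v1"
)

const (
	project = "example-project" // hypothetical project ID
	zone    = "us-central1-a"   // hypothetical zone
	maxAge  = 2 * time.Hour     // assumed cutoff for a stray build VM
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute service: %v", err)
	}

	list, err := svc.Instances.List(project, zone).Do()
	if err != nil {
		log.Fatalf("listing instances: %v", err)
	}

	for _, inst := range list.Items {
		// Only consider build VMs; this naming convention is hypothetical.
		if !strings.HasPrefix(inst.Name, "travis-job-") {
			continue
		}
		created, err := time.Parse(time.RFC3339, inst.CreationTimestamp)
		if err != nil {
			continue
		}
		if time.Since(created) > maxAge {
			fmt.Printf("deleting orphaned VM %s (age %s)\n", inst.Name, time.Since(created))
			if _, err := svc.Instances.Delete(project, zone, inst.Name).Do(); err != nil {
				log.Printf("delete failed for %s: %v", inst.Name, err)
			}
		}
	}
}
```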

Further investigation revealed that the root cause was a brief outage in Google Cloud's service, which triggered the error our code was unprepared for. That error caused our workers to crash, leak VMs, and re-deliver in-flight jobs to other workers, amplifying traffic and instance creation. This led to more crashes and timeouts, and eventually to us hitting our quota limits on GCP.

What have we learned?

In responding to and investigating this incident, we learned that we were not well prepared to deal with cascading worker failures. It became very clear that we need to invest in better cleanup for VMs and better resilience to crashes. We have been discussing this internally for a long time, and we are now committed to implementing a solution for resumable jobs: if a worker crashes, all of its in-progress jobs can be picked up and finished by a different worker instead of being requeued and restarted.
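
As a rough sketch of what resumable jobs could look like (one common approach, not necessarily the design we will ship): each worker periodically heartbeats the jobs it owns, and any worker can claim a job whose heartbeat has lapsed and resume it on the VM that is already running.

```go
// A minimal sketch of a heartbeat-and-claim model for resumable jobs.
// This is one common approach, not the final Travis CI design.
package main

import (
	"fmt"
	"sync"
	"time"
)

type Job struct {
	ID            string
	VM            string    // the build VM already provisioned for this job
	Owner         string    // worker currently responsible for the job
	LastHeartbeat time.Time // updated periodically by the owning worker
}

type JobStore struct {
	mu   sync.Mutex
	jobs map[string]*Job
}

// Heartbeat records that the owning worker is still alive and processing.
func (s *JobStore) Heartbeat(id, worker string, now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if j, ok := s.jobs[id]; ok && j.Owner == worker {
		j.LastHeartbeat = now
	}
}

// ClaimStale hands jobs whose heartbeat lapsed more than ttl ago to a new
// worker, keeping the existing VM so the build resumes rather than restarts.
func (s *JobStore) ClaimStale(newWorker string, ttl time.Duration, now time.Time) []*Job {
	s.mu.Lock()
	defer s.mu.Unlock()
	var claimed []*Job
	for _, j := range s.jobs {
		if now.Sub(j.LastHeartbeat) > ttl {
			j.Owner = newWorker
			j.LastHeartbeat = now
			claimed = append(claimed, j)
		}
	}
	return claimed
}

func main() {
	store := &JobStore{jobs: map[string]*Job{
		"job-42": {ID: "job-42", VM: "travis-job-42", Owner: "worker-a",
			LastHeartbeat: time.Now().Add(-5 * time.Minute)},
	}}
	// worker-a has crashed, so its heartbeat is stale; worker-b picks up the job.
	for _, j := range store.ClaimStale("worker-b", time.Minute, time.Now()) {
		fmt.Printf("worker-b resuming %s on existing VM %s\n", j.ID, j.VM)
	}
}
```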

We have also identified a few possible improvements to our monitoring systems that should make it easier and faster for us to understand what is happening during incidents like this one.

Any questions?

We couldn’t be more sorry about this incident and the impact that the build outages and delays had on you, our users and customers. We always use problems like these as an opportunity for us to improve, and this is no exception.

We thank you for your continued support. We are working hard to make sure we live up to the trust you’ve placed in us and provide you with an excellent experience for your open source and private repository builds, as we know that our continuous integration and deployment tools are critical to your productivity.

If you have any questions or concerns that were not addressed in this post, please reach out to us via support@travis-ci.com and we’ll do our best to provide you with the answers.