Major build outage: a postmortem report

Two weeks ago, on Tuesday 16th and Wednesday 17th January, we had a major outage affecting our open-source infrastructure (travis-ci.org). For a period of about 17 hours, between 17:15 UTC on Tuesday and about 10:00 UTC on Wednesday, many builds were delayed, experienced errors, or were cancelled outright. Some customers continued to experience disruption until around 19:00 UTC on Wednesday.

As far as we are currently aware, our travis-ci.com infrastructure for private repositories was not affected.

We know how frustrating it can be when Travis CI is unavailable, and so we want to take the time to share what we’ve found while investigating this incident, and the steps we’re taking to improve the reliability of our service.

Background

Travis CI’s infrastructure is composed of many different applications and services, each with their own responsibilities. Some services handle the dispatch of builds to the correct build infrastructure, others supervise the builds while they run, and others are responsible for aggregating and storing the logs generated by a build while it runs.

Many of these services communicate with one another through a system called a “message broker”. For example: when a job is running, the logs it generates are split into chunks, and each chunk is sent to the message broker and placed in a queue. Another service reads the chunks from the queue, joins them together and stores the complete log in a database.

The system we use as a message broker is RabbitMQ, and we will refer to it repeatedly in the timeline below.
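
To make that flow a little more concrete, here is a rough sketch of the pattern using the Python pika client. The queue name, message format, and storage step are simplified placeholders rather than our actual implementation, and in reality the publisher and the consumer run in separate services.

```python
# A rough sketch of the log-chunk flow described above, using the pika
# RabbitMQ client. The queue name, message format, and the "store" step are
# simplified placeholders, not our production code; the publisher and
# consumer run in separate services in reality.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="logs", durable=True)


def publish_chunk(job_id, chunk_number, content):
    """Producer side: send one chunk of a running job's log to the broker."""
    channel.basic_publish(
        exchange="",
        routing_key="logs",
        body=json.dumps({"job_id": job_id, "number": chunk_number, "content": content}),
    )


# Consumer side: read chunks off the queue, reassemble them in order, and
# persist the complete log (persistence is stubbed out here).
chunks_by_job = {}


def on_chunk(ch, method, properties, body):
    message = json.loads(body)
    chunks_by_job.setdefault(message["job_id"], {})[message["number"]] = message["content"]
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue="logs", on_message_callback=on_chunk)
channel.start_consuming()
```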

What happened?

At around 13:45 UTC, for reasons we are continuing to investigate, the memory usage of one of the workers for our logs service started to grow. This worker was one of several responsible for retrieving log chunks from RabbitMQ. As it malfunctioned, the queue it normally processed began to grow far beyond its usual size.

As this queue grew, RabbitMQ, which by default stores messages in RAM in order to improve throughput, rapidly exhausted its available memory. When this happened, it started to page messages to disk, a process which can block publishing to, and consuming from, its queues.

While the logs service soon recovered and partially cleared its queue, the same sequence repeated itself several times. The first two times, RabbitMQ’s queue processing was suspended only briefly. The third time, queue processing appears to have blocked for an extended period, and other queues unrelated to logs began to be affected.

At around 17:15 UTC, publishing halted on a different queue – one that is used to report the pass/fail/error result of a job on completion. This stalled the completion of a large proportion of running jobs. Jobs continued to start, but few finished, and automated systems responded by scaling up the number of available job runners. As each of these job runners opened connections to RabbitMQ, the message broker became further overloaded.

At 17:47 UTC, an automated alert notified our on-call engineers that there was a backlog of builds waiting to be processed. The most likely cause of the observed symptoms seemed to be an extraordinary influx of new build requests, which were presumed to be malicious. As build queues continued to grow, our engineers sought to identify and cancel malicious build requests. These cancellations also flowed through RabbitMQ, further increasing the load on the cluster.

Within two hours it had become apparent that there were not enough identifiably malicious builds being triggered to explain the backlogs, and suspicion began to fall on RabbitMQ due to failures in a number of systems connected to it.

Once it was clear that RabbitMQ was overloaded, the on-call engineers, with assistance from our vendor (we use a managed RabbitMQ service), decided to clear all queues and restart the cluster. Shortly after 02:00 UTC on Wednesday 17th January, RabbitMQ was restarted. In addition, our queues were reconfigured to be “lazy”, meaning that RabbitMQ no longer attempted to store the entire queue in memory.
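
For reference, this “lazy” behaviour can be applied either per queue (with an x-queue-mode argument at declaration time) or across existing queues with a policy. The sketch below shows the policy route via RabbitMQ’s HTTP management API; the endpoint, credentials, and queue pattern are placeholders, and it illustrates one way of making the change rather than exactly how it was applied in our case.

```python
# One way to make queues "lazy" (page messages to disk rather than keeping
# the whole queue in RAM): a policy applied through RabbitMQ's HTTP
# management API. The URL, credentials, and queue pattern are placeholders.
import requests

RABBITMQ_API = "http://localhost:15672/api"
AUTH = ("guest", "guest")

policy = {
    "pattern": ".*",                       # match every queue on the vhost
    "apply-to": "queues",
    "definition": {"queue-mode": "lazy"},  # page messages to disk
}

# "%2F" is the URL-encoded default vhost "/"
response = requests.put(
    f"{RABBITMQ_API}/policies/%2F/lazy-queues", json=policy, auth=AUTH
)
response.raise_for_status()
```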

Restarting our RabbitMQ cluster is not something we do often. While the restart did recover the health of the cluster, it put a number of components of Travis CI into a broken state, and it took us some time to identify and resolve these issues.

By 04:09 UTC, we had resumed processing builds on our Linux container-based and Mac infrastructures.

At 08:11 UTC, after a shift hand-over to our European on-call engineers, we discovered that while many builds were running successfully, users still saw their builds as queued. The most likely culprit was a faulty connection to RabbitMQ. We fixed that, and build states began to be recorded correctly throughout our system. The Linux sudo: required infrastructure had failed to recover for the same reason and was restarted.

By 09:45 UTC, all infrastructures were functioning correctly. We cancelled all jobs that were stuck in the queued state as a result of our work overnight. At 11:18 UTC we marked the incident resolved.

Unfortunately, as the day progressed we started to receive reports of builds that were not being run. We opened a new incident and continued investigating. We discovered that our RabbitMQ queues still had a “purge” policy in place that we thought had been removed; the removal had only been partially applied. As a result, any builds which couldn’t be processed immediately by our workers were being dropped by RabbitMQ and thus never run. As our infrastructure autoscaling is triggered by the build backlog in RabbitMQ, this policy also prevented our systems from scaling correctly.
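
We don’t know the exact mechanics behind the policy our vendor calls “purge”, but a per-queue message TTL is one mechanism that produces the same symptom: messages that aren’t consumed quickly enough are silently discarded. The sketch below is purely illustrative.

```python
# Illustration only: a per-queue message TTL silently drops messages that
# are not consumed in time, which matches the symptom we saw (build
# requests vanishing before a worker could pick them up). This is not
# necessarily the exact policy that was applied to our queues.
import time
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Messages in this queue expire after one second if nothing consumes them.
channel.queue_declare(queue="builds.demo", arguments={"x-message-ttl": 1000})

channel.basic_publish(exchange="", routing_key="builds.demo", body=b"build request")
time.sleep(2)  # no consumer is attached, so the TTL elapses

method, properties, body = channel.basic_get(queue="builds.demo")
print(body)  # None: the broker has already discarded the message
```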

We removed this lingering policy, finally bringing our systems back to normal operations. The second incident was closed at 20:42 UTC.

What have we learned?

In responding to and investigating this incident, we learned a great deal about the interactions between various parts of the Travis CI system. We have applied queue policies in RabbitMQ to make it far less likely that a backlog of the kind that started this incident results in an extended queue publishing halt and the consequent cascading failure that we experienced here. We are also investigating how to make the various systems that rely on RabbitMQ more tolerant of queue halts and general cluster availability problems.
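
As one concrete example of that kind of tolerance, RabbitMQ notifies clients when it blocks a connection (for instance because of a memory alarm), and a publisher can set a timeout on that blocked state instead of hanging indefinitely. The sketch below shows the idea with the Python pika client; it is an illustration of the approach, not a description of our current code.

```python
# Sketch of a publisher that tolerates a blocked broker instead of hanging:
# RabbitMQ tells clients when it blocks a connection (e.g. due to a memory
# alarm), and pika can tear the connection down if it stays blocked for too
# long, letting the service fail fast, alert, and retry.
import pika

params = pika.ConnectionParameters(
    host="localhost",
    # Give up if the broker keeps this connection blocked for over 60 seconds.
    blocked_connection_timeout=60,
)

connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.confirm_delivery()  # wait for broker confirmation of each publish

try:
    channel.basic_publish(
        exchange="", routing_key="jobs", body=b"payload", mandatory=True
    )
except pika.exceptions.UnroutableError:
    # With confirms on and mandatory=True, an unroutable message raises an
    # error here instead of being dropped silently.
    print("message could not be routed")
```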

We have identified a number of possible improvements to our monitoring systems which should make it easier for us to understand what is happening in our systems during a cascading failure such as this one.
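
As an example of the kind of signal we want earlier visibility into, the sketch below polls RabbitMQ’s HTTP management API for queue backlogs and raised memory alarms. The endpoint, credentials, and threshold are placeholders, and this is a simplification of what a real monitoring check would do.

```python
# Sketch of broker-level checks that would have surfaced this failure
# earlier: queue depth and the per-node memory alarm, read from RabbitMQ's
# HTTP management API. Endpoint, credentials, and threshold are placeholders.
import requests

RABBITMQ_API = "http://localhost:15672/api"
AUTH = ("guest", "guest")
QUEUE_DEPTH_THRESHOLD = 10000


def check_broker():
    alerts = []

    # A raised memory alarm means RabbitMQ is already blocking publishers.
    for node in requests.get(f"{RABBITMQ_API}/nodes", auth=AUTH).json():
        if node.get("mem_alarm"):
            alerts.append(f"memory alarm raised on {node['name']}")

    # A queue backlog far beyond its normal size was the first symptom here.
    for queue in requests.get(f"{RABBITMQ_API}/queues", auth=AUTH).json():
        if queue.get("messages", 0) > QUEUE_DEPTH_THRESHOLD:
            alerts.append(f"{queue['name']}: {queue['messages']} messages queued")

    return alerts


for alert in check_broker():
    print(alert)
```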

Even before this incident, we had been investigating new approaches to dispatching builds to our infrastructure that would allow us to reduce our reliance on a highly available RabbitMQ cluster and improve the overall resilience of Travis CI. This work will receive renewed attention in light of what we learned from this incident.

Any questions?

We couldn’t be more sorry about this incident and the impact that the build outages and delays had on you, our users and customers. We always use problems like these as an opportunity for us to improve, and this will be no exception.

We thank you for your continued support. We are working hard to make sure we live up to the trust you’ve placed in us and provide you with an excellent experience for your open source and private repository builds, as we know that our continuous integration and deployment tools are critical to your productivity.

If you have any questions or concerns that were not addressed in this post, please reach out to us via support@travis-ci.com and we’ll do our best to provide you with the answers.