API and logs outage on travis-ci.org, Wed 1 March

From approximately 05:00 UTC on 1 March to 04:00 UTC on 2 March 2017, users of travis-ci.org experienced partial availability of the API and web interface, and delayed logs.

This post explains exactly what happened, how we dealt with the incident and what we plan to do to improve our service in these areas.

What happened

On Tuesday, 28 February 2017, Amazon Web Services (AWS) suffered a service outage in the us-east-1 region for approximately five hours.

The first part of AWS affected was Simple Storage Service (S3). Eventually, other products from AWS were also hit, including Elastic Compute Cloud (EC2) and Relational Database Service (RDS).

One of our upstream service providers, Heroku, was affected by this incident. Heroku’s health check system detected an unresponsive PostgreSQL database – the one we use for log storage. Whether the AWS outage impacted this particular database is still unclear: there is no indication of a problem in the PostgreSQL logs or syslog, but the timing of the incident suggests a connection. Since the database was unresponsive, we promoted a follower database that was already being prepared for a future release.

Unfortunately, the machine hosting the follower database was running a version of the Ubuntu Xenial kernel with a bug that causes processes using a lot of memory to be killed prematurely. This bug caused our logs database process to be killed and restarted frequently.

Consequently, travis-api and travis-logs lost their connections to the logs database and needed to be regularly restarted. The impact for users of travis-ci.org was that:

  • Our API was partially unavailable
  • Our web interface, as the biggest consumer of the API, was also partially unavailable
  • The RabbitMQ service we use to process logs was often backed up, and took some time to drain, resulting in logs being unavailable for brief but repeated periods

Steps taken

Throughout this period, we applied small changes to our application environments to reduce the need for application restarts. We also worked closely with Heroku to find a solution that would get our logs database back into working order.
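To give a sense of the kind of environment-level change involved, here is a minimal sketch only – not our production code, and written in Python with psycopg2 rather than our actual stack, with env var names that are assumptions for illustration. Tightening connection and statement timeouts lets an application fail fast on a wedged database instead of hanging until someone restarts it:

```python
import os
import psycopg2

# Hypothetical sketch: read conservative timeouts from the environment so a
# hung logs database produces fast, retryable errors rather than stuck workers.
LOGS_DATABASE_URL = os.environ["LOGS_DATABASE_URL"]  # assumed env var name
CONNECT_TIMEOUT = int(os.environ.get("LOGS_DB_CONNECT_TIMEOUT", "5"))            # seconds
STATEMENT_TIMEOUT_MS = int(os.environ.get("LOGS_DB_STATEMENT_TIMEOUT", "10000"))  # milliseconds

def connect_to_logs_db():
    """Open a connection that gives up quickly instead of hanging."""
    conn = psycopg2.connect(
        LOGS_DATABASE_URL,
        connect_timeout=CONNECT_TIMEOUT,
        options=f"-c statement_timeout={STATEMENT_TIMEOUT_MS}",
    )
    conn.autocommit = True
    return conn
```

Changes like these can be rolled out as config updates, which is much cheaper than repeatedly restarting whole applications.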

Initially, Heroku’s plan was to provision a new follower for us to fail over to. However, because of the continued instability of the existing primary, the base backup needed to build a follower could not be completed successfully. We discussed alternative plans with Heroku and eventually agreed that a small amount of downtime was now necessary, in order to apply an in-place kernel update to the live primary database.

At 05:45 UTC on 2 March, our API and logs applications were stopped. They came back online around 20 minutes later, after the kernel patch had been applied. We continued to monitor the stability of our applications for a number of hours, before considering the incident resolved at 09:49 UTC.

Conclusions

We would firstly like to extend our warmest thanks to the folks at Heroku. Despite having their own issues to deal with, they were quick, friendly and helpful throughout.

As a result of this outage, we are prioritising work that was already underway to remove the direct logs database connections from some of our applications. They will instead use an HTTP API which is currently being built into travis-logs. Where this is not possible, we will introduce automatic reconnection when the database connection is lost.
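The shape of that automatic reconnection is roughly the following. This is a minimal sketch under assumptions – Python with psycopg2 rather than our actual stack, and hypothetical names throughout – showing the general pattern: detect the lost connection, drop the dead handle, back off, and reconnect, so a database restart no longer forces an application restart.

```python
import time
import psycopg2

class ReconnectingLogsDB:
    """Sketch of a wrapper that re-establishes a lost database connection.

    Hypothetical illustration only; names and behaviour are assumptions,
    not the travis-logs implementation.
    """

    def __init__(self, dsn, max_retries=5, backoff_seconds=2):
        self.dsn = dsn
        self.max_retries = max_retries
        self.backoff_seconds = backoff_seconds
        self.conn = None

    def _connect(self):
        self.conn = psycopg2.connect(self.dsn, connect_timeout=5)
        self.conn.autocommit = True

    def execute(self, sql, params=None):
        """Run a statement, reconnecting and retrying if the connection drops."""
        for attempt in range(1, self.max_retries + 1):
            try:
                if self.conn is None or self.conn.closed:
                    self._connect()
                with self.conn.cursor() as cur:
                    cur.execute(sql, params)
                    return
            except psycopg2.OperationalError:
                # Connection lost (e.g. the database was restarted): drop the
                # handle, wait, and try again instead of crashing the process.
                self.conn = None
                if attempt == self.max_retries:
                    raise
                time.sleep(self.backoff_seconds * attempt)
```

In this sketch a worker keeps calling execute() as usual; a dropped connection triggers a short backoff and a fresh connection attempt rather than an error that requires operator intervention.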

Over the weekend, we created a new follower database with an updated kernel for travis-ci.com. We successfully promoted this follower to the primary database at around 03:10 UTC on Sunday, 5 March.

We will also review our database failover setup to explore additional redundancy wherever we can.

We are extremely grateful for the messages of support we received on Twitter during this outage – we couldn’t ask for better users. We continue to seek ways to improve our reliability for you. Thank you for your patience!