Service outages, 9-14 July 2016
On Saturday 9 July, some users experienced a service outage on travis-ci.org. For a three-hour period beginning at approximately 14:30 UTC, some builds were not processed. This was followed by five hours of degraded scheduling performance as we cleared the backlog. By 23:04 UTC, the incident was fully resolved.
A very similar outage occurred on Thursday 14 July, beginning at approximately 16:00 UTC and fully resolving by 02:30 UTC the following morning.
We know our users rely on Travis CI to run smoothly at all times, and we are very sorry for the trouble these outages caused.
What happened?
We are currently rolling out a rewritten version of the application that handles the data we receive from GitHub webhooks. Our rollout process means that, in theory, the number of users exposed to the rewritten application increases in small increments over time, while we closely monitor performance.
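To make the rollout mechanism concrete, here is a minimal sketch of a percentage-based gate of the kind described above. This is an assumption about the general shape of such a mechanism, not Travis CI's actual implementation; the names ROLLOUT_PERCENTAGE and use_rewritten_handler are ours.

```python
import zlib

# Hypothetical feature gate, not Travis CI's actual implementation.
# The percentage is raised in small increments while performance is
# monitored, per the rollout process described above.
ROLLOUT_PERCENTAGE = 5  # illustrative starting value

def use_rewritten_handler(owner_id: int) -> bool:
    """Return True if this owner's webhooks go to the rewritten application.

    Hashing the owner ID (rather than sampling randomly per request)
    keeps each user consistently on one code path, so any problems
    surface for a small, fixed group instead of intermittently for
    everyone.
    """
    bucket = zlib.crc32(str(owner_id).encode()) % 100
    return bucket < ROLLOUT_PERCENTAGE
```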
Sadly, this application left some database records in an unexpected state. This had severe consequences for another application in our system, the one responsible for scheduling the jobs for each build. As the scheduler struggled to parse the bad data, other jobs were left waiting in a backed-up queue.
In between the two outages, we found what we thought was the cause and released an update. After a couple of days without a recurrence, we assumed that everything was fine. As it turned out, our initial diagnosis was incorrect.
Next steps
We are taking a number of steps to prevent this from happening again. Firstly, we immediately paused the rollout of the rewritten application. It will resume this week, once we are confident that we have made its handling of database records more robust.
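As an illustration of what more robust handling might look like on the producing side, here is a sketch that refuses to persist an incomplete record, under the assumption that the bad records were rows written in a partially initialised state. The payload shape and field names are hypothetical.

```python
# Hypothetical webhook payload fields; real payloads and schemas differ.
REQUIRED_FIELDS = ("repository_id", "commit_sha", "state")

def build_record(payload: dict) -> dict:
    """Construct a build record, refusing to persist an incomplete one.

    Failing loudly at write time keeps malformed rows out of the table
    the scheduler later reads, instead of surfacing as a downstream
    parsing failure.
    """
    record = {
        "repository_id": payload.get("repository", {}).get("id"),
        "commit_sha": payload.get("head_commit", {}).get("id"),
        "state": "pending",
    }
    missing = [field for field in REQUIRED_FIELDS if record.get(field) is None]
    if missing:
        raise ValueError(f"refusing to persist incomplete record, missing: {missing}")
    return record
```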
Secondly, we have already patched the job scheduling application to ensure that it does not struggle under these specific circumstances again. We are also exploring different job scheduling strategies as a result of this investigation.
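On the consuming side, a patch like the one described above plausibly amounts to isolating records the scheduler cannot parse rather than letting one bad record stall the whole loop. The sketch below assumes that shape; the record format and validation logic are hypothetical.

```python
import logging

logger = logging.getLogger("scheduler")

def parse_job_record(record: dict) -> dict:
    """Hypothetical parser that rejects records in an unexpected state."""
    if record.get("state") != "pending" or "job_id" not in record:
        raise ValueError(f"unexpected record state: {record!r}")
    return {"job_id": record["job_id"]}

def schedule_pending_jobs(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Schedule every parseable job; quarantine the rest instead of stalling.

    Without this isolation, a single record in an unexpected state can
    make the scheduler struggle and leave every other job waiting in a
    backed-up queue.
    """
    scheduled, quarantined = [], []
    for record in records:
        try:
            scheduled.append(parse_job_record(record))
        except ValueError as exc:
            # Don't let one malformed record hold up the queue: log it,
            # set it aside for inspection, and keep going.
            logger.error("Quarantining record: %s", exc)
            quarantined.append(record)
    return scheduled, quarantined
```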
Lastly, and perhaps most importantly, we are looking into how we can be alerted to these kinds of issues more promptly. It took around three hours for us to begin tackling the first outage, mostly because it had evaded our alert system. We don’t ever want that to happen again.
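One plausible improvement, sketched below under our own assumptions about metrics and thresholds: alert on the age of the oldest unscheduled job rather than only on error rates, because a stalled scheduler emits few errors while the queue quietly grows.

```python
from datetime import datetime, timedelta, timezone

# Illustrative (hypothetical) threshold: if the oldest pending job has
# waited longer than this, something upstream is likely stuck.
MAX_QUEUE_AGE = timedelta(minutes=10)

def queue_backlog_alert(oldest_enqueued_at: datetime | None) -> bool:
    """Return True if the scheduling backlog should trigger an alert.

    An error-rate alert can miss this failure mode entirely: a stalled
    scheduler raises few errors while jobs pile up, so we watch queue
    age instead.
    """
    if oldest_enqueued_at is None:
        return False  # empty queue: nothing waiting to be scheduled
    age = datetime.now(timezone.utc) - oldest_enqueued_at
    return age > MAX_QUEUE_AGE
```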
Conclusion
These outages made things difficult for a lot of our users, and we want to reiterate our apologies for the stress they caused. We remain fully committed to improving the performance and reliability of Travis CI.
Thanks for your understanding!