Note: This postmortem will also be posted on the relevant status incident.
We strive to provide the most stable and user-friendly CI platform possible so that you and your teams can focus on shipping amazing open source and commercial software. When any portion of our service is unavailable, we know it can bring your productivity to a screeching halt. As developers building a tool for other developers, we understand firsthand how frustrating and debilitating this can be.
We want to take the time to explain what happened. We recognize that this was a significant disruption to the workflow and productivity of all of our users who rely on us for macOS building and testing. This is not at all acceptable to us. We are very sorry that it happened. We are very conscious of the fact that our macOS infrastructure has had ongoing stability and backlog issues, and we are very close to putting new infrastructure improvements into production to ensure a higher level of reliability going forward.
The following is a timeline of the events during this outage.
Note: All times are in UTC.
Feb 01, 2017 - 01:58 UTC: macOS queues for both public and private repos were backed up. We began working with our macOS infrastructure provider to identify contributing factors.
Feb 01, 2017 - 02:27 UTC: We began stopping all job throughput to prevent runaway VM leakage while waiting for further insights from our upstream infrastructure provider.
Feb 01, 2017 - 02:52 UTC: Some misbehaving hosts were restarted, thanks to help from our upstream provider. We started bringing job processing capacity back online.
Feb 01, 2017 - 04:51 UTC: The underlying VM infrastructure remained unstable, so we continued coordinating with our infrastructure provider to perform a full restart of the entire underlying infrastructure.
Feb 01, 2017 - 06:45 UTC: The virtualization platform was fully restarted and we began bringing job processing capacity back online.
Feb 01, 2017 - 07:11 UTC: Restarting the platform did not resolve all issues, and we resumed digging into the sources of instability.
Feb 01, 2017 - 09:30 UTC: We identified connectivity issues in our macOS workers and stopped all macOS builds to further investigate and fix them.
Feb 01, 2017 - 14:07 UTC: We made the difficult decision to cancel all pending macOS builds on travis-ci.org, in part to reduce the impact on Linux build throughput and to begin running new builds for users.
Feb 01, 2017 - 14:33 UTC: We continued working on a fix for the connectivity issue that was preventing us from restarting macOS build processing on both travis-ci.com and travis-ci.org.
Feb 01, 2017 - 15:00 - 17:30 UTC: We provided regular updates as we continued to work on fixing the connectivity issues.
Feb 01, 2017 - 17:47 UTC: We were in the process of testing further patches to skip jobs older than 6 hours in order to help with the massive backlog.
Feb 01, 2017 - 18:00 - 19:00 UTC: Additional testing was required before we could resume running any builds.
Feb 01, 2017 - 19:07 UTC: We were now running at reduced job processing capacity in production for both public and private repos.
Feb 01, 2017 - 19:33 UTC: We increased capacity in production for both public and private repos. Due to ongoing issues with our DHCP setup, we were still limited to less than full capacity.
Feb 01, 2017 - 20:38 UTC: Our macOS infrastructure was processing builds normally for both travis-ci.org and travis-ci.com, albeit at a reduced capacity. We continued working on fixing our DHCP issues to be able to restore the full capacity.
Feb 01, 2017 - 21:36 UTC: The private repo backlog had dropped steadily over the previous hour, and we expected it to be caught up in less than 90 minutes.
Feb 01, 2017 - 22:49 UTC: We saw the backlog level off during peak usage hours.
Feb 02, 2017 - 00:13 UTC: The backlog for private repos was still dropping; now below 150.
Feb 02, 2017 - 01:18 UTC: The backlog for private repos was still dropping; now below 50.
Feb 02, 2017 - 01:35 UTC: Issue Resolved
The major contributing factors in this outage were:
- Multiple vSphere hosts becoming unavailable resulted in strain on the whole system and caused a portion of new VM creation to start failing. This caused a churn of requeues of build jobs, which kept adding more strain to the entire virtualization platform.
- Unexpected corruption of one of the pair of hosts that provide NAT and DHCP for our build VM network resulted in complete configuration loss of the other host. This led to us needing to move those services to a different component in our stack, while we rebuilt the corrupt hosts from scratch.
- Existing limitations in how the core of our scheduling backend works meant that a backlog of macOS jobs blocked new Linux builds from starting and running.
To reduce the likelihood and impact of similar incidents, we are taking the following steps:
- We'll be sharing more details in a future blog post, but we've invested in building out a sharded virtualization infrastructure and we'll be migrating our macOS builds to this new infrastructure in the near future. This will give us more fault tolerance and let us spread out load across more isolated components.
- We are investing in a newer hardware platform for the vSphere hosts, which will be able to handle load better and should result in improvements in overall build performance.
- We identified and made a small set of hot fixes during the outage, which have already improved our ability to handle the kind of failure scenario we saw and reduced the number of job requeues that happen during this type of outage. We are discussing further improvements we can make to the key backend services that interact with our macOS virtualization platform.
- We will be improving our monitoring of the build VM NAT/DHCP components so that we can more quickly detect when they are in a failure state.
- We are looking at how we can improve our scheduling so that a macOS backlog does not impact Linux builds so dramatically.
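The scheduling point above, where a macOS backlog held up Linux builds, is a form of head-of-line blocking. One common way to avoid it is to keep a separate queue per platform and dispatch only from queues whose platform has free capacity. The Python sketch below illustrates that general technique under stated assumptions; it does not describe the actual scheduler, and the `PlatformQueues` class, its methods, and the job names are all hypothetical.

```python
from collections import deque


class PlatformQueues:
    """Hypothetical scheduler sketch: one FIFO queue per platform."""

    def __init__(self):
        self._queues = {}

    def enqueue(self, platform, job):
        # Create the platform's queue on first use, then append in FIFO order.
        self._queues.setdefault(platform, deque()).append(job)

    def next_job(self, available):
        """Return (platform, job) for the first platform that has both a
        queued job and a free worker slot, or None if nothing can start.

        `available` maps platform name to the number of free worker slots.
        """
        for platform, queue in self._queues.items():
            if queue and available.get(platform, 0) > 0:
                return platform, queue.popleft()
        return None


queues = PlatformQueues()
for n in range(3):
    queues.enqueue("macos", f"macos-job-{n}")
queues.enqueue("linux", "linux-job-0")

# macOS has no free capacity, but Linux does, so the Linux job still starts.
result = queues.next_job({"macos": 0, "linux": 1})
```

Here `result` is `("linux", "linux-job-0")` even though three macOS jobs were queued first. With a single shared FIFO queue, the macOS jobs at the head would have to be dispatched or requeued before the Linux job could run. A production scheduler would also need fairness between platforms, which this sketch leaves out.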
We couldn't be more sorry about this incident and the impact that the build outages and delays had on you, our users and customers. We always use problems like these as an opportunity for us to improve, and this will be no exception.
We thank you for your continued support of Travis CI. We are working hard to make sure we live up to the trust you've placed in us and provide you with an excellent build experience for your open source and private repository builds, as we know that the continuous integration and deployment tools we provide are critical to your productivity.
If you have any questions or concerns that were not addressed in this postmortem, please reach out to us via email@example.com and we'll do our best to provide you with the answers to your questions or concerns.