Sept 6 - 11 macOS outage postmortem
Note: This postmortem will also be posted on the relevant status incident.
We strive to provide the most stable and user-friendly CI platform possible so that you and your teams can focus on shipping amazing open source and commercial software. When any portion of our service is unavailable, we know it can bring your productivity to a screeching halt. As developers building a tool for other developers, we understand firsthand how frustrating and debilitating this can be.
We want to take the time to explain what happened. We recognize that this was a significant disruption to the workflow and productivity of all of our users who rely on us for macOS building and testing. This is not at all acceptable to us. We are very sorry that it happened.
The following is a timeline of the events during this outage.
Note: All times are in UTC.
- Sep 07, 2017 - 09:36 UTC: Repositories running on travis-ci.org and travis-ci.com are experiencing an increase in errored builds. We are investigating and will update as soon as we can.
- Sep 08, 2017 - 08:56 UTC: The stability of our macOS builds seems to have improved. We will continue to monitor the rate of errored builds.
- Sep 08, 2017 - 10:32 UTC: We continue investigating macOS requeues and build timeouts for both public and private repositories. This seems to be related to SAN performance. We’ll continue posting updates as we work toward more stable performance.
- Sep 08, 2017 - 11:07 UTC: We’ve identified an issue with some of our Xcode image hosts, causing macOS requeues on both public and private repositories. We’re working together with our upstream provider to sort this out while we continue investigating macOS build timeouts.
- Sep 08, 2017 - 12:34 UTC: We are stopping all macOS jobs because we have run out of space on our data center’s SAN.
- Sep 08, 2017 - 15:01 UTC: We’re currently working with our infrastructure provider to reboot one of our vCenter instances to work out unresponsive SAN issues. macOS jobs for both public and private repositories are stopped.
- Sep 08, 2017 - 16:52 UTC: We’ve rebooted our vCenters and continue to work on stabilizing things. All macOS builds remain stopped.
- Sep 08, 2017 - 19:52 UTC: We’re continuing to work on getting things into a stable state where we can potentially start running builds. At the moment we do not have an ETA for when we will resume builds. We are very sorry for the delays and will update this incident when we know more. Thank you for your patience.
- Sep 08, 2017 - 21:29 UTC: We’re working on stabilization cleanup for our SAN storage. At the moment we do not have an ETA for when we will resume builds. We are very sorry for the delays and will update this incident when we know more. Thank you for your patience.
- Sep 09, 2017 - 01:04 UTC: In order to help things become stable and reliable going forward, we’re undertaking intense cleanup of our SAN filesystem. This cleanup is likely to take all weekend. Because of this, we’re only able to resume a portion of our capacity for private builds and will not be resuming shared public builds yet. We do not currently have an ETA for when we’ll be able to resume shared public builds. We will provide our next update in the morning PDT. We are very sorry for the delays and will update this incident when we know more. Thank you for your patience.
- Sep 09, 2017 - 01:28 UTC: We ran into an issue with booting Xcode 8.x images, so all builds are suspended again. We’ll update when private builds are running.
- Sep 09, 2017 - 03:13 UTC: We’ve resumed running private builds at this time. We’ll provide further updates on the overall progress tomorrow morning PDT. Thank you for your patience.
- Sep 09, 2017 - 13:04 UTC: The backlog for private repository builds has been clear for ~4h. We are planning to bring partial capacity for public repositories back online shortly.
- Sep 09, 2017 - 14:32 UTC: Capacity for macOS public repositories has been back online for ~1 hr. We’re bumping additional capacity to work through the backlog.
- Sep 09, 2017 - 15:11 UTC: We’re continuing to process the public backlog while running SAN cleanup. We may still need to reduce or suspend public builds later in the weekend, depending on SAN progress. Thank you for your patience.
- Sep 10, 2017 - 16:45 UTC: We’ve processed a backlog of approximately 9,600 macOS jobs for public repositories since re-enabling public macOS builds at 07:00 PDT yesterday. As we’re still at reduced capacity and working on cleaning the SAN, we still have a backlog of ~150-200 jobs and continue to actively process them. We’ll provide updates as things progress today. Thank you for your patience.
- Sep 10, 2017 - 19:45 UTC: We temporarily have additional reduced capacity for public builds, as we take some actions to continue with our SAN cleanup. We’ll provide another update when that capacity has been restored.
- Sep 10, 2017 - 19:56 UTC: We’re now running with the previous capacity for public builds, which is still reduced from our “normal” capacity. We are continuing with SAN cleanup. We’ll provide updates as things progress today. Thank you for your patience.
- Sep 11, 2017 - 03:45 UTC: We’ve completed the first phase of our SAN cleanup. Things are stable and so we’re working to resume full public macOS build capacity. We’ll provide another update when that’s complete.
- Sep 11, 2017 - 03:57 UTC: We’ve resumed full build capacity for public builds. We will be monitoring things overnight and will provide further updates in the morning PDT. Thank you for your patience.
- Sep 11, 2017 - 15:47 UTC: We’re seeing some instability with some of the private macOS build capacity and so we’re reducing capacity temporarily.
- Sep 11, 2017 - 16:50 UTC: We’re resuming full private macOS build capacity.
- Sep 11, 2017 - 19:52 UTC: The backlog has cleared for private builds. We are continuing to monitor the situation for public/open source builds. Thanks for hanging in there with us.
- Sep 11, 2017 - 21:11 UTC: The public macOS build backlog has reached normal peak levels and things are remaining stable. We’re closing the incident at this time. A postmortem blog post will be published in the next few days and we’ll share it on Twitter when it’s published. Thank you everyone for your patience and understanding during this extended incident. Incident is Resolved.
The major contributing factors in this outage were:
- Ongoing performance limitations with our current shared SAN platform were leading to hypervisor host instability.
- The SAN storage became too full to keep up with our I/O demands. We rely on the thin provisioning feature of the SAN storage to launch build VMs quickly and efficiently. Over time, the amount of used space on our SAN LUNs had grown. We initially attributed this to our overall growth in usage and concurrent load, and realized too late that, due to the host instability mentioned above, we were accumulating orphaned folders/files on the SAN that were not visible to our existing cleanup processes. Free space eventually dropped to the point where the SAN could no longer keep up with our peak load.
- A partial outage and reduced capacity were required to clean up these orphaned folders/files in a timely manner.
To address these factors, we are taking the following steps:
- We’re currently working with our infrastructure provider on a new dedicated SAN platform which will bring dedicated capacity with faster storage and increased storage network throughput.
- Through the cleanup process, we developed a set of scripts that helped us identify and clean up orphaned files/folders. While we anticipate that improvements from the new SAN platform will be sufficient to prevent the accumulation of orphaned files/folders in the future, we plan to use some of these scripts for reporting and metrics, so that we can measure and verify that the new SAN platform brings the improvements we expect.
- There will also be some forthcoming announcements about other product changes we may be implementing to help manage the overall travis-ci.org macOS backlog.
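To give a rough idea of what orphan reporting can look like, here is a minimal sketch. It is not one of the scripts we used during the cleanup. It assumes a hypothetical layout where each build VM owns one folder on the datastore, and that a folder whose name matches no VM in the inventory is an orphan. The names `Folder`, `find_orphans`, and `orphan_report` are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Folder:
    # Hypothetical record: one folder per VM on the datastore.
    name: str
    size_bytes: int


def find_orphans(folders, active_vm_names):
    """Return the folders whose name matches no VM in the inventory."""
    active = set(active_vm_names)
    return [f for f in folders if f.name not in active]


def orphan_report(folders, active_vm_names):
    """Summarize how much datastore space is held by orphaned folders."""
    orphans = find_orphans(folders, active_vm_names)
    total = sum(f.size_bytes for f in folders)
    orphaned = sum(f.size_bytes for f in orphans)
    return {
        "orphan_count": len(orphans),
        "orphan_bytes": orphaned,
        # Guard against an empty datastore listing.
        "orphan_fraction": orphaned / total if total else 0.0,
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    folders = [Folder("job-1", 100), Folder("job-2", 300), Folder("job-3", 600)]
    print(orphan_report(folders, {"job-1", "job-3"}))
```

Tracking a number like `orphan_fraction` over time is one way to notice orphans accumulating before free space becomes a problem, rather than attributing the growth to normal usage.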
We couldn’t be more sorry about this incident and the impact that the build outages and delays had on you, our users and customers. We always use problems like these as an opportunity for us to improve, and this will be no exception.
We thank you for your continued support of Travis CI. We are working hard to live up to the trust you’ve placed in us and to provide you with an excellent build experience for your open source and private repository builds, as we know that the continuous integration and deployment tools we provide are critical to your productivity.
If you have any questions or concerns that were not addressed in this postmortem, please reach out to us via firstname.lastname@example.org and we’ll do our best to answer them.