As many of you are aware, Travis CI started as an open source software project, and was initially conceived as a platform to run free CI builds for open source projects. As demand for the service grew, and we began offering hosted services for paying customers, we always continued serving the open source community as well. We’re extremely proud of what we’ve built, and even prouder of the 900K+ open source projects we support. It’s a big part of why we enjoy what we do.
Currently, Travis CI offers open source customers three types of build environments, hosted on three different cloud providers:
- sudo: false, container-based (Docker / Ubuntu), hosted on Amazon AWS EC2
- sudo: required, full-VM (Ubuntu), hosted on Google Compute Engine (GCE)
- os: osx, macOS builds running on VMware vCenter vSphere, hosted on MacStadium
Our free offering for open source accounts includes five concurrent jobs, shared amongst any of the three hosted build environments listed above.
Over time, the increase of paid customers for Travis CI has also meant growth in the services we provide for the open source community. We’ve always tried to keep up with the demand for free open source builds, but in the past year, the demand for macOS builds has been more than we can keep pace with.
Part of this demand comes from the increasing number of macOS open source projects out there, including projects that have multi-OS workflows. We expected some growth for multi-OS and macOS, and we’re happy it has exceeded our expectations (it’s great that more open source projects exist), but such growth comes with complications. In Q3 alone, we had a near-permanent backlog of jobs waiting to start on our macOS infrastructure, averaging 1000-1200 queued jobs during peak times. Not only has this caused unbearable delays for our macOS users, but it has also created problems for repos/projects with multi-OS build pipelines.
Here is a graph of our queued macOS builds vs. running macOS builds for 2017, year to date. You’re probably already familiar with it: it’s composed of two of the metrics available at https://www.traviscistatus.com/. The red areas show the queued backlog, “Backlog macOS Builds for Open Source projects”. The small green areas underneath show the builds actually being processed, “Active macOS Builds for Open Source projects”.
While there was a reduction of backlog in March following several months of infrastructure work and a doubling of our macOS build capacity for open source repositories, the high demand and consequent churn in our macOS infrastructure were more than our system could handle, resulting in two recent macOS outages. We scaled back capacity further in August and again in September to stabilize the macOS build infrastructure.
Unfortunately, a simple solution is not available to us. Scaling for macOS requires a significantly different system than for Linux. We intend to write an engineering blog post soon to share some of the internals of our macOS infrastructure, so that the challenges for hosted CI using macOS are better understood. There are significant differences compared to our full Linux VMs (GCE) and container-based Ubuntu Linux (AWS) infrastructures, including, unfortunately, lower build capacity limits.
Meanwhile, as we only have a finite amount of infrastructure resources available, we believe it necessary to adjust our open source offering for macOS. After much painstaking deliberation, we have decided to limit concurrent macOS builds to a maximum of two out of the five offered. If you are using only Linux on travis-ci.org, you will still have access to all five concurrent builds, but if you are running both Linux and macOS, only two of the five concurrent builds can be used for macOS at any one time. We are hoping this will help distribute backlog more evenly across repositories and reduce wait time overall.
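To make the new limit concrete, here is a minimal sketch of how it could be applied to one account's queue. It is an illustration only, not the actual travis-ci/travis-scheduler code: the function name, the "linux"/"osx" labels, and the choice to let later Linux jobs start while a capped macOS job keeps waiting are all assumptions.

```python
# Limits taken from the announcement; every name here is hypothetical.
TOTAL_LIMIT = 5   # concurrent jobs per open source account
MACOS_LIMIT = 2   # of those, at most this many may be macOS jobs


def pick_schedulable(queued, running):
    """Return the queued jobs that fit under the account's limits.

    `queued` and `running` are lists of OS labels ("linux" or "osx"),
    with `queued` in the order the jobs were created.
    """
    running_total = len(running)
    running_macos = sum(1 for os_name in running if os_name == "osx")
    picked = []
    for os_name in queued:
        if running_total >= TOTAL_LIMIT:
            break  # all five slots are in use
        if os_name == "osx":
            if running_macos >= MACOS_LIMIT:
                # Assumption: a capped macOS job waits, later Linux jobs may still start.
                continue
            running_macos += 1
        running_total += 1
        picked.append(os_name)
    return picked


if __name__ == "__main__":
    queue = ["osx", "osx", "osx", "linux", "linux", "linux", "linux"]
    print(pick_schedulable(queue, running=[]))
```

With an idle account and that mixed queue, the sketch picks two macOS jobs and three Linux jobs. The third macOS job and the last Linux job stay queued.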
So what does this mean?
Let’s start by explaining how we schedule jobs. Before a job starts, it waits in two stages:
- The time it takes the job to get scheduled. Depending on the number of jobs that are already running for your organization or user, you might need to wait for jobs to get scheduled. travis-ci/travis-scheduler takes care of this.
- The time it takes a scheduled job to get a free slot in the overall capacity available in each environment. Normally, jobs will find a slot right away in any of our Linux environments. Lately, this has not been the case for macOS, creating the red part of the graph above.
By reducing the concurrency for macOS, we hope to reduce the time it takes for a machine to become available, while the time for a job to get scheduled will likely increase. Depending on how you use Travis CI (Linux, macOS or both), you might see an increase or a decrease in the time it takes for one of your builds to start.
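The two waiting stages can be pictured with a small sketch that classifies where a queued job is currently held up. This is a simplified model under stated assumptions, not how our scheduler is implemented: the function, its parameters, and the state labels are hypothetical, and the limits of five total and two macOS are the ones described in this post.

```python
def job_state(job_os, owner_running, free_slots, total_limit=5, macos_limit=2):
    """Classify where a queued job is waiting (hypothetical model).

    owner_running: dict mapping OS label -> jobs already running for the owner
    free_slots:    dict mapping OS label -> free machines in that environment
    """
    owner_total = sum(owner_running.values())
    over_owner_limit = owner_total >= total_limit or (
        job_os == "osx" and owner_running.get("osx", 0) >= macos_limit
    )
    if over_owner_limit:
        # Stage 1: the owner's concurrency limit holds the job back.
        return "waiting to be scheduled"
    if free_slots.get(job_os, 0) <= 0:
        # Stage 2: scheduled, but no machine is free in that environment.
        return "waiting for capacity"
    return "can start"


if __name__ == "__main__":
    # Two macOS jobs already running: a third waits in stage 1,
    # even though macOS machines are free.
    print(job_state("osx", {"osx": 2}, {"osx": 10}))
    # Under the limit, but the macOS environment is full: stage 2.
    print(job_state("osx", {"osx": 1}, {"osx": 0}))
```

In this model, lowering the macOS limit moves waiting from the second stage (the shared backlog) into the first stage (each owner's own queue), which is the trade-off described above.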
We also anticipate that the recent release of default auto-cancellations for open source builds will help mitigate the load by clearing redundant builds out of the macOS backlog. While this may not account for many builds, we want to do all we can.
Moving forward, our overall aims are to:
- reduce wait times for everyone using open source travis-ci.org macOS builds
- reduce resource contention between our public and private repositories
- give you all more control over waiting time
- reduce any possible platform abuse, an unfortunate side effect of free services, so that resources are fairly shared between all legitimate users
We know the recent period has been especially trying for those of you using our macOS infrastructure. We truly appreciate the help we’ve received to determine the source of macOS backlog issues. Now that we’ve implemented these changes, we hope you will all have a better experience using Travis CI for your macOS and multi-OS open source builds.
For any questions please drop us an email at firstname.lastname@example.org. We’d be happy to help.