Two weeks ago, on Wednesday, March 27th through the 28th, we had a major outage affecting both of our platforms, travis-ci.com and travis-ci.org. For about 20 hours, all builds on our Linux and Windows infrastructures were delayed.
We are deeply sorry for the inconvenience this has caused you, and we want to let you know what we’ve found and the measures we’ve taken to improve the reliability of our service.
Around 14:45 UTC on March 27th, we started receiving reports of long virtual machine boot times for Linux builds. Upon investigation, our monitoring system showed that the shared Linux and Windows infrastructure had quickly become over capacity.
Right after this, we found that an increase in build job traffic had led to the creation of more VM instances than usual, and that those VM instances were stalling a couple of minutes after their creation.
At that time, our instance cleanup system was set to delete stopped instances older than 3 hours. Given the number of previously created instances, this triggered many delete requests to our cloud provider, which in turn led to our API calls being rate limited.
Being rate limited meant that we had lost the ability to delete those stopped instances, as well as to create new ones. As more and more instances became terminated, a backlog started to form, which made booting Linux builds slower. Because we could not delete instances, we eventually reached computing resource limits (e.g. SSD quota), causing requeues that further increased our number of API calls.
To mitigate and later resolve this situation, we reduced the number of API calls we made by reducing our own self-imposed rate limit. At the beginning of the outage, it was set too low, which was also causing build jobs to time out while waiting to make an API call.
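To illustrate how a self-imposed rate limit interacts with job timeouts, here is a minimal sketch of a client-side limiter. It is not our actual implementation: the token-bucket approach, the `TokenBucket` class, and all of the numbers are assumptions chosen for illustration only.

```python
class TokenBucket:
    """Hypothetical client-side limiter: `rate` calls per second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def _refill(self, now):
        # Add the tokens earned since the last update, capped at the bucket size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def acquire(self, now, timeout):
        """Return the wait in seconds before the call may be made,
        or None if that wait would exceed `timeout` (the caller gives up)."""
        self._refill(now)
        wait = max(0.0, (1 - self.tokens) / self.rate)
        if wait > timeout:
            return None
        # Reserve one token; a negative balance is a claim on future refills.
        self.tokens -= 1
        return wait


def count_timeouts(rate, jobs=10, timeout=1.0):
    """Ten jobs ask for an API call at the same instant (illustrative numbers)."""
    bucket = TokenBucket(rate=rate, capacity=2)
    results = [bucket.acquire(now=0.0, timeout=timeout) for _ in range(jobs)]
    return sum(r is None for r in results)
```

With these made-up numbers, a limit of 2 calls per second leaves most of the ten jobs waiting longer than their one-second timeout, while a limit of 10 calls per second lets all of them through. The trade-off is that a higher client-side limit also sends more requests toward the provider's own limit.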
After this, our calls were getting through the API limits again, so we could delete stopped VM instances, slowly increasing our capacity while the backlog of queued build jobs was processed.
Measures taken after the incident
Our cleanup system has been updated to delete stopped instances right away, instead of waiting 3 hours to clean them up. This way, we delete them as they become terminated, which reduces the number of requests made at the same time.
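The change to the cleanup rule can be sketched as a single age threshold. This is a simplified illustration rather than our production code: the `Instance` type, the status strings, and measuring age from the moment an instance stopped are all assumptions.

```python
from dataclasses import dataclass

THREE_HOURS = 3 * 60 * 60  # the previous threshold, in seconds


@dataclass
class Instance:
    name: str
    status: str        # "RUNNING" or "STOPPED" in this sketch
    stopped_at: float  # seconds; only meaningful once the instance has stopped


def instances_to_delete(instances, now, min_age_seconds):
    """Names of stopped instances that have been stopped for at least `min_age_seconds`."""
    return [
        inst.name
        for inst in instances
        if inst.status == "STOPPED" and now - inst.stopped_at >= min_age_seconds
    ]


# Hypothetical fleet, observed four hours (14400 seconds) after time zero.
fleet = [
    Instance("vm-a", "STOPPED", stopped_at=0.0),      # stopped 4 hours ago
    Instance("vm-b", "STOPPED", stopped_at=10800.0),  # stopped 1 hour ago
    Instance("vm-c", "RUNNING", stopped_at=0.0),      # still running, never selected
]
old_rule = instances_to_delete(fleet, now=14400.0, min_age_seconds=THREE_HOURS)
new_rule = instances_to_delete(fleet, now=14400.0, min_age_seconds=0)
```

Under the old rule only `vm-a` is selected, and `vm-b` keeps occupying quota for two more hours. Under the new rule every stopped instance is selected on the next cleanup pass.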
Besides reducing the number of API requests when deleting instances, we are also working on improving all API calls to our cloud provider, to make all of our requests more efficient and avoid being rate limited again in the future.
We are also adding more metrics to our system to improve visibility into whether we may be hitting rate limits, and adding logging to the cleanup system, so that we get alerted in a timely manner.
We have always used situations like this one as an opportunity to improve and grow, and this incident will be no exception.
We can’t express how sorry we are for all the trouble this has caused your development pipeline. Besides the measures taken to resolve this incident, we are actively working to improve our system and bring you a better experience.
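As a rough idea of what such a metric could look like, the sketch below counts rate-limited responses in a sliding time window and signals when they cross a threshold. The `RateLimitMonitor` class, the window length, and the threshold are hypothetical and are not a description of our monitoring stack.

```python
from collections import deque


class RateLimitMonitor:
    """Hypothetical sliding-window counter of rate-limited API responses."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = deque()  # timestamps of rate-limited responses, oldest first

    def record(self, now):
        """Call this each time the provider rejects a request as rate limited."""
        self.events.append(now)

    def should_alert(self, now):
        """True if at least `threshold` rejections happened in the last window."""
        # Drop timestamps that have fallen out of the window.
        while self.events and now - self.events[0] > self.window_seconds:
            self.events.popleft()
        return len(self.events) >= self.threshold


monitor = RateLimitMonitor(window_seconds=60, threshold=3)
for t in (0.0, 10.0, 20.0):
    monitor.record(t)
alert_during_burst = monitor.should_alert(now=20.0)
alert_after_quiet_period = monitor.should_alert(now=100.0)
```

In this example the three rejections within 20 seconds trigger an alert, and the alert clears once they have aged out of the 60-second window.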