Travis

FAQ: New GCE Trusty and Precise Image Stacks & Outage from 9 August 2016

Carmen Andoh w/ Anna Nagy and Lena Reinhard,

GCE Images Outage FAQ

After our recent GCE Images Outage, we want to share more information on who was affected by the outage and what happened, and answer a couple of frequently asked questions below.

We’re sorry for any issues that this outage may have caused for you. If you can’t find your issue below, or a described fix doesn’t work for you, please don’t hesitate to get in touch with our support engineers at support@travis-ci.com. Thank you for your understanding!

Details

Who is affected by the GCE Images Outage?

This outage only applies to trusty or precise images running on GCE TrustyBeta or Standard sudo: required environments. Neither the sudo: false container-based builds nor OSX builds were affected. You’ll find general information about the Travis Build Environment in our Build Environment Documentation.
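For illustration, a .travis.yml along the following lines would have been routed to the affected sudo-enabled images. This is only a minimal sketch: the language and script values are placeholders and are not taken from any particular project.

```yaml
# Sketch of a configuration that ran on the affected sudo-enabled GCE images.
sudo: required     # sudo-enabled builds; sudo: false (container-based) was not affected
dist: trusty       # precise images were affected in the same way
language: python   # placeholder value for this sketch
script:
  - make test      # placeholder command for this sketch
```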

How many repositories were affected?

While we are still gathering information on the full scope of the outage, we do know at this time that out of all builds on all infrastructures, approximately 4.5% ended in either a “fail” or “errored” state after the outage despite having had a “passed” state before it. Our GCE infrastructure accounts for 25-30% of this total, so this brings it down to 1.5-2% of all builds. We are working to query more granular statistics for specific repos affected on the build environments.

What exactly happened?

On Tuesday, August 9th, at 14:00 UTC, a GCE infrastructure image cleanup script malfunctioned and deleted all precise and trusty build environment images on GCE created before August 7th, 2016. These images were permanently deleted and could not be recovered. Though we've been updating our Chef cookbooks for the trusty and precise images, the image stacks on GCE themselves were not updated. In addition, new trusty and precise images that have been in production since August 1st were being tested out in group: edge and group: dev. After deliberating between re-creating existing images or rolling forward to new ones, we decided to roll forward with these newer build environments.

What is the current status?

The images previously tagged group: edge and group: dev are now the default unless specified otherwise.

What’s the status of fixing this issue?

We’ve stabilized the cleanup process with a fix to prevent further occurrences of this same problem. This is a semi-permanent measure and we’re working on a long-term solution. We’ll publish details as soon as this is completed.


Fixed Issues

Network IPv6 errors.

Though GCE disables IPv6, we discovered that the new Ubuntu trusty source image enabled IPv6, causing errors in builds that were using IPv4. This has been fixed.

The command sudo pip install . fails because the process can't write to /opt/:

The issue has been fixed.

My builds seem to work fine, but all my integration tests fail

We've released fixes to our build images over the last days which should fix this issue. Please restart your build or push a new commit. If this doesn’t solve the issue, please get in touch with our support engineers at support@travis-ci.com, and please include a link to the resulting build log if your build is still broken.

git config user.* errors / git commit errors

The new images caused an issue for tests that attempted to git commit: the commit failed because no user.{$VALUE} git config was set and git was unable to fall back to deriving a last-ditch user.email from the hostname. This led to git commit errors for some users. A fix has been merged and is available on trusty images with group: edge.
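A common way to avoid this class of error, independent of the image fix, is to set a git identity explicitly before any step that commits. The sketch below assumes that approach. The name and email values are placeholders, and this is not necessarily the fix that was merged into the images.

```yaml
before_script:
  # Give git an explicit identity so that git commit does not have to guess one.
  # Both values are placeholders chosen for this sketch.
  - git config --global user.email "ci@example.com"
  - git config --global user.name "CI Build"
```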

perlbrew missing in new images

An issue where new trusty images did not contain the perlbrew binary has been fixed.

MongoDB: the service is now mongod on all distributions except Precise

We install MongoDB 3.x on all distributions except Precise, and on those distributions the name of the service is now mongod. On Precise the name of the service remains mongodb. Problems starting MongoDB have been fixed.
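For builds that start MongoDB by hand rather than through the build configuration, the name change can be absorbed by trying both service names. The following is a minimal sketch that assumes a sudo: required build and a manual start step. It is not an official recommendation.

```yaml
sudo: required
before_script:
  # Try the new service name first, then fall back to the name used on Precise.
  - sudo service mongod start || sudo service mongodb start
```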

MongoDB: service missing on dist: trusty; group: edge

Issues with the mongodb service missing from dist: trusty; group: edge have been fixed.

Go binary unavailable in non-Go builds

An issue where the Go binary was unavailable in non-Go builds has been fixed.

PHP runtime issues

We've configured many php runtimes, with the most recent being phpenv global hhvm. If you are experiencing compile or runtime errors, please email support@travis-ci.org.


Known Issues

The following issues are actively being worked on or still under investigation. For a majority of them, we’ve currently published workarounds. We’ll update the linked GitHub issues and the blog with further information as it becomes available.

Disk space running out / no space left on device

We have received reports of disk space running out on the new sudo-enabled images, in tests that passed prior to the recent updates. If you’re experiencing this issue, please add df -h as an after_failure step in your build configuration (after_failure: - df -h) and report the results to support@travis-ci.com.
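Laid out as it would appear in a .travis.yml, the diagnostic step described above looks like this:

```yaml
after_failure:
  # Print free disk space per filesystem when the build fails.
  - df -h
```

The df -h output then appears at the end of the failed job's log and can be copied into the support request.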

selenium-webdriver issues

We’re continuing to investigate several permutations of selenium-webdriver issues. We have deployed a workaround for some issues involving selenium server at the network error level.

However, this workaround does not work for all selenium-related issues. If you are experiencing issues not covered by it, please get in touch with us at support@travis-ci.com.

php pecl configuration is incorrect on Trusty

Issue here: the pre-installed versions remain, and they still need to be updated with the new archives.

docker gateway changed

Many changes come with the upgrade from docker 1.9.1 to 1.12.0. The docker gateway used to be configured as 172.72.42.1, and it has changed to the docker standard 172.17.0.1. So far, specific issues with gateway configuration have been fixed by updating to the new address.
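One way to make a build less sensitive to this kind of change is to read the bridge address at run time instead of hard-coding it. The sketch below assumes the default docker0 bridge interface exists on the build machine and that GNU grep is available. DOCKER_GATEWAY is a hypothetical variable name used only for illustration.

```yaml
sudo: required
dist: trusty
services:
  - docker
before_script:
  # Read the IPv4 address of the docker0 bridge instead of hard-coding it.
  # DOCKER_GATEWAY is a hypothetical variable name chosen for this sketch.
  - export DOCKER_GATEWAY=$(ip -4 addr show docker0 | grep -Po 'inet \K[\d.]+')
  - echo "docker gateway is $DOCKER_GATEWAY"
```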

docker and docker-compose version upgrades

The docker engine and docker-compose have been upgraded (for dist: trusty with services: docker only). Detailed instructions for downgrading are in this issue comment. If you don’t wish to downgrade, please consult the docker changelog and docker-compose changelog.


We continue to work on fixing the open issues for images and build runtime, and will publish a detailed incident report once all the underlying problems are addressed.

If you can’t find your issue among the ones described above, or a fix / workaround doesn’t solve your issue, please contact our support engineers at support@travis-ci.com. Thank you for your patience!

-The Travis Team


Job config is here!

Lisa Passing,

Today we released a new feature in our web UI, the job config 🎉

You can find it wherever we display the job log, in a new secondary tab navigation.

where to find the job config

We store a job's config as a JSON string and that's how our API delivers it. This is also how we present it in the UI.

"Why display the config of a job?" you might wonder. A lot of projects have build matrixes configured (for example to test on different operating systems), meaning a single push triggers multiple jobs, each with a different configuration. If no matrix is configured the build will start with one job only.

Having an overview of how a job is configured gives insight into what's happening in the job log and hopefully answers questions about why certain things are executed the way they are.
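To make the matrix case concrete, here is a minimal sketch of a .travis.yml that tests on two operating systems. The language value is a placeholder. With this configuration a single push produces two jobs, one per os entry, and each job's config tab would show its own os value.

```yaml
language: ruby   # placeholder value for this sketch
os:
  - linux        # first job
  - osx          # second job
```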

We hope you find this just as useful as we do!

Happy configuring and building!


GCE Images outage

Brandon Burton,

We are deeply sorry for the outage of our sudo-enabled Precise and Trusty builds this morning.

We've determined that an image cleanup process malfunctioned and deleted the stable build environment images. Due to various factors our only option at this point is to roll forward with newer build environment images, for both Precise and Trusty.

We're resuming builds with these new images, but these new images have not undergone thorough testing, so it is possible your builds will break.

If you see issues with your builds failing, please email support@travis-ci.com. Our engineering staff is available to help with any issues you run into.

We're also continuing to investigate the exact cause of the issue and will publish a postmortem as we learn more.