How Finer-Grained Metrics in a High-Traffic Component Helped Us Uncover a Code Bug

Mathias Meyer

A few weeks ago, we had spurious alerts go off at fairly regular intervals, usually even at the same time of day, when our open source platform was busiest.

One of our production alerts looks at the size of the queue of log message chunks waiting to be processed. These chunks can pile up quickly when something goes wrong, which delays live log updates.

This started happening out of the blue, so we were very suspicious. Unfortunately, our metrics didn't show us much, as they're aggregated across all dynos processing the chunks. As a quick fix, we increased the number of processes plowing through them, which reduced the number of occurrences but didn't make the pain go away.

Looking at our RabbitMQ console, we noticed that some processes had fewer open channels than others. Channels were being closed for some unknown reason, which reduced the number of messages each process could handle concurrently and eventually caused messages to queue up.

How did we find the problem?

We pulled a trick from a previous blog post we pushed a few weeks ago, on how you can aggregate metrics across multiple Heroku processes.

We started tracking more fine-grained metrics for every dyno, based on the dyno number that's stored in the $DYNO environment variable. We soon noticed in the metrics when certain processes suddenly dropped in their message processing volume. In the graph below, you can see that before the first set of annotations (an annotation is added every time one of our apps is deployed), there was only a single metric tracking the aggregated processing volume.
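As a rough illustration of the per-dyno tracking described above, here is a minimal Python sketch. The metric name `logs.chunks.processed`, the `MessageMeter` class, and the in-memory counter are all hypothetical stand-ins; the only thing taken from our setup is that the dyno name comes from the DYNO environment variable.

```python
import os
from collections import Counter


def dyno_metric_name(base, environ=None):
    """Append the dyno name to a metric key so each dyno reports separately."""
    env = os.environ if environ is None else environ
    # Heroku sets DYNO to values such as "worker.3"; fall back if it is unset.
    dyno = env.get("DYNO", "unknown")
    return f"{base}.{dyno}"


class MessageMeter:
    """Hypothetical in-memory meter; a real one would report to a metrics service."""

    def __init__(self, base, environ=None):
        self.key = dyno_metric_name(base, environ)
        self.counts = Counter()

    def mark(self, n=1):
        # Count n processed messages under this dyno's own key.
        self.counts[self.key] += n


# The environment is passed in explicitly here so the example runs anywhere.
meter = MessageMeter("logs.chunks.processed", {"DYNO": "worker.3"})
meter.mark()
meter.mark(2)
print(dict(meter.counts))  # {'logs.chunks.processed.worker.3': 3}
```

Keeping the base name first means the aggregated view can still be built by summing every key that shares the prefix.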

With the first two annotations we deployed the change to get more detailed metrics. Processing should normally be fairly smooth, but this graph shows erratic behaviour from a few dynos. All processes should handle a roughly equal number of messages, as RabbitMQ distributes messages on a round-robin basis across all open channels.
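Because every dyno should see a roughly equal share of messages, a dyno that falls well below its peers stands out. We spotted this by eye in the graph; the sketch below shows how the same check could be automated, assuming per-dyno counts for one time window are available as a dictionary. The function name, the 50% threshold, and the sample numbers are made up for illustration.

```python
from statistics import median


def lagging_dynos(counts, ratio=0.5):
    """Return dynos whose message count is below `ratio` times the median."""
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(name for name, n in counts.items() if n < typical * ratio)


# Invented sample data: worker.3 handled far fewer messages than its peers.
sample = {"worker.1": 980, "worker.2": 1010, "worker.3": 310, "worker.4": 995}
print(lagging_dynos(sample))  # ['worker.3']
```

The median is used rather than the mean so that a single struggling dyno does not drag the baseline down with it.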

Now we were able to correlate these drops with the logs, and we soon found errors logged whenever channels were closed. We found a fix quickly and haven't had any alerts since.

You can see where the fix was deployed around the second set of annotations on July 4th.

Having these more detailed metrics helped us pin down the problem far better and faster than our aggregated metrics did. For dynos with a high processing volume, make sure your metrics have enough detail to find culprits at as low a level as possible.

Alert fatigue is a real problem. Any increase in alerts should be a warning sign for your team to investigate the issue as soon as possible. As spurious alerts increase, attention is drawn away from the important production issues.

What was the culprit?

So what caused the issue in the end?

A bug was introduced (by me!) in the message handling. When an error occurred while parsing a message, some messages were acknowledged twice. That caused an error from RabbitMQ, which closed the channel, lowering the number of threads processing messages every time. We had noticed these parsing errors in the logs around the times the message processing dropped.

The fix turned out to be just moving the acknowledgement of the message outside of a retry block we were using to catch errors related to writing to the database. Simple, yet very much facepalm-worthy.
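To make the failure mode concrete, here is a small self-contained Python simulation. It is not our actual code: `FakeChannel`, `with_retries`, and both handlers are hypothetical, and the ordering inside the buggy retry block (acknowledge, then a write that fails) is just one assumed way an acknowledgement can end up being repeated. What it does mirror is that RabbitMQ answers a second acknowledgement of the same delivery with a channel-level error that closes the channel.

```python
class ChannelError(Exception):
    """Stand-in for the channel-level error RabbitMQ raises."""


class FakeChannel:
    """Toy channel: acknowledging the same delivery tag twice closes it."""

    def __init__(self):
        self.open = True
        self.acked = set()

    def ack(self, tag):
        if not self.open:
            raise ChannelError("channel is closed")
        if tag in self.acked:
            self.open = False
            raise ChannelError("PRECONDITION_FAILED - unknown delivery tag")
        self.acked.add(tag)


def with_retries(action, attempts=3):
    """Retry `action` when it raises OSError (our stand-in for a DB error)."""
    for attempt in range(attempts):
        try:
            return action()
        except OSError:
            if attempt == attempts - 1:
                raise


def make_flaky_write(failures=1):
    """Build a write function that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def write():
        if state["left"] > 0:
            state["left"] -= 1
            raise OSError("simulated database write failure")

    return write


def handle_buggy(channel, tag, write):
    # Bug: the ack sits inside the retried block, so a retry acks again.
    def attempt():
        channel.ack(tag)
        write()

    with_retries(attempt)


def handle_fixed(channel, tag, write):
    # Fix: retry only the write, then acknowledge exactly once.
    with_retries(write)
    channel.ack(tag)


buggy_channel = FakeChannel()
try:
    handle_buggy(buggy_channel, 1, make_flaky_write())
except ChannelError as err:
    print("buggy handler:", err)

fixed_channel = FakeChannel()
handle_fixed(fixed_channel, 1, make_flaky_write())
print("buggy channel open:", buggy_channel.open)
print("fixed channel open:", fixed_channel.open)
```

In the buggy variant a single transient write failure is enough to close the channel, while the fixed variant survives the same failure and acknowledges the message once.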

It all boiled down to the finer-grained metrics in these high-volume processes, which eventually let us find and fix the culprit.


Manage Private Dependencies More Easily

Piotr Sarnacki

When testing a private repository, you may need to fetch private dependencies, like a private git submodule. A common approach to authorize a submodule is to use a private key with access to multiple repositories on GitHub.

Until recently, the only way to set a custom SSH key was to put it in the .travis.yml file. However, there were security concerns attached to this approach.

To improve on that, we're introducing a way to add an SSH key in the UI:

SSH Key screen in the Repository Settings

After clicking "Add a custom SSH key", you can add a key which will then be used in your builds:

Adding an SSH Key in the Repository Settings

To make it easier to identify the SSH key in use, we now also display its fingerprint in the build logs:

SSH Key fingerprint in logs

Security

The SSH key added through the UI is stored in our database in encrypted form. To keep the attack surface small, we recommend using a key that belongs to a user with as little access as possible, limited to only the repositories used as dependencies.

Other ways to add private dependencies

We hope that you will find the new UI addition useful. If there's anything else you would like to know about this specific way of dealing with dependencies, or if you want to find out what the alternatives are, we've created a documentation page on the subject: Private dependencies.


Upcoming Build Environment Updates -- August

Hiro Asari

We have added a lot of nice changes to our cookbooks recently, so we decided to roll out the updates to you sooner rather than later!

Here are the details of the August 2014 updates.

Update

Due to the JDK bug discussed below, Oracle JDK 7 will remain at 7u60, and Oracle JDK 8 at 8u5.

Even though OpenJDK 7u65 contains the bytecode verifier bug, we are unable to offer a reasonable alternative without it. Setting the environment variable _JAVA_OPTIONS=-Xverify:none or _JAVA_OPTIONS=-XX:-UseSplitVerifier should mitigate this issue.

Update

This announcement originally mentioned a MongoDB update from 2.4.x to 2.6.4. We discovered a problem with the plan, however, and decided to postpone this MongoDB update until we can provide a more solid upgrade plan. We apologize for the inconvenience, and thank you for your understanding.

Update schedule

The updates will be rolled out to travis-ci.org at 14:00 UTC on the 27th of August and to travis-ci.com at 14:00 UTC on the 29th of August.

Build Environment Updates

All environments will receive the following updates:

Chromium browser

34.0.1847.116 → 36.0.1985.125

CouchDB

1.5.0 → 1.6.0

ElasticSearch

1.1.1 → 1.3.2

This update includes breaking changes: 1.1 → 1.2, 1.2 → 1.3.

If you need to revert to ElasticSearch 1.1.1 because of these breaking changes, remove the installed version and install 1.1.1:

before_install:
  - sudo apt-get purge elasticsearch
  - curl -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.1.deb
  - sudo dpkg -i elasticsearch-1.1.1.deb

Firefox browser

Included is an update to the latest Extended Support Release (ESR), 31.0esr, which corresponds to Desktop Firefox 31.0.

MySQL

5.5.37 → 5.5.38

OpenJDK 6

6b31 → 6b32

OpenJDK 7

7u55 → 7u65

PostgreSQL

  1. 9.1.13 → 9.1.14
  2. 9.2.8 → 9.2.9
  3. 9.3.4 → 9.3.5

RabbitMQ

3.3.4 → 3.3.5

Sphinx

  1. 2.2.2-beta → 2.2.3-beta
  2. 2.1.8 → 2.1.9

Android VM

Android SDK is updated to 23.0.2. build-tools-20.0.0 is pre-installed.

Gradle 2.0

Gradle has been updated to version 2.0. This release contains potential breaking changes. If you need to go back to version 1.11, add the following to your .travis.yml:

before_install:
  - sudo rm -r /usr/local/gradle
  - curl -LO http://services.gradle.org/distributions/gradle-1.11-bin.zip
  - unzip -q gradle-1.11-bin.zip
  - sudo mv gradle-1.11 /usr/local/gradle

Haskell VM

Haskell Platform is updated to 2014.2.0.0. GHC 7.8.2 is updated to 7.8.3. (Other versions remain the same.)

Java VM (not JVM)

  • Maven is updated to 3.2.3.
  • Leiningen2 is updated to 2.4.3, which is now the default.

Gradle 2.0

See notes above.

Leiningen will default to 2.x

With this update, the default version of Leiningen will be 2.4.3. Leiningen 1.x will be available as lein1.

lein will point to Leiningen 2.4.3, but lein2 will also be available as before.

Repositories that rely on lein being Leiningen 1.x need to be updated to invoke lein1 instead.

Scala

Scala is updated to 2.11.2, sbt to 0.13.5.

In addition, Scala 2.9.2 and 2.10.2 are preinstalled. These versions address problems with cross-compilation and the build failures described in http://www.typesafe.com/blog/what-happened-to-my-travis-ci, respectively.

PHP VM

Version updates include:

  • 5.6.0rc4
  • 5.5.16
  • 5.4.32
  • 5.3.29

We've also added 5.5.9 back. This is the version supported on Ubuntu 14.04 LTS.

Go forth and test!

Happy testing!

Love,

Travis Team