A few weeks ago, we had spurious alerts go off at fairly regular intervals,
usually even at the same time of day, when our open source platform was busiest.
One of our production alerts watches the queue size of log message chunks
waiting to be processed. When something goes wrong, those chunks can pile up
quickly, delaying live log updates.
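For illustration, here's a minimal sketch of how a queue-depth check like this could work against RabbitMQ's management HTTP API. This isn't our actual alerting code; the host, credentials, queue name, and threshold below are made up for the example:

```python
import requests

# Hypothetical values for illustration only; not our real host, queue, or threshold.
RABBITMQ_API = "http://localhost:15672/api/queues/%2F/log_chunks"
ALERT_THRESHOLD = 10_000

def check_queue_depth():
    # The RabbitMQ management plugin exposes queue stats as JSON,
    # including the number of messages currently waiting in the queue.
    stats = requests.get(RABBITMQ_API, auth=("guest", "guest"), timeout=5).json()
    waiting = stats["messages"]
    if waiting > ALERT_THRESHOLD:
        print(f"ALERT: {waiting} log chunks waiting, live log updates may lag")

if __name__ == "__main__":
    check_queue_depth()
```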
These alerts started going off out of the blue, so we were very suspicious.
Unfortunately, our metrics didn't show us much, as they're aggregated across all
dynos processing the chunks. As a quick fix we increased the number of processes
plowing through them, which reduced the number of occurrences but didn't make
the pain go away.
Looking at our RabbitMQ console, we noticed that some processes had fewer open
channels than others, meaning the channels were being closed for some unknown
reason. That reduces the number of concurrent messages each process can handle
and eventually causes messages to queue up.
We started tracking more fine-grained metrics for every dyno, based on the dyno
number that's stored in the $DYNO environment variable. We soon noticed in the
metrics when certain processes suddenly dropped in their message processing
volume. In the graph below, you can see that before the first set of annotations
(an annotation is added every time one of our apps is deployed), there was only
a single metric tracking the aggregated processing volume.
With the first two annotations we deployed the change to get more detailed
metrics. The processing should be fairly smooth, but this graph shows erratic
behaviour for a few dynos. Normally, all processes should handle a roughly
equal number of messages, as RabbitMQ distributes messages on a round-robin
basis across all open channels.
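As a rough sketch of the idea (not our actual code), tagging every counter with the value of $DYNO is enough to get a per-process time series; the StatsD client and metric name below are just assumptions for the example:

```python
import os
import statsd  # assumes the `statsd` package and a StatsD agent on localhost

# $DYNO identifies the dyno, e.g. "worker.3"; fall back for local runs.
dyno = os.environ.get("DYNO", "unknown").replace(".", "-")
metrics = statsd.StatsClient("localhost", 8125)

def record_processed_chunk():
    # One counter per dyno instead of a single aggregated counter,
    # so a single misbehaving process stands out in the graphs.
    metrics.incr(f"logs.chunks.processed.{dyno}")
```

Graphed side by side, these counters make a dyno whose channels have been closed show up as a sagging line rather than disappearing into the aggregate.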
Now we were able to correlate these drops with the logs, and we soon found
errors from when channels were closed. We found a fix quickly and haven't had
any alerts since. You can see where the fix was deployed around the second set
of annotations on the graph.
Having these more detailed metrics helped us figure out the problem much better
and faster than our aggregated metrics did. For dynos with a high processing
volume, make sure to have enough detail in your metrics to track down culprits
at as low a level as possible.
Alert fatigue is a real problem. Any increase in alerts should be a warning
sign for your team to investigate the issue as soon as possible. With
increasing spurious alerts, attention is drawn away from the important
production alerts.
What was the culprit?
So what caused the issue in the end?
A bug was introduced (by me!) in the message handling. When an error occurred
while parsing a message, some messages were acknowledged twice; we noticed
these errors in the logs around the time the message processing dropped. The
double acknowledgement caused an error from RabbitMQ, which closed the channel,
lowering the number of threads processing messages every time it happened.
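To make the failure mode concrete, here is a hedged sketch in Python with pika; our consumers aren't actually written this way, and the queue name and parsing function are stand-ins. Acknowledging the same delivery tag twice makes RabbitMQ close the channel with a PRECONDITION_FAILED "unknown delivery tag" error, which is the behaviour we were seeing:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

def parse(body):
    # Stand-in for the real chunk parsing; raises ValueError on bad input.
    return body.decode("utf-8")

def on_message(ch, method, properties, body):
    tag = method.delivery_tag
    try:
        parse(body)
        ch.basic_ack(tag)
    except ValueError:
        # The bug amounted to acknowledging here *as well as* on the normal
        # path. RabbitMQ treats the second ack as an unknown delivery tag and
        # closes the whole channel (406 PRECONDITION_FAILED), quietly removing
        # one consumer each time a malformed message shows up.
        ch.basic_reject(tag, requeue=False)  # reject (or nack) instead of a second ack

channel.basic_consume(queue="log_chunks", on_message_callback=on_message)
channel.start_consuming()
```

Each closed channel takes a consumer with it, which matches the per-dyno drops in processing volume we saw in the graphs.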
When testing a private repository, you may need to fetch private dependencies,
like a private git submodule. A common approach to authorizing access to a
submodule is to use a private SSH key with access to multiple repositories on GitHub.
Until recently, the only way to set a custom SSH key was to put it in the .travis.yml file.
However, there were security concerns attached to this approach.
To improve on that, we're introducing a way to add an SSH key in the UI:
After clicking "Add a custom SSH key" you can add a key, which will then be
used in your builds.
To make it easier to identify the SSH key in use, we now also display its fingerprint in the build logs:
The SSH key added through the UI is securely stored in our DB in an encrypted form.
To reduce the attack surface, we recommend using a user with as little access
as possible, ideally limited to only the repositories used as dependencies.
Other ways to add private dependencies
We hope you will find the new UI addition useful. If there's anything else you
would like to know about this way of dealing with dependencies, or you want to
find out about the alternatives, we've created a documentation page on the subject:
We have added a lot of nice changes to our cookbooks recently, so
we decided to roll out the updates to you sooner rather than later!
Here are the details of the August 2014 updates.
Due to the JDK bug discussed below, Oracle JDK 7 will remain at 7u60,
and Oracle JDK 8 at 8u5.
Even though OpenJDK 7u65 contains the bytecode verifier bug,
we are unable to offer a reasonable alternative without it.
Setting the environment variable _JAVA_OPTIONS=-Xverify:none or
_JAVA_OPTIONS=-XX:-UseSplitVerifier should mitigate this issue.
This announcement originally mentioned a MongoDB update from 2.4.x to 2.6.4.
We discovered a problem with the plan, however, and decided to postpone
this MongoDB update until we can provide a more solid upgrade plan.
We apologize for the inconvenience, and thank you for your understanding.
With this update, the default version of Leiningen will be 2.4.3.
Leiningen 1.x will be available as lein1.
lein will point to Leiningen 2.4.3, but lein2 will also be available as before.
Repositories that still need Leiningen 1.x should be updated to invoke lein1 instead.
Scala is updated to 2.11.2, sbt to 0.13.5.
In addition, Scala 2.9.2 and 2.10.2 are preinstalled.
These versions address a problem with cross-compilation and the build failures
described in http://www.typesafe.com/blog/what-happened-to-my-travis-ci, respectively.
Version updates include:
We've also added 5.5.9 back. This is the version supported on Ubuntu 14.04 LTS.