This is a guest post from Moritz Beller from the Delft University of Technology in The Netherlands. His team produced amazing research on several million Travis CI builds, creating invaluable insights for us and, more importantly, the open source community as a whole. They’ve used the data they’ve collected in their research and turned it into an amazing project and build log treasure trove, TravisTorrent. Thank you! – Mathias Meyer, CEO
Last year in July, I had the great opportunity to pitch to readers of this blog what we researchers at the Delft University of Technology learned from analyzing 2+ million build logs on Travis CI. Today, I want to draw your attention to how easily you can discover interesting facts about Travis CI-enabled projects yourself, kickstart your career as a data miner, and possibly win an awesome prize to spend on Travis CI’s products.
Every year since 2006, the Mining Software Repositories (MSR) conference has hosted a special track called the “Mining Challenge,” in which one common data set is made publicly available and everyone is invited to do their most daring and most creative research on it, write it down in a report, and send it in. In past years, the data sets were about the GNOME desktop project, GitHub, and Stack Exchange. This year, the data set is on Travis CI, and more specifically its build logs, which are a rich source of information that is often overlooked and not easy to extract in the first place. What’s more, your project might even be included in the data set.
The good news is that we have already done the parsing: our data set, TravisTorrent, contains more than 60 facts for each of the 3.7 million Travis CI jobs it includes. These facts are diverse, ranging from which test failed and how many people contribute to a project to what its previous build was. To inspire you a bit: this data set allows you to find out, for example, how long builds are typically broken, whether this might have an adverse influence on the number of participating developers, or whether just one test might be responsible for breaking most of the builds. The possibilities are almost endless.
Despite its enormous raw size of over 2 TB, the TravisTorrent data set is very easy to work with, since the aggregated version relevant for the challenge is just 3.2 GB, or a mere 200 MB when zipped. Moreover, we provide a MySQL web interface for your first steps and offer CSV and SQL download options. We also have extensive documentation for each of the columns we gather, and if questions remain, an active Disqus community is there to answer them.
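To give a feel for a first experiment with the CSV download, here is a minimal sketch of one of the questions above: how long builds stay broken. The column names `project`, `build_id`, and `status`, the status value `passed`, and the sample rows are hypothetical placeholders, not the actual TravisTorrent schema. The sketch also assumes one row per build, ordered by build id; check the column documentation for the real names and granularity before adapting it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the downloaded CSV. The column names
# are assumptions for illustration, not the real TravisTorrent schema.
SAMPLE_CSV = """project,build_id,status
alpha/app,1,passed
alpha/app,2,failed
alpha/app,3,failed
alpha/app,4,passed
beta/lib,1,failed
beta/lib,2,passed
beta/lib,3,failed
"""


def broken_streaks(csv_file):
    """Return, per project, the lengths of consecutive runs of non-passing builds."""
    builds = defaultdict(list)
    for row in csv.DictReader(csv_file):
        builds[row["project"]].append((int(row["build_id"]), row["status"]))

    streaks = {}
    for project, rows in builds.items():
        rows.sort()  # order by build id, assumed here to be chronological
        runs, current = [], 0
        for _, status in rows:
            if status != "passed":
                current += 1
            elif current:
                runs.append(current)
                current = 0
        if current:  # a streak that is still open at the end of the data
            runs.append(current)
        streaks[project] = runs
    return streaks


result = broken_streaks(io.StringIO(SAMPLE_CSV))
print(result)
```

To try this on the real download, one would pass an open file handle instead of the in-memory sample and replace the placeholder column names with the documented ones.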
Taking part in the challenge is as easy as playing around with the data set and then submitting a paper on your findings. The deadline is February 20th, so there is enough time left, and the paper may be at most 4 pages long. As an additional bonus, Travis CI sponsors the mining challenge with a $200 voucher.
Lastly, I want to stress that really everybody can take part. It is absolutely not necessary to have a university background. In fact, we welcome all contributions, and to make sure everybody has the same chances, the identity of the authors will be blinded during review.