This repository contains the tools used to generate the data on TravisTorrent. These include:
- Travis Poker (bin/travis_poker.rb), which checks en masse whether a project has a Travis build history
- Travis Harvester (bin/travis_harvester.rb), which downloads Travis build logs
- Travis BuildLog Analyzer (bin/buildlog_analysis.rb)
- Build Metadata Extractor (bin/build_data_extraction.rb)
The following works on Debian Jessie:
$ apt-get install ruby ruby-dev bundler pkg-config libmysqlclient-dev
$ git clone git@github.com:TestRoots/travistorrent-tools.git
$ cd travistorrent-tools
$ bundle install
The file projects.txt contains a list of non-toy, non-fork, active GitHub projects. It was retrieved from GHTorrent by running the query:
select u.login, p.name, p.language, count(*)
from projects p, users u, watchers w
where
p.forked_from is null and
p.deleted is false and
w.repo_id = p.id and
u.id = p.owner_id
group by p.id
having count(*) > 50
order by count(*) desc
You can then call the Travis Poker to see whether these projects use Travis CI or not. Projects will be annotated with a binary flag indicating this.
To further process the list generated by Travis Poker, do
grep "true" results.csv > travis_enabled
sed -i 's/\([^,]*\),\([^,]*\).*/\1 \2/' travis_enabled
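For readers less familiar with sed, the two commands above can be sketched in Ruby as follows. This is only an illustration: the helper name `travis_enabled` and the sample rows are hypothetical, and the actual column layout of results.csv may differ from what is assumed here (owner and project name in the first two comma-separated fields).

```ruby
# Illustrative Ruby equivalent of the grep + sed pipeline above.
# Keeps every line containing "true" and rewrites "owner,name,..." to "owner name".
def travis_enabled(lines)
  lines
    .select { |line| line.include?('true') }                       # grep "true"
    .map { |line| line.chomp.sub(/([^,]*),([^,]*).*/, '\1 \2') }   # sed 's/.../\1 \2/'
end

# Hypothetical sample rows; the real results.csv may have other columns.
sample = [
  "alice,webapp,Ruby,true\n",
  "bob,toy-project,Ruby,false\n",
  "carol,parser,Java,true\n"
]

puts travis_enabled(sample)
```

Like grep, the filter matches "true" anywhere in the line, so a project whose name happens to contain "true" would also be kept.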
This list can now be passed to the Travis Harvester, for which we use GNU parallel.
Retrieve the build logs of 20 GitHub projects simultaneously (beware: depending on your network connection, this puts a heavy load on Travis CI!):
cat travis_enabled | parallel -j 20 --colsep ' ' ruby bin/travis_harvester.rb
To extract features for one project, do
ruby -Ibin bin/build_data_extraction.rb stripe brushfire github-token
where github-token is a valid GitHub OAuth token used to download information about commits. To configure access to the required GHTorrent MySQL and MongoDB databases, copy config.yaml.tmpl to config.yaml and edit accordingly. You can get direct access to the GHTorrent MySQL and MongoDB databases using this link.
To extract features for multiple projects in parallel, you need
- A file (project-list) of projects, in the format specified above
- A file (token-list) of one or more GitHub tokens, one token per line
Then, run
./bin/project_token.rb project-list token-list | sort -R > projects-tokens
./bin/all_projects.sh -p 4 -d data projects-tokens
This will create a file with tokens equi-distributed to projects, create a directory data, and start 4 instances of the build_data_extraction.rb script.
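To illustrate what an equal distribution of tokens over projects can look like, here is a minimal round-robin sketch. It is not taken from bin/project_token.rb, whose actual strategy and output format may differ; the helper name `distribute_tokens`, the sample entries, and the "owner repo token" output layout are assumptions made for this example.

```ruby
# Assign tokens to projects round-robin, so that each token serves
# (almost) the same number of projects. Illustrative sketch only.
def distribute_tokens(projects, tokens)
  raise ArgumentError, 'need at least one token' if tokens.empty?

  projects.each_with_index.map do |project, i|
    [project, tokens[i % tokens.size]]
  end
end

projects = ['owner1 repo1', 'owner2 repo2', 'owner3 repo3'] # hypothetical entries
tokens   = %w[tokenA tokenB]                                 # placeholder tokens

# Assumed output layout: "owner repo token", one project per line.
distribute_tokens(projects, tokens).each do |project, token|
  puts "#{project} #{token}"
end
```

With this scheme, the number of projects handled by any two tokens differs by at most one, which helps spread GitHub API rate limits evenly.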
Our buildlog dispatcher handles everything that you typically want: it generates one convenient output file (a CSV) per project directory and automatically dispatches to the correct buildlog analyzer. You can start the per-project analysis (typically on a directory structure checked out through travis-harvester) via
ruby bin/buildlog_analysis.rb directory-of-project-to-analyze
To analyze all build logs, parallel helps us again:
ls build_logs | parallel -j 5 ruby bin/buildlog_analysis.rb "build_logs/{}"
http://docs.travis-ci.com/user/customizing-the-build/
broken <- (errored|failed)
errored <- infrastructure
failed <- tests
canceled <- user abort
If any of the commands in the first four stages returns a non-zero exit code, Travis CI considers the build to be broken.
When any of the steps in the before_install, install or before_script stages fails with a non-zero exit code, the build is marked as errored.
When any of the steps in the script stage fails with a non-zero exit code, the build is marked as failed.
Note that the script section has different semantics to the other steps. When a step defined in script fails, the build doesn’t end right away; it continues to run the remaining steps before it fails the build.
Currently, neither after_success nor after_failure has any influence on the build result. Travis has plans to change this behaviour.
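The rules above can be summarised in a small Ruby sketch. The helpers `build_status` and `broken?` are hypothetical and exist only for illustration; they are not part of this repository or of any Travis CI API. The sketch assumes the caller already knows in which stages a command exited with a non-zero code, and it does not model canceled builds.

```ruby
# failed_stages: names of the stages in which a command exited non-zero.
# Returns :errored, :failed or :passed according to the rules described above.
def build_status(failed_stages)
  errored_stages = %w[before_install install before_script]

  # A failure before the script stage ends the build early and marks it errored.
  return :errored if failed_stages.any? { |stage| errored_stages.include?(stage) }
  # A failure in the script stage marks the build as failed.
  return :failed if failed_stages.include?('script')

  :passed # after_success / after_failure do not influence the result
end

# A build counts as broken when it is either errored or failed.
def broken?(status)
  %i[errored failed].include?(status)
end

puts build_status(['install'])       # failure during setup
puts build_status(['script'])        # failing tests
puts build_status(['after_success']) # no influence on the result
```

Checking the setup stages first reflects that a failure there stops the build before the script stage runs.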