At The Data Incubator, we pride ourselves on having the latest data science curriculum. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are looking to learn. However, we wanted to develop a more data-driven approach to what we should be teaching in our data science corporate training and our free fellowship for masters and PhDs looking to enter data science careers in industry. Here are the results.
Below is a ranking of Python packages that are useful for Data Science, based on Github and Stack Overflow activity, as well as PyPI (The Python Package Index) downloads.
The table shows standardized scores, where a value of 1 means one standard
deviation above average (average = score of 0). For example, numpy
is 2
standard deviations above average in Stack Overflow activity, while
tensorflow
is close to average. See below for methods.
The ranking is based on equally weighing its three components: Github (stars
and forks), Stack Overflow (tags and questions), and PyPI downloads. These were
obtained using available APIs and a fork of the vanity
project. Coming up
with a comprehensive list of data science packages was tricky - in the end, we
scraped three different lists that we thought were representative
(see methods below for details). Computing standardized scores for
each metric allows us to see which packages stand out in each category (look at
tensorflow
's Github score!).
Due to massive total downloads and strong Stack Overflow activity, the clear
leader is numpy
(a fundamental package for scientific computing with
Python).
However, the scalable machine learning package tensorflow
(stared at Google)
trounces the other libraries in Github activity (based on both stars and
forks), with the more general machine learning module scikit-learn
a distant
second, but fifth overall.
Both numpy
and pandas
(high-performance data structures and data analysis
package) are only average on Github, but strong in the other two categories.
The interactive interpreter ipython
is fourth overall, while the jupyter
project (of the popular notebook) is 19th overall (not shown).
The full ranking is here, while the raw data is here.
There appears to be an inverse correlation in Github activity compared to Stack
Overflow and Downloads for the top packages. For example, there are a lot of
Stack Overflow questions for numpy
and pandas
compared to tensorflow
and
scikit-learn
, but the latter two have an edge on Github. Since numpy
and
pandas
are two "utility" packages, perhaps more people are actually using
them (and need help).
Among the core libraries, numpy
is a clear first, with pandas
third,
ipython
fourth, and scipy
(an ecosystem of open-source software for
mathematics, science, and engineering) ninth.
The other deep-learning package, Theano
, is a big distance behind
tensorflow
in this ranking. The interactive and
polished TensorFlow Playground could be a
factor.
As expected, matplotlib
(2D plotting library) is the most popular graphics
package, but the ranking also features plotly
(interactive,
publication-quality graphs that can be easily published online) and bokeh
(an
interactive visualization library that targets modern web browsers for
presentation). ggpy
(Python port for R's popular ggplot2
package) was 18th
overall, but the data is less reliable, as noted in the next section.
As with any analysis, decisions were made along the way. All source code and data is on our Github Page.
The full list of machine learning packages came from a few sources,
and a few packages were unranked, due to unavailable downloads or Github
data. These are: basemap
(mapping with matplotlib
), d3py
(D3-like
plotting), jupyter-notebook
, mlpy
(machine learning based on scipy
and
numpy
), pylearn2
(machine learning, based on theano
), pytables
(big
tables), and shogun
(machine learning). They were all below average compared
to the ranked packages, in all categories.
Importantly, the Anaconda distribution bundles together many of these packages, and this was not considered.
Further, naturally, some packages that have been around longer will have higher metrics, and therefore higher ranking. This is not adjusted for in any way.
The data presented a few difficulties:
- The python port for
ggplot
was recently renamed toggpy
, and we used the latter for all metrics, except downloads (which usedggplot
). ipython notebook
is nowjupyter notebook
. Stack overflow auto-correctedipython-notebook
tojupyter-notebook
so we combined these results withjupyter-notebook
results from the other two sources. Butjupyter-notebook
didn't have a downloads count, so it doesn't feature in the final ranking. Instead, we also performed analyses foripython
andjupyter
individually, which do appear in the ranking.- The Pattern package has inflated Stack Overflow (SO) question metrics since it's a common word. It has no tags data, as SO auto-corrects the query "[pattern]" to something unrelated.
- SO data for
plotly
may also be inflated, as it's both an R and Python package.
All source code and data is on our Github Page.
We first generated a list of Data Science packages from these three sources, and then collected metrics for all of them, to come up with the ranking.
Github data is based on both stars and forks, while Stack Overflow data is based
on tags and questions containing the package name. Downloads are from PyPI,
using
a fork of the vanity
project. Other
projects to get download counts
include pypi-download-stats
(couldn't get it to work)
and pypi-ranking.info
(gives smaller
numbers than vanity
).
A few other notes:
sqlite3
was removed from analysis, as it is a base Python module.ggplot
downloads were combined withggpy
results from Github and Stack Overflow.- Any unavailable Stack Overflow counts were converted to zero count.
- Counts were standardized to mean 0 and deviation 1, and then averaged to get Github and Stack Overflow scores, and, combined with the Downloads, the Overall score.
- Some manual checks were done to confirm Github repository location.
All data was downloaded on January 26, 2017.
Source code is available on The Data Incubator's Github. If you're interested in learning more, consider
- Data science corporate training
- Free eight-week fellowship for masters and PhDs looking to enter industry
- Hiring Data Scientists
Michael Li and Paul Paczuski.