Learning
This document assumes you've already installed orchid. If you haven't, please see the installation instructions first.
To demonstrate how orchid-ml works, we will perform a machine learning task similar to the one in our publication. This process follows the Tissue of Origin Example jupyter notebook, which you can load from the orchid repository. To do this, navigate to the orchid/notebooks directory and start jupyter notebook:
cd notebooks
jupyter notebook --ip=0.0.0.0 --port=8400
Then navigate to http://0.0.0.0:8400/ in your browser and select the Tissue of Origin Example notebook (you may need to use the token provided in the terminal when you first start). All the original notebook output is still there, but you can clear it before continuing by clicking Kernel => Restart and Clear Output.
This whole process takes about an hour using our remote MemSQL database, so please be patient: you're working with more than 1.2 billion data points here! If you'd like to skip the loading and pre-processing steps, simply import the pre-generated example MutationMatrix from our publication, tutorial_premodel.pkl, and begin with the modeling section.
Note: Importing a saved matrix is described below using the load_matrix() function.
Machine learning can be performed in under 10 lines of code with orchid-ml, though we'll use a few more for this demonstration. A general modeling workflow typically goes like this:
- Specify access to the database generated by orchid-db with a SQL connection string.
- Load mutations and features either in their entirety or by a desired subset (e.g., by tumor).
- Encode categorical features using default or user-defined strategies (e.g., one-hot).
- Optionally collapse mutations by tumor (e.g., by averaging).
- Set a prediction label and select features.
- Model data with any of the scikit-learn machine learning algorithms.
In order to use orchid, you've got to import it. We'll also import pandas in case we need to do any basic dataframe manipulation.
from orchid_ml import MutationMatrix
import pandas as pd
This example uses the public multi25 database from our publication, but you can specify your own connection string for locally populated databases. Your connection string should look similar to this one:
db_uri = "mysql://orchid:orchid@wittelab.ucsf.edu:9900/multi25_20170710"
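The connection string follows the usual URI layout: scheme, user, password, host, port, and database name. As a quick sanity check before handing a string to MutationMatrix, you can split it apart with Python's standard library. This is only an illustrative sketch; the helper name describe_db_uri is hypothetical and not part of orchid.

```python
from urllib.parse import urlparse


def describe_db_uri(db_uri):
    """Split a SQL connection string into its parts (hypothetical helper, not part of orchid)."""
    parts = urlparse(db_uri)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


db_uri = "mysql://orchid:orchid@wittelab.ucsf.edu:9900/multi25_20170710"
info = describe_db_uri(db_uri)
print(info["host"], info["port"], info["database"])
```

If any of the printed parts is empty or None, the string is probably malformed.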
First, initialize the MutationMatrix using the database string just specified:
mutations = MutationMatrix(db_uri=db_uri)
Next, if using hg38 build coordinates, request the hg38 table:
mutations.mutation_table = 'ssm_hg38'
Hint: This step is not required when using hg19, as the hg19 table is used by default.
Then load mutations and features:
donor_info = pd.read_csv("donor_metadata.tsv", sep="\t")
mutations.load_mutations(by='donor', ids=list(donor_info['donor_id']))
mutations.load_features()
Hint: It's extremely helpful to subset data at the load_mutations() step; otherwise the entire database will be loaded into memory. This is what we've done here with the by and ids parameters, which specify the column to subset by and which values to load. Here we're getting only the donors used in the publication, identified by their ICGC donor IDs, since this is what's stored in the database. The donor_info file also contains the Primary Site column, which we'll use later as a classifier label.
Then encode the categorical features:
mutations.encode()
Hint: The last three commands can be combined using mutations.load_and_encode(). This function has all the same parameters as the individual functions.
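To make the encoding step more concrete, here is a minimal sketch of one-hot encoding in plain Python. It illustrates the general idea only and is not orchid's implementation; the function one_hot and the example category names are made up for illustration.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical categorical feature values for four mutations
consequences = ["missense", "synonymous", "missense", "intron"]
categories, encoded = one_hot(consequences)

print(categories)
for row in encoded:
    print(row)
```

Each original value becomes a row with a single 1 in the column for its category, which is the form most scikit-learn estimators expect for categorical inputs.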
As an option, you can collapse data by averaging over a set of column values. In this case, we wish to average mutation values for each patient (or donor in ICGC lingo). We use the same donor_ids that we loaded data with:
mutations.collapse(by='donor_id')
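Conceptually, collapsing by donor means grouping the per-mutation rows by donor and averaging each feature column within a group. The sketch below shows that idea in plain Python; it is not how orchid implements collapse(), and the function collapse_by_mean, the feature names, and the donor IDs are all hypothetical.

```python
from collections import defaultdict


def collapse_by_mean(rows, key):
    """Average every numeric column over the rows that share the same value of `key`."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)

    collapsed = {}
    for group, members in grouped.items():
        columns = [name for name in members[0] if name != key]
        collapsed[group] = {
            name: sum(member[name] for member in members) / len(members)
            for name in columns
        }
    return collapsed


# Hypothetical per-mutation feature values (three mutations, two donors)
rows = [
    {"donor_id": "D1", "feature_a": 0.25, "feature_b": 1.0},
    {"donor_id": "D1", "feature_a": 0.75, "feature_b": 3.0},
    {"donor_id": "D2", "feature_a": 0.5, "feature_b": 2.0},
]
per_donor = collapse_by_mean(rows, key="donor_id")
print(per_donor)
```

After collapsing, there is one row per donor rather than one per mutation, which matches the per-donor labels used for classification.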
In order to classify with a supervised learning model, we need correct labels for our data. From the metadata file we imported earlier, we have the Primary Site column, which is the tissue of the donor's tumor. We now tell orchid to use this column for learning.
# Create a DataFrame mapping donor_id to Primary Site
mapping = pd.DataFrame(donor_info[['donor_id', 'Primary Site']], columns=['donor_id', 'Primary Site'])
# Add the labels to the MutationMatrix
mutations.add_labels(mapping)
# Set the label column
mutations.set_label_column('Primary Site')
Now some feature selection can be done:
selected_features = mutations.select_features()
Finally, we tell orchid to just use the selected features from now on (we'll take the 20 best):
mutations.set_features(selected_features['Feature'][0:20])
Now we're ready to model. Modeling functions take several parameters and use 10-fold cross-validation by default. Feel free to access the help for any function, for example by typing help(MutationMatrix.random_forest). Let's model with 10-fold cross-validation using a random forest:
mutations.random_forest()
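For readers new to cross-validation, the following sketch shows how samples can be divided into 10 folds so that each sample is held out for testing exactly once. It is a simplified illustration, not orchid's internal procedure: it neither shuffles nor stratifies, and the function name k_fold_indices is hypothetical. scikit-learn's KFold and StratifiedKFold classes offer shuffled and stratified variants.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train, test) index lists; every sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [
        n_samples // n_folds + (1 if fold < n_samples % n_folds else 0)
        for fold in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(25, n_folds=10))
for train, test in folds:
    print(len(train), "training samples,", len(test), "test samples")
```

A model is fit on each training split and scored on the matching test split, and the per-fold scores are then summarized.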
Note: You can also model with a support vector machine using mutations.support_vector_machine() or with any number of the scikit-learn classifiers (see documentation).
Hint: mutations.rf() and mutations.svm() can be used as shortcuts.
There you have it, your first orchid model!
You probably want to save and restore your work at various points in a workflow. You can save just the model with save_model() or the entire MutationMatrix, including model, with save_matrix(). Here's an example of saving/restoring the MutationMatrix:
# Saving
mutations.save_matrix('models/tutorial_premodel.pkl')
# Restoring
from orchid_ml import load_matrix
mutations = load_matrix('models/tutorial_premodel.pkl')
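The .pkl extension suggests that the saved file is a Python pickle, though that is an assumption about orchid's internals. As a generic illustration of the save-and-restore round trip, here is a sketch using the standard pickle module on a stand-in dictionary rather than a real MutationMatrix.

```python
import os
import pickle
import tempfile

# Stand-in for a saved object (hypothetical; not a real MutationMatrix)
state = {"label_column": "Primary Site", "features": ["feature_a", "feature_b"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")

    # Saving: serialize the object to disk
    with open(path, "wb") as handle:
        pickle.dump(state, handle)

    # Restoring: read it back into a new object
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == state)
```

As with any pickle file, only load files from sources you trust, since unpickling can execute arbitrary code.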
A model isn't very good if you don't know how well it performed. By default, model functions report cross-validation results and sanity checks when they are run. But maybe you'd like to get class-specific ROC curves. That can be done like so:
mutations.show_roc_curves(show_folds=False)
Hint: Again, since there are several ways to plot, you may want to take a look at the help for this function: help(MutationMatrix.show_curves).
Hint: PR curves can also be generated with mutations.show_pr_curves().
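A class-specific ROC curve treats one class as positive and all others as negative, then sweeps a decision threshold over the predicted scores. The sketch below computes the curve points and the area under the curve for a single class in plain Python. It is illustrative only, not orchid's plotting code; the function names and the example scores are hypothetical, and it assumes no two samples share the same score.

```python
def roc_points(is_positive, scores):
    """Return (false positive rate, true positive rate) points, assuming distinct scores."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    # Lower the threshold one sample at a time, from highest to lowest score
    for _, label in sorted(zip(scores, is_positive), reverse=True):
        if label:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points


def area_under_curve(points):
    """Integrate the curve with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical one-vs-rest data: 1 = sample belongs to the class of interest
is_positive = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(is_positive, scores)
auc = area_under_curve(points)
print(round(auc, 3))
```

An area of 1.0 means the class is perfectly separated from the rest, while 0.5 is what random scoring would give.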
You should get a series of images like this:
OK. How about a confusion matrix?
import seaborn as sns
sns.set_context("talk")
mutations.show_confusion_matrix(cmap='RdPu')
We used a couple of tricks here. For one, since orchid uses seaborn for plotting, we can use any seaborn functionality, like setting the plot context. The sns.set_context("talk") command changes the default label sizes to be more appropriate for presentation slides. The second trick is specifying a different color map, or cmap, than the default. You can use any matplotlib colormap. Here we chose the 'RdPu' colors. The generated plot looks like this:
There are many types and ways to plot. Please see the examples for some ideas, or take a peek at the help functions or at the code.
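If you want to inspect the numbers behind a confusion matrix plot, the counts are simple to compute: each cell holds how many samples with a given true label received a given predicted label. The sketch below does this in plain Python. It is not orchid's implementation, and the function name and the example tissue labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Return (labels, matrix); rows are true labels, columns are predicted labels."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(true, predicted)] for predicted in labels] for true in labels]
    return labels, matrix


# Hypothetical true and predicted primary sites for five donors
true_sites = ["Breast", "Breast", "Liver", "Liver", "Skin"]
predicted_sites = ["Breast", "Liver", "Liver", "Liver", "Skin"]

labels, matrix = confusion_matrix(true_sites, predicted_sites)
print(labels)
for row in matrix:
    print(row)
```

Correct predictions fall on the diagonal, so off-diagonal cells show which tissues are confused with one another.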
Please refer to the IPython notebook Tissue of Origin Example in the /notebooks folder for this example code on building a tissue of origin classification model with additional visualization models. There is also a notebook for the actual code used for our paper, which is only slightly more sophisticated, and already outdated. ;)