Learning
Machine learning can be performed in under 10 lines of code, and follows the general pattern:
- Specify access to the database generated by orchid-db with a SQL connection string.
- Load mutations and features either in their entirety or by a desired subset (e.g., by tumor).
- Encode categorical features using default or user-defined strategies (e.g., one-hot).
- Optionally collapse mutations by tumor (e.g., by averaging).
- Set a prediction label and select features.
- Model data with any of the scikit-learn machine learning algorithms.
The example below uses the public multi25 database used in our publication, but you can specify your own connection string for locally populated databases.
db_uri = "mysql://orchid:orchid@enterprise.ucsf.edu/multi25_20170710"
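If you are writing a connection string for your own database, it may help to see which part is which. The following is a small sketch using only the Python standard library to split the URI above into its pieces; the dictionary keys are our own labels for illustration, not anything orchid defines.

```python
from urllib.parse import urlsplit

# Same URI as in the text; the parsing uses only the standard library.
db_uri = "mysql://orchid:orchid@enterprise.ucsf.edu/multi25_20170710"

parts = urlsplit(db_uri)
connection = {
    "dialect": parts.scheme,              # database type
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "database": parts.path.lstrip("/"),   # database name follows the host
}

for key, value in connection.items():
    print(f"{key}: {value}")
```

For a locally populated database you would swap in your own user, password, host, and database name in the same positions.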
First, initialize the MutationMatrix using the database string just specified:
mutations = MutationMatrix(db_uri=db_uri)
To use hg38 build coordinates, request the hg38 table:
mutations.mutation_table = "ssm_hg38"
Hint: This step is not required when using hg19, as that table is used by default.
Then load mutations and features:
mutations.load_mutations()
mutations.load_features()
mutations.encode()
Hint: The last three commands can be combined into a single call, mutations.load_and_encode(), so these three steps only count as one towards the promised ten lines of code. ;)
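To make the encoding step more concrete, here is a minimal plain-Python sketch of what one-hot encoding does to a categorical feature. This is not orchid's implementation; the one_hot helper, the example rows, and the consequence column are all made up for illustration.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical mutations with one numeric and one categorical feature.
rows = [
    {"mutation_id": "m1", "score": 0.9, "consequence": "missense"},
    {"mutation_id": "m2", "score": 0.1, "consequence": "synonymous"},
    {"mutation_id": "m3", "score": 0.5, "consequence": "missense"},
]

encoded = one_hot(rows, "consequence")
for row in encoded:
    print(row)
```

Each category becomes its own numeric column, which is the form most scikit-learn estimators expect.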
Optionally, you can collapse data by averaging over a set of column values. In this case, we may wish to average mutation values for each patient ('donor' in ICGC lingo):
mutations.collapse(by='donor_id')
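Conceptually, collapsing by averaging groups rows by a key and takes the mean of each feature within a group. The sketch below shows the idea in plain Python; the collapse function here is our own stand-in with hypothetical feature names, and it makes no claim about how orchid handles non-numeric columns or missing values.

```python
from collections import defaultdict
from statistics import mean


def collapse(rows, by, features):
    """Average each feature over all rows that share the same value of `by`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    return {
        key: {f: mean(r[f] for r in members) for f in features}
        for key, members in groups.items()
    }


# Hypothetical per-mutation feature values for two donors.
rows = [
    {"donor_id": "D1", "score_a": 0.25, "score_b": 1.0},
    {"donor_id": "D1", "score_a": 0.75, "score_b": 3.0},
    {"donor_id": "D2", "score_a": 1.0, "score_b": 5.0},
]

per_donor = collapse(rows, by="donor_id", features=["score_a", "score_b"])
for donor, values in per_donor.items():
    print(donor, values)
```

After collapsing, there is one row per donor rather than one row per mutation, so the prediction label needs to be a per-donor property as well.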
You may have a file that contains metadata for mutations. You can load this information into Python and merge it with the MutationMatrix based on mutation_id or some other unique identifier. Then you can use one of these columns as the prediction label for modeling.
mutations.import_metadata('donor_metadata.tsv')
mutations.set_label_column('Primary Site')
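The merge itself is an ordinary join on a shared identifier. Here is a rough standard-library sketch of the idea, assuming a tab-separated file keyed by donor_id; the file contents and feature values are invented, and whether orchid drops or keeps rows that have no matching metadata is not something this sketch tells you.

```python
import csv
import io

# Hypothetical contents of a tab-separated metadata file keyed by donor_id.
metadata_tsv = "donor_id\tPrimary Site\nD1\tBreast\nD2\tLiver\n"

# Hypothetical collapsed feature rows, one per donor.
features = [
    {"donor_id": "D1", "score_a": 0.5},
    {"donor_id": "D2", "score_a": 1.0},
    {"donor_id": "D3", "score_a": 0.1},  # no metadata for this donor
]

# Index the metadata rows by their identifier.
metadata = {
    row["donor_id"]: row
    for row in csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
}

# Inner join: a row without metadata has no label, so this sketch drops it.
merged = [
    {**row, **metadata[row["donor_id"]]}
    for row in features
    if row["donor_id"] in metadata
]

# The chosen metadata column becomes the prediction label.
labels = [row["Primary Site"] for row in merged]
print(labels)
```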
Now some feature selection can be done:
selected_features = mutations.set_features()
mutations.set_features(selected_features)
Now we're ready to model. Modeling functions take several parameters and use 10-fold cross-validation by default. Feel free to access the help for any function, for example by typing help(MutationMatrix.random_forest). Let's model with 10-fold cross-validation using a random forest:
mutations.random_forest()
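If 10-fold cross-validation is new to you: the samples are split into ten parts, and each part takes one turn as the held-out test set while the model trains on the other nine. The sketch below shows only the splitting, in plain Python with a helper name of our own choosing. It uses simple contiguous folds; how orchid and scikit-learn actually assign samples to folds (for example shuffling or stratifying by class) is not covered here.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
for number, (train, test) in enumerate(folds, start=1):
    print(f"fold {number}: train on {len(train)} samples, test on {test}")
```

Every sample is tested exactly once, so the reported score reflects performance on data the model did not see during training in that fold.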
Note: You can also model with a support vector machine using mutations.support_vector_machine(), or with any number of the scikit-learn classifiers (see documentation).
Hint: mutations.rf() and mutations.svm() can be used as shortcuts.
There you have it, ten lines of code (or twelve if you don't take the data loading shortcut).
One last thing: you probably want to save and restore your work. You can save the entire MutationMatrix with save_matrix() or just the model with save_model(). Here's an example of saving and restoring the MutationMatrix (which includes the model):
mutations.save_matrix("example.pkl")
mutations = load_matrix("example.pkl")
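The .pkl extension suggests Python pickling, although that is an assumption on our part and not a statement about how save_matrix() works internally. For readers who have not used pickle before, a generic save-and-restore round trip looks roughly like this, with a plain dictionary standing in for the data and model:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for an object that holds both data and a fitted model.
state = {
    "data": [[0.5, 2.0], [1.0, 5.0]],
    "model": {"kind": "toy", "threshold": 0.7},
}

# Write to a temporary directory so the sketch does not touch your files.
path = os.path.join(tempfile.mkdtemp(), "example.pkl")

with open(path, "wb") as handle:
    pickle.dump(state, handle)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == state)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.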
You may want to assess the performance of your model. By default, the model functions will report cross-validation results and sanity checking. But maybe you'd like to get class-specific ROC curves. That can be done like so:
mutations.show_roc_curves(show_folds=False)
Hint: Again, since there are several ways to plot, you may want to take a look at the help for this function: help(MutationMatrix.show_roc_curves)
You should get a series of ROC curve images.
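As background, a class-specific ROC curve treats one class as "positive" and everything else as "negative", then traces the true positive rate against the false positive rate as the decision threshold is lowered. Below is a minimal plain-Python sketch of that calculation for a single class. The labels and scores are invented, the helper names are ours, and the code assumes no two samples share the same score (ties need to be grouped in a full implementation).

```python
def roc_points(y_true, scores, positive):
    """Return (FPR, TPR) pairs for one class treated as positive vs. the rest."""
    is_pos = [label == positive for label in y_true]
    n_pos = sum(is_pos)
    n_neg = len(is_pos) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one sample at a time, from highest score to lowest.
    for _, sample_is_pos in sorted(zip(scores, is_pos), key=lambda pair: -pair[0]):
        if sample_is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


y_true = ["Breast", "Liver", "Breast", "Lung", "Liver", "Breast"]
breast_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3]  # hypothetical P(Breast) per sample

points = roc_points(y_true, breast_scores, positive="Breast")
print(points)
print(auc(points))
```

Repeating this for each class in turn gives one curve per class.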
OK. How about a confusion matrix?
import seaborn as sns
sns.set_context("talk")
mutations.show_confusion_matrix(cmap='RdPu')
We used a couple of tricks here. For one, since orchid uses seaborn for plotting, we can use any seaborn functionality, like setting the plot context. The sns.set_context("talk") command changes default label sizes to be more appropriate for presentation slides. The second trick is specifying a different color map, or cmap, than the default. You can use any matplotlib colormap. Here we chose the 'RdPu' colors. The generated plot looks like this:
There are many types and ways to plot. Please see the example for some ideas, or take a peek at the help functions or code.
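For anyone unfamiliar with the confusion matrix plotted above, it is simply a table of counts with true classes as rows and predicted classes as columns. Here is a small plain-Python sketch of how those counts come about; the labels and predictions are invented and the helper is ours, independent of how orchid computes or normalizes its matrix.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count (true, predicted) pairs; rows are true classes, columns predicted."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


# Hypothetical true and predicted tissue labels for six donors.
y_true = ["Breast", "Breast", "Liver", "Liver", "Lung", "Lung"]
y_pred = ["Breast", "Liver", "Liver", "Liver", "Lung", "Breast"]

labels, matrix = confusion_matrix(y_true, y_pred)
for label, row in zip(labels, matrix):
    print(f"{label:>6}", row)
```

Counts on the diagonal are correct predictions; anything off the diagonal shows which classes are being confused with which.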
Please refer to the IPython notebook Tissue of Origin Example in the /notebooks folder for a step-by-step example of building a tissue of origin classification model using orchid-ml, along with examples of all the visualizations.