
Learning

ccario83 edited this page Sep 30, 2017 · 30 revisions

Machine Learning with orchid-ml

Building a model

Machine learning can be performed in roughly ten lines of code and follows this general pattern:

  1. Specify access to the database generated by orchid-db with a SQL connection string.
  2. Load mutations and features either in their entirety or by a desired subset (e.g., by tumor).
  3. Encode categorical features using default or user-defined strategies (e.g., one-hot).
  4. Optionally collapse mutations by tumor (e.g., by averaging).
  5. Set a prediction label and select features.
  6. Model data with any of the scikit-learn machine learning algorithms.

Specify the database

This example uses the public multi25 database from our publication, but you can specify your own connection string for a locally populated database.

db_uri = "mysql://orchid:orchid@enterprise.ucsf.edu/multi25_20170710"
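The connection string follows the common scheme://user:password@host/database layout. As a quick sanity check before handing it to orchid-ml, you can split it into its parts with Python's standard library. This is only a sketch for Python 3; the helper name describe_db_uri is ours and is not part of orchid-ml.

```python
from urllib.parse import urlparse

def describe_db_uri(db_uri):
    """Split a SQL connection string into its parts (hypothetical helper)."""
    parsed = urlparse(db_uri)
    return {
        "dialect": parsed.scheme,
        "user": parsed.username,
        "host": parsed.hostname,
        "database": parsed.path.lstrip("/"),
    }

db_uri = "mysql://orchid:orchid@enterprise.ucsf.edu/multi25_20170710"
parts = describe_db_uri(db_uri)
print(parts["host"], parts["database"])
```

If the host or database name printed here is not what you expect, fix the string before creating the MutationMatrix.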

Load mutations and features

First, initialize the MutationMatrix using the database string just specified:

mutations = MutationMatrix(db_uri=db_uri)

To use hg38 build coordinates, request the hg38 table:

mutations.mutation_table = "ssm_hg38"

Hint: This step is not required when using hg19, as that table is used by default.

Then load mutations and features:

mutations.load_mutations()
mutations.load_features()

Encode categorical features

mutations.encode()

Hint: The last three commands can be combined using the command mutations.load_and_encode(), so these last three steps only count as one towards the promised ten lines of code. ;)
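To make the one-hot strategy concrete, here is a minimal pure-Python sketch of what such an encoding produces. It does not reproduce what mutations.encode() does internally; the function one_hot and the example category values are made up for illustration.

```python
def one_hot(values):
    """Return (categories, rows); each row has a single 1 in its category's column."""
    categories = sorted(set(values))
    column = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        rows.append(row)
    return categories, rows

# Made-up categorical feature values, one per mutation.
consequences = ["missense", "synonymous", "missense", "stop_gained"]
categories, encoded = one_hot(consequences)

print(categories)
for row in encoded:
    print(row)
```

Each categorical column becomes one numeric column per category, which is the form scikit-learn classifiers expect.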


Collapse mutations

Optionally, you can collapse data by averaging over a set of column values. In this case, we may wish to average mutation values for each patient ('donor' in ICGC lingo):

mutations.collapse(by='donor_id')
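Conceptually, collapsing by donor_id groups the mutation rows that share a donor and averages each feature within the group. The sketch below shows that idea in plain Python; collapse_by and the feature names are hypothetical, and orchid-ml's own collapse() may handle details (such as non-numeric columns) differently.

```python
from collections import defaultdict

def collapse_by(rows, key, value_columns):
    """Average value_columns over all rows that share the same key value."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return {
        group: {
            column: sum(member[column] for member in members) / len(members)
            for column in value_columns
        }
        for group, members in grouped.items()
    }

# Made-up feature values: two mutations for donor D1, one for donor D2.
rows = [
    {"donor_id": "D1", "conservation": 0.25, "gc_content": 0.5},
    {"donor_id": "D1", "conservation": 0.75, "gc_content": 0.25},
    {"donor_id": "D2", "conservation": 1.0, "gc_content": 0.5},
]
per_donor = collapse_by(rows, "donor_id", ["conservation", "gc_content"])
print(per_donor)
```

After collapsing, each donor contributes exactly one row, so the model predicts per tumor rather than per mutation.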

Set a prediction label and select features

You may have a file that contains metadata for mutations. You can load this information into Python and merge it with the MutationMatrix based on mutation_id or some other unique identifier. Then you can use one of the metadata columns as the prediction label for modeling.

mutations.import_metadata('donor_metadata.tsv')
mutations.set_label_column('Primary Site')
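The merge described above is a join on a shared identifier. Here is a small stand-alone sketch of that idea using only the standard library. The file contents, the donor_id join key, and the feature name are invented for the example, and this is not how import_metadata() is implemented.

```python
import csv
import io

# Stand-in for the contents of a tab-separated metadata file.
metadata_tsv = "donor_id\tPrimary Site\nD1\tBreast\nD2\tLiver\n"
reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
site_by_donor = {row["donor_id"]: row["Primary Site"] for row in reader}

# Stand-in for the (collapsed) mutation matrix, one row per donor.
matrix = [
    {"donor_id": "D1", "conservation": 0.5},
    {"donor_id": "D2", "conservation": 1.0},
]

# Join the metadata onto each row using the shared identifier.
merged = [{**row, "Primary Site": site_by_donor[row["donor_id"]]} for row in matrix]

# The chosen metadata column becomes the prediction label.
labels = [row["Primary Site"] for row in merged]
print(labels)
```

Whatever identifier you merge on, it must be present in both the metadata file and the matrix; otherwise rows end up without a label.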

Now some feature selection can be done:

selected_features = mutations.set_features()
mutations.set_features(selected_features)

And finally, make a model

Now we're ready to model. Modeling functions take several parameters and use 10-fold cross-validation by default. To see the options for any function, use Python's built-in help, for example help(MutationMatrix.random_forest). Let's fit a random forest with the default 10-fold cross-validation:

mutations.random_forest()
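If you have not used k-fold cross-validation before, the sketch below shows how 10 folds partition the samples: each sample lands in the test set exactly once and in the training set for the other nine rounds. This is plain, unstratified k-fold written for illustration; k_fold_indices is a hypothetical helper, and orchid-ml may shuffle or stratify its folds differently.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(25, n_folds=10))
for train_idx, test_idx in folds:
    print(len(train_idx), "train /", len(test_idx), "test")
```

The reported cross-validation score is then a summary of the model's performance over the held-out folds.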

Note: You can also model with a support vector machine using mutations.support_vector_machine(), or with any of the other scikit-learn classifiers (see the documentation).

Hint: mutations.rf() and mutations.svm() can be used as shortcuts.

There you have it, ten lines of code (or twelve if you don't take the data loading shortcut).

Saving...

One last thing: you will probably want to save and restore your work. You can save the entire MutationMatrix with save_matrix(), or just the model with save_model(). Here's an example of saving and restoring the MutationMatrix (which includes the model):

mutations.save_matrix("example.pkl")
mutations = load_matrix("example.pkl")
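The .pkl extension suggests the matrix is serialized with Python's pickle module, although that is an assumption on our part rather than something stated here. Under that assumption, a save and restore round trip looks like the sketch below, where a plain dictionary stands in for the MutationMatrix.

```python
import os
import pickle
import tempfile

# Stand-in object; a real MutationMatrix would carry the data and the model.
state = {"label_column": "Primary Site", "features": ["conservation", "gc_content"]}

path = os.path.join(tempfile.mkdtemp(), "example.pkl")

# Save: pickle files must be opened in binary mode.
with open(path, "wb") as handle:
    pickle.dump(state, handle)

# Restore: the loaded object is an equal copy of what was saved.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == state)
```

As with any pickle file, only load files you created yourself or otherwise trust, since unpickling can execute arbitrary code.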

Visualizing the model

You may want to assess the performance of your model. By default, the modeling functions report cross-validation results and sanity checks. To get class-specific ROC curves as well, call:

mutations.show_roc_curves(show_folds=False)

Hint: Again, since there are several ways to plot, you may want to take a look at the help for this function: help(MutationMatrix.show_roc_curves)


You should get a series of images like this: [image: example ROC curve]
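A class-specific ROC curve treats one class as "positive" and all others as "negative", then traces the true positive rate against the false positive rate as the decision threshold is lowered. The sketch below computes those points for a single class from made-up scores. roc_points is a hypothetical helper that ignores tied scores for simplicity, and it does not reproduce orchid-ml's plotting.

```python
def roc_points(scores, is_positive):
    """Return (fpr, tpr) points from sweeping a threshold over descending scores."""
    ranked = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    for _, positive in ranked:
        if positive:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points

# Made-up classifier scores for one class, and whether each sample is in it.
scores = [0.9, 0.8, 0.4, 0.3]
is_positive = [True, False, True, False]
points = roc_points(scores, is_positive)
print(points)
```

A curve that hugs the top-left corner indicates that the class is well separated from the rest; a curve near the diagonal indicates performance close to chance.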

OK. How about a confusion matrix?

import seaborn as sns

sns.set_context("talk")
mutations.show_confusion_matrix(cmap='RdPu')

We used a couple of tricks here. First, since orchid uses seaborn for plotting, we can use any seaborn functionality, such as setting the plot context. The sns.set_context("talk") command changes the default label sizes to suit presentation slides. Second, we specified a color map (cmap) other than the default. You can use any matplotlib colormap; here we chose 'RdPu'. The generated plot looks like this: [image: example confusion matrix]
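For readers unfamiliar with confusion matrices, the plot is a heat map of a simple count table: rows are true classes, columns are predicted classes, and each cell counts how often that pair occurred. The sketch below builds such a table from made-up labels in plain Python. The confusion_matrix function here is our own illustration, not the orchid-ml or scikit-learn implementation.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Return (classes, matrix) with rows = true class, columns = predicted class."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(true, pred)] for pred in classes] for true in classes]
    return classes, matrix

# Made-up tissue labels for five tumors.
true_sites = ["Breast", "Breast", "Liver", "Liver", "Skin"]
predicted_sites = ["Breast", "Liver", "Liver", "Liver", "Skin"]

classes, matrix = confusion_matrix(true_sites, predicted_sites)
print(classes)
for row in matrix:
    print(row)
```

Correct predictions fall on the diagonal, so off-diagonal cells show which tissues the model confuses with one another.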

There are many types and ways to plot. Please see the example for some ideas, or take a peek at the help functions or code.

Example

Please refer to the IPython notebook Tissue of Origin Example in the /notebooks folder for a step-by-step example of building a tissue-of-origin classification model with orchid-ml, along with examples of all the visualizations.
