Prediction on Cancer Cells

Abstract

For this machine learning project, we were tasked with training a binary classifier to identify whether a given cancer cell can survive in a low-oxygen environment (hypoxia) or needs oxygen to thrive (normoxia). We used data from an experiment that sequenced RNA from various breast cancer cells. Some cells came from a cell line kept in a low-oxygen environment (~1%), while the others came from a cell line exposed to normal levels of oxygen. A further aim of the classifier is to identify which genes (detected in the RNA) can be associated with the ability to survive in a low-oxygen environment. Intuitively, if a gene is highly expressed in cells from the hypoxia batch and barely expressed in the normoxia batch, this may indicate that the gene helps cancer cells survive even with very limited oxygen. From a medical point of view, this could help determine whether a certain cancer cell needs to be near arteries or can multiply even without a direct source of oxygen.

Data

We were given data obtained with two sequencing techniques, Smart-Seq and Drop-Seq. The cell lines included in the datasets are MCF7 and HCC1806. The features are the genes that were detected when sequencing RNA from the cells. The data was provided to us by Bocconi University and is not included in the repository.

Overview and Results

In this section we go over the main results and approaches used in the analysis of the SmartSeq dataset (the same techniques were applied to the DropSeq dataset).

Data Analysis

The first step was an exploratory analysis of the genes and cells in the dataset: we looked for recurring values, duplicate data points and highly correlated features, and summarized the main characteristics of the data.

First, we looked at the distribution of gene expression for a few selected genes.
The resulting plot (describing gene expression in the normalized dataset) shows a consistent trend: for most genes, low expression values are by far the most frequent, while a few genes are spread more evenly over a wider range of expression levels.

We also looked at the correlation between genes through a correlation heatmap, which shows that several pairs of genes are strongly correlated and very few are actually uncorrelated. A more in-depth analysis showed that the low-correlation genes are often characterized by a much larger share of zero entries (~95%) than the average gene (~60%).
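
The relationship between sparsity and correlation can be checked with a few lines of NumPy. The sketch below is only an illustration on synthetic data: the matrix `expr`, its size and the way the sparse gene is generated are assumptions, not the project's data or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the expression matrix (assumption): rows are cells,
# columns are genes. Genes 0-4 share a per-cell activity level, so they are
# correlated with each other; gene 5 is independent and mostly zero.
n_cells, n_genes = 200, 6
activity = rng.gamma(shape=2.0, scale=2.0, size=n_cells)
scales = np.arange(1, 6)
expr = np.empty((n_cells, n_genes))
expr[:, :5] = rng.poisson(lam=activity[:, None] * scales)
expr[:, 5] = rng.poisson(lam=5.0, size=n_cells) * (rng.random(n_cells) > 0.95)

# Fraction of zero entries per gene.
zero_fraction = (expr == 0).mean(axis=0)

# Gene-gene Pearson correlation (rowvar=False: columns are the variables).
corr = np.corrcoef(expr, rowvar=False)

# Mean absolute correlation of each gene with all the other genes.
off_diagonal = np.abs(corr) - np.eye(n_genes)
mean_abs_corr = off_diagonal.sum(axis=1) / (n_genes - 1)

for gene in range(n_genes):
    print(f"gene {gene}: zeros={zero_fraction[gene]:.2f}, "
          f"mean |corr|={mean_abs_corr[gene]:.2f}")
```

On real data, a gene with no variation at all (for example, all zeros) would produce NaN correlations and would need to be filtered out first.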

Finally, we looked at the genes showing a strong difference between the two experimental groups (normoxia and hypoxia), which can serve as a sanity check for the models later on (e.g. Logistic Regression and Random Forests).
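
One simple way to rank genes by how strongly they differ between the two groups is the difference in mean log-expression. This is a minimal sketch on synthetic data, not necessarily the criterion used in the notebooks; the label encoding (1 = hypoxia, 0 = normoxia) and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic counts (assumption): rows are cells, columns are genes.
n_cells, n_genes = 120, 50
labels = rng.integers(0, 2, size=n_cells)  # assumed encoding: 1 = hypoxia, 0 = normoxia
expr = rng.poisson(4.0, size=(n_cells, n_genes)).astype(float)

# Make gene 0 hypoxia-responsive so the ranking has something to find.
n_hypoxia = int((labels == 1).sum())
expr[labels == 1, 0] += rng.poisson(20.0, size=n_hypoxia)

# log1p damps the effect of a few very large counts.
log_expr = np.log1p(expr)

# Per-gene difference of group means (hypoxia minus normoxia).
mean_diff = log_expr[labels == 1].mean(axis=0) - log_expr[labels == 0].mean(axis=0)

# Genes sorted by absolute difference, largest first.
ranking = np.argsort(-np.abs(mean_diff))
top_genes = ranking[:5]
print("top genes:", top_genes)
print("differences:", np.round(mean_diff[top_genes], 2))
```

Genes that top this ranking would be expected to receive large coefficients or feature importances in the fitted models, which is what makes the list useful as a sanity check.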

Dimensionality Reduction

Before delving into the more common ML approaches and models, we performed dimensionality reduction to see whether the two groups of data form distinct clusters in a low-dimensional space.
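
As an illustration of this step, the sketch below projects synthetic data onto two principal components with scikit-learn and measures how far apart the two groups end up along the first component. PCA is used here only as a familiar example; the source does not state which unsupervised methods were applied, and the data and names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic data (assumption): two groups of cells that differ on 5 of 40 genes.
n_cells, n_genes = 100, 40
labels = np.repeat([0, 1], n_cells // 2)
X = rng.normal(size=(n_cells, n_genes))
X[labels == 1, :5] += 3.0

# Standardize each gene, then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
embedding = pca.fit_transform(X_scaled)

# Distance between the group means along the first component.
pc1_gap = abs(embedding[labels == 1, 0].mean() - embedding[labels == 0, 0].mean())
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("gap between groups on PC1:", round(float(pc1_gap), 2))
```

The labels are used only to inspect the result, not to fit the projection.
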
In addition, in this improved repository I implemented some "supervised" dimensionality reduction approaches, which use the known labels to obtain low-dimensional embeddings that encode additional information for distinguishing the clusters.

This approach also allows prediction on unseen data, by applying methods such as k-NN to the low-dimensional embedding learned through the supervised approach (k-NN tends to perform poorly in high dimensions).
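
A minimal sketch of the idea, assuming Neighborhood Components Analysis (NCA) from scikit-learn as the supervised embedding. The source does not name the method actually used, so NCA is a stand-in, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic data (assumption): two groups of cells that differ on 5 of 40 genes.
n_cells, n_genes = 200, 40
y = np.repeat([0, 1], n_cells // 2)
X = rng.normal(size=(n_cells, n_genes))
X[y == 1, :5] += 2.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Learn a 2D embedding from the labelled training cells, then classify with
# k-NN inside that embedding rather than in the original gene space.
model = make_pipeline(
    StandardScaler(),
    NeighborhoodComponentsAnalysis(n_components=2, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
# model[:-1] is the scaler plus the learned projection, without the classifier.
embedding_test = model[:-1].transform(X_test)
print("test accuracy:", test_accuracy)
print("embedding shape:", embedding_test.shape)
```

Fitting the embedding on the training split only keeps the held-out accuracy an honest estimate, since the projection itself is learned from labels.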

As shown in the Results section, this approach proved very effective, achieving accuracy comparable to that of the other approaches, such as clustering and SVM.

Models Implemented

For this binary classification task, we implemented several basic ML approaches, ranging from clustering (KMeans, GMM, DBSCAN and Spectral Clustering) to SVM (both in 2D, as in the image below, and in high dimensions), Logistic Regression, Random Forests and MLPs.
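
The sketch below shows one way such a comparison could be set up with scikit-learn: cross-validated accuracy for the supervised models, and the adjusted Rand index for a clustering method, whose cluster ids have no fixed correspondence to the class labels. The hyperparameters and the synthetic data are assumptions, not the settings used in the notebooks.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Synthetic data (assumption): two groups of cells that differ on 5 of 30 genes.
n_cells, n_genes = 200, 30
y = np.repeat([0, 1], n_cells // 2)
X = rng.normal(size=(n_cells, n_genes))
X[y == 1, :5] += 2.0

# Illustrative hyperparameters, not tuned.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm_rbf": SVC(kernel="rbf"),
    "mlp": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}

# 5-fold cross-validated accuracy; scaling is refit inside every fold.
cv_accuracy = {}
for name, model in models.items():
    pipeline = make_pipeline(StandardScaler(), model)
    cv_accuracy[name] = cross_val_score(pipeline, X, y, cv=5).mean()
    print(f"{name}: {cv_accuracy[name]:.3f}")

# Clustering never sees the labels; the adjusted Rand index compares the
# resulting partition with the true groups regardless of cluster numbering.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(StandardScaler().fit_transform(X))
ari = adjusted_rand_score(y, cluster_ids)
print(f"kmeans adjusted Rand index: {ari:.3f}")
```

Wrapping the scaler and the model in one pipeline avoids leaking statistics from the validation folds into training.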

Results

Most of the implemented methods proved effective at classifying the SmartSeq data, as can be seen in the plot below, which shows Logistic Regression, MLP and Supervised Dimensionality Reduction.

Structure

The original code for the project can be found in the BAIology notebook, which also contains a descriptive overview of the results submitted for the group project in the Machine Learning course of the Bachelor's programme in Mathematical and Computing Sciences for Artificial Intelligence, Bocconi University, Spring 2023.
The revisited code is split across utility files in the Utils folder, which are then applied to the data from the two sequencing techniques in the notebooks DropSeq.ipynb and SmartSeq.ipynb.

Acknowledgements

The original project was completed with Mattia Barbiere, Fabio Cantatore, Keshav Ganesh and Elena Kybett Vinci.
