Hello! This is my second data science project. Here, I build machine learning models to predict whether a breast cancer is benign or malignant based on the available features. I apply what I have learned so far: standard data cleaning, EDA, classification models, hyperparameter tuning, and dimensionality reduction with PCA.
This repo contains this README, a Python notebook called `breast_cancer.ipynb`, slides in PDF format ('Breast Cancer Prediction.pdf') that highlight the most important aspects of this project, and the dataset `data.csv` in the `data` directory.
I use the Breast Cancer Wisconsin dataset obtained from Kaggle. This dataset is also one of the toy datasets readily available in the scikit-learn package. The content of this dataset is as follows:
Attribute columns:
`id` - the patient's ID number
`diagnosis` - (M = malignant, B = benign)
Ten real-valued features directly computed from images:
a) `radius` - mean of distances from center to points on the perimeter
b) `texture` - standard deviation of gray-scale values
c) `perimeter`
d) `area`
e) `smoothness` - local variation in radius lengths
f) `compactness` - computed as perimeter^2 / area - 1.0
g) `concavity` - severity of concave portions of the contour
h) `concave points` - number of concave portions of the contour
i) `symmetry`
j) `fractal_dimension` - computed as "coastline approximation" - 1
In addition, each of these ten real-valued features has three measurements: the mean value, standard error, and "worst" or largest (mean of the three largest values), resulting in 30 features. For instance, field 3 is Mean Radius (`radius_mean`), field 13 is Radius SE (`radius_se`), and field 23 is Worst Radius (`radius_worst`).
All feature values are recorded with four significant digits.
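Since the steps below build on this data, here is a minimal loading sketch. It assumes the Kaggle column names (`id`, and `diagnosis` with the M/B coding) and the file layout described above:

```python
import pandas as pd

# Load the Kaggle CSV (assumed columns: 'id', 'diagnosis', plus the 30 features above)
df = pd.read_csv("data/data.csv")
df = df.dropna(axis=1, how="all")  # the Kaggle export may carry a trailing empty column

# Encode the target: malignant = 1 (the class we want to recall), benign = 0
y = (df["diagnosis"] == "M").astype(int)
X = df.drop(columns=["id", "diagnosis"])

print(X.shape)  # (569, 30)

# The same data also ships with scikit-learn
# (note: there, malignant is encoded as 0 and benign as 1):
# from sklearn.datasets import load_breast_cancer
# X, y = load_breast_cancer(return_X_y=True, as_frame=True)
```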
I use the point-biserial correlation to pick the stronger predictors, i.e. the `_worst` features. Since the two classes are linearly separable, I use logistic regression and SVM as the classifiers. To get the optimal results, I employ grid search to obtain the best hyperparameters for each model. Next, I use PCA to reduce the dimensionality of the data to see if the results can be improved. Rough sketches of these steps are given below.
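A sketch of the feature-selection step, reusing `X` and `y` from the loading snippet above; `pointbiserialr` comes from `scipy`, which is installed alongside `sklearn` as a dependency:

```python
from scipy.stats import pointbiserialr

# Point-biserial correlation between the binary diagnosis and each feature;
# per the analysis in the notebook, the _worst variants correlate strongest
corr = {col: pointbiserialr(y, X[col])[0] for col in X.columns}
for col, r in sorted(corr.items(), key=lambda kv: -abs(kv[1]))[:10]:
    print(f"{col}: {r:+.3f}")
```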
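The tuning step might look like the following sketch; the train/test split, the grids, and the `recall` scoring are illustrative assumptions, not the notebook's exact settings:

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Keep the "_worst" columns selected by the correlation step above
X_sel = X.filter(like="_worst")
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.2, stratify=y, random_state=42
)

# Scale the features, then grid-search the hyperparameters, optimizing recall
candidates = {
    "logreg": (LogisticRegression(max_iter=5000), {"clf__C": [0.01, 0.1, 1, 10, 100]}),
    "svm": (SVC(), {"clf__C": [0.1, 1, 10, 100], "clf__kernel": ["linear", "rbf"]}),
}
searches = {}
for name, (clf, grid) in candidates.items():
    pipe = Pipeline([("scale", StandardScaler()), ("clf", clf)])
    search = GridSearchCV(pipe, grid, scoring="recall", cv=5)
    search.fit(X_train, y_train)
    searches[name] = search
    print(name, search.best_params_, round(search.best_score_, 3))
```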
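And the PCA variant, continuing from the previous sketch (`X_train`, `y_train`); the component counts in the grid are illustrative:

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Same pipeline with a PCA step inserted between scaling and the classifier
pca_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=5000)),
])
pca_grid = {"pca__n_components": [2, 3, 5, 10], "clf__C": [0.1, 1, 10]}
pca_search = GridSearchCV(pca_pipe, pca_grid, scoring="recall", cv=5)
pca_search.fit(X_train, y_train)
print(pca_search.best_params_, round(pca_search.best_score_, 3))
```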
To run the .ipynb file, the standard `numpy`, `pandas`, `matplotlib`, `seaborn`, and `sklearn` packages are required.
Both logistic regression and SVM perform extremely well on this dataset, with a recall score of 98% for logistic regression and a slightly lower 97% for SVM. Reducing the dimensions with PCA yields no significant improvement in the recall scores, but it makes feature selection very efficient and easy.
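For completeness, a sketch of how test-set recall can be computed for the tuned models from the sketches above (the exact numbers depend on the split and the grids used):

```python
from sklearn.metrics import recall_score

# Evaluate the tuned models on the held-out test set
for name, search in searches.items():
    y_pred = search.best_estimator_.predict(X_test)
    print(name, "test recall:", round(recall_score(y_test, y_pred), 3))
```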