# -*- coding: utf-8 -*-
"""CMSC 320 Final Project
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1iToEuz2W1kWgskdMnLkCh0uPplLGBA4X
# **Predictive Modeling for Breast Cancer Risk**
#### Authors: Sanjana Vellanki, Aasritha Sanikommu
## **Introduction**
Breast cancer is one of the most common cancers in women today. About 1 in 8 women will develop breast cancer in her lifetime, which amounts to roughly a 13% lifetime risk for each woman. Over the years, predictive modeling for breast cancer detection and risk analysis has proven to aid immensely in early prognosis.
Current screening guidelines strongly recommend biennial screenings for women who are between the ages of 50 and 74. However, because nearly 30% of breast cancer patients have shown recurrence of cancerous tissue after multiple follow-ups, faith in the accuracy of screenings has decreased. Accurately predicting breast cancer risk can encourage high-risk women to get screened who otherwise would not have, and can also promote adherence to screening guidelines. Accurate breast cancer risk prediction can also improve screening methods for early detection of recurrence in patients, and aid in identifying new biomarkers.
In this project, we will look at various factors in patient data that have influenced high risk of mortality from breast cancer. We will also build a model that can identify factors in patient cancer screenings that are linked to high mortality rates.
We recognize that we are not clinical professionals, and that our analysis may not provide statistically significant risk predictions. In clinical research models, scientists and researchers focus on more in-depth analysis utilizing gene expression found in cancerous tumors, which is currently out of our scope. However, we believe that preliminary models such as the one we have built can spread awareness of notable biomarkers which may increase breast cancer risk, and also encourage women to get screened more frequently.
We will be hitting the following points throughout the tutorial:
1. Data Collection
2. Data Cleaning & Preprocessing
3. Exploratory Visualizations
4. Hypothesis Testing & Model Building
5. Final Conclusions and Insights
## **Data Collection**
When sourcing our data, we wanted to look into two types of data sets:
1. Data which describes patient demographics (age, race, location, etc.)
2. Data which describes diagnostic measurements of each patient's tumor.
For both data sets, we wanted to look at which patients faced, or had higher chances of, death.
Our first data set was collected by the [SEER Program of the NCI (National Cancer Institute)](https://ieee-dataport.org/open-access/seer-breast-cancer-data). The data was published in November 2017, and provides patient demographic data (along with mortality status).
Our second data set was created by professors from the University of Wisconsin, and can be found at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29). This data set includes various attributes/descriptors of patients' tumors, and also includes a diagnosis (denoted as M for malignant or B for benign).
According to the NCI, patients with malignant tumors are 95% likely to face death within 1 to 5 years of diagnosis. Because of this, we decided to use tumor malignancy as our proxy for a patient's likelihood of mortality.
Both data sets come in the form of a .csv file.
The following imports were used to aid in the process of our final project.
"""
import pandas as pd
import seaborn as sns
import missingno
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import pearsonr
from scipy.stats import norm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from mpl_toolkits.mplot3d import Axes3D
import statsmodels.api as sm
import statsmodels.formula.api as smf
plt.style.use('fivethirtyeight')
sns.set_style("white")
plt.rcParams['figure.figsize'] = (8,4)
#Reading Data into Dataframes
demographic = pd.read_csv('SEER Breast Cancer Dataset .csv')
diagnostic = pd.read_csv('finalprojectdata.csv')
"""# **Data Cleaning & Preprocessing**
For this section of our tutorial, we will focus on cleaning and prepping our data for further analysis. We will work with both data sets in different ways to get the best analysis from each type of data. For the rest of this tutorial, **demographic** refers to the SEER breast cancer data and **diagnostic** refers to the UCI Wisconsin breast cancer data.
"""
demographic.isna().sum()
#drop the all-null artifact column
demographic.drop('Unnamed: 3', axis = 1, inplace = True)
#standardizing values
#grade text is compared after .split(';')[0].lower(), so entries are lowercase
grade1 = ['grade i', 'grade 1', 'well differentiated', 'differentiated', 'nos']
grade2 = ['grade ii', 'grade 2', 'moderately differentiated', 'intermediate differentiation']
grade3 = ['grade iii', 'grade 3', 'poorly differentiated']
grade4 = ['grade iv', 'grade 4', 'undifferentiated', 'anaplastic']
for i, row in demographic.iterrows():
    # collapse all "Other (...)" race labels into a single 'Other' category
    if str(row['Race ']).split(' ')[0] == "Other":
        demographic.at[i, 'Race '] = 'Other'
    # map the free-text grade descriptions onto numeric grades 1-4
    if str(row['Grade']).split(';')[0].lower() in grade1:
        demographic.at[i, 'Grade'] = 1
    elif str(row['Grade']).split(';')[0].lower() in grade2:
        demographic.at[i, 'Grade'] = 2
    elif str(row['Grade']).split(';')[0].lower() in grade3:
        demographic.at[i, 'Grade'] = 3
    elif str(row['Grade']).split(';')[0].lower() in grade4:
        demographic.at[i, 'Grade'] = 4
    # keep only the stage number (e.g. 'T1' -> '1')
    demographic.at[i, 'T Stage '] = row['T Stage '][-1]
    demographic.at[i, 'N Stage'] = row['N Stage'][-1]
#encode Positive/Negative status values as 1/0
demographic.replace('Positive', 1, inplace = True)
demographic.replace('Negative', 0, inplace = True)
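"""As a quick sanity check (a small sketch added for illustration), we can confirm that the grade standardization above produced only the numeric categories 1-4."""
#verify the standardized grade categories
print(demographic['Grade'].value_counts())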
diagnostic.rename(columns={"concave points_mean": "concave_points_mean", "concave points_worst": "concave_points_worst", "concave points_se": "concave_points_se"}, inplace = True)
diagnostic.isna().sum()
#drop the all-null column
diagnostic.drop('Unnamed: 32', axis = 1, inplace = True)
#drop the unused id column
diagnostic.drop('id', axis = 1, inplace = True)
"""In order to get a better understanding of which factors have a higher effect on mortality in patients, we want to take a closer look at the features (qualitative and quantitative) that we are presented with in both data sets.
Right off the bat, we can see that the diagnostic data set has 30 numerical features for each patient. Because we want to build a model that can help us identify some target features to look for in cancer screenings, we want to reduce the dimensionality of this dataset without losing vital information.
To do so, we will use Principal Component Analysis (PCA), which projects the highly correlated features onto a set of uncorrelated components in order to eliminate collinearity between the features. We will get a better understanding of the multicollinearity within the dataset as we progress further in the tutorial, but for now, we will prep a PCA model for our model building. You can learn more about principal component analysis (PCA) [here](https://builtin.com/data-science/step-step-explanation-principal-component-analysis#:~:text=What%20Is%20Principal%20Component%20Analysis,information%20in%20the%20large%20set.).
"""
# Label Encoding
#Assigning predictors to an array
array = diagnostic.values
X = array[:,1:31]
y = array[:,0]
#Transforming the Diagnosis variable (M, B) into integers (1,0) respectively
le = LabelEncoder()
y = le.fit_transform(y)
# Normalize the data (center around 0 and scale to unit variance).
scaler =StandardScaler()
X = scaler.fit_transform(X)
#put the normalized data into a table with generic feature names
feat_cols = ['feature'+str(i) for i in range(X.shape[1])]
normalised_breast = pd.DataFrame(X,columns=feat_cols)
normalised_breast.head()
#project each entry onto the first three principal components
pca_breast = PCA(n_components=3)
pca_fit = pca_breast.fit_transform(X)
principal_breast_Df = pd.DataFrame(data = pca_fit, columns = ['principal component 1', 'principal component 2', 'principal component 3'])
print('Explained variation per principal component: {}'.format(pca_breast.explained_variance_ratio_))
principal_breast_Df.head()
"""From the output above, we can see that the first principal component explains 44.3% of the variance, the second principal component explains 19% of the variance, and the last principal component explains around 9.4% percent of the variance. It is also important that projecting a thirty dimensional data set to a three dimensional data set has caused a loss of around 27.3% of the data. In order to reduce the amount of variance lost, more principal components can be added to the model. However, for our understanding, we will be utilizing three.
In order to see whether three principal components will suffice for our model, we will be building a scree plot with 10 principal components.
"""
#PCA with 10 components
pca_breast2 = PCA(n_components=10)
fit = pca_breast2.fit_transform(X)
#identifying explained variance for each component
variance = pca_breast2.explained_variance_ratio_
plt.plot(variance)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
leg = plt.legend(['Explained variance from PCA'], loc='best', borderpad=0.3, shadow=False, markerscale=0.4)
leg.get_frame().set_alpha(0.4)
plt.show()
"""A scree plot allows us to understand what the minimal number of principal components needed is to explain an optimal amount of variance in our dataset. This can be seen at the 'elbow' of the scree plot. Once the elbow has been reached, the amount of variance explained by the following principal components decreases steadily and asymptotes to 0. From this scree plot, we can see that the elbow is at 2. Therefore, for our upcoming model, we will be retaining the first three principal component (0,1,2). To learn more about interpreting scree plots with PCA look over [here](https://sanchitamangale12.medium.com/scree-plot-733ed72c8608).
"""
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel('Principal Component - 1',fontsize=10)
ax.set_ylabel('Principal Component - 2',fontsize=10)
ax.set_zlabel('Principal Component - 3',fontsize=10)
ax.set_title("Principal Component Analysis of Breast Cancer Dataset",fontsize=20)
targets = ['B', 'M']
colors = ['r', 'g']
for target, color in zip(targets, colors):
    indicesToKeep = diagnostic['diagnosis'] == target
    ax.scatter(principal_breast_Df.loc[indicesToKeep, 'principal component 1'],
               principal_breast_Df.loc[indicesToKeep, 'principal component 2'],
               principal_breast_Df.loc[indicesToKeep, 'principal component 3'],
               c=color, s=50)
plt.legend(targets,prop={'size': 15})
plt.show()
"""When we plot the three principal components and their relation to our patient diagnoses, we can see that there is a visible distinction between patients who have malignant tumors and patietns who have benign tumors. Again, we can see that majority of the explained variance lies towards PC1, then PC2, and lastly PC3. We will be using this PCA model when building our predictive model.
# **Exploratory Visualizations**
In this section of our analysis, we will be further exploring the features that are in our data sets, and get a beter understanding of the relationships they hold.
We will first start with our **demographic** dataset. To reiterate, we want to see what factors hold a higher correlation to a patients death after being diagnosed with breast cancer. We can first take a look at a statistical summary of all of the values collected in out demographic dataset.
"""
demographic.describe()
"""Now lets take a closer look at the age and race values collected in our dataset and see how they relate to the patients survival. We may first observe a chart of the ages of our patients"""
demographic.head()
fig = plt.figure(figsize = (20,10))
sns.countplot(y = 'Age', data = demographic)
"""This summary allows us to understand that our data was collected on females between the ages of 30 and 69, with the avg age being 54.
Now, lets take a look at the race values
"""
race = demographic['Race ']
# Create a histogram of the column
plt.hist(race)
plt.xlabel('Race')
plt.title('Frequency of Race in Demographic Dataset')
plt.show()
"""Now, we want to see if the age or race of an individual affected the individuals survival and predicted # of survival months. Here, we can calculate the survival based on the race"""
race_groups = demographic.groupby('Race ')
proportions = race_groups['Status'].value_counts(normalize=True)
print(proportions)
"""This analysis allows us to see the survival rates of the different race groups. We can see that the patients that were Black died 25% of the time, which is 10% higher than White patients, who died around 15% of the time. This suggests that Black patients are more likely to face death from breast cancer.
Now, we may observe if Age had any impact on the survival months of our patients:
"""
age_bins = [30, 40, 50, 60, 70]
demographic['Age Range'] = pd.cut(demographic['Age'], age_bins)
proportions_age = demographic.groupby(['Age Range', 'Status'])['Status'].count() / demographic.groupby('Age Range')['Status'].count()
print(proportions_age)
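"""The same table can be produced more compactly with pd.crosstab; this is just an alternative sketch of the computation above, not additional analysis."""
# row-normalized crosstab: survival proportions within each age range
print(pd.crosstab(demographic['Age Range'], demographic['Status'], normalize='index'))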
"""This output allows us to observe that that the proportion of people who survived was highest in the age range of 50-60, where 12.8% of the people died.The proportion of people who survived was lowest in the age range of 60-70, where 20.1% of the people died. This output suggests that age did have an impact on the survival rate of our patients.
Now, that we have taken a closer look at some demographic values that may predict the survival of a patient, lets observethe **diagnostic** dataset, specifically the smoothness_se value to see if we are able to predict whether a tumor is malignant or benign.
"""
smoothness_bins = [0.001, 0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.035]
diagnostic['smoothness_se_range'] = pd.cut(diagnostic['smoothness_se'], smoothness_bins)
proportions_smoothness = diagnostic.groupby(['smoothness_se_range', 'diagnosis'])['diagnosis'].count() / diagnostic.groupby('smoothness_se_range')['diagnosis'].count()
print(proportions_smoothness)
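"""To visualize the same relationship (a sketch added for illustration), a boxplot of smoothness_se by diagnosis shows how the two distributions differ between benign and malignant tumors."""
# distribution of smoothness_se for benign (B) vs. malignant (M) tumors
sns.boxplot(x='diagnosis', y='smoothness_se', data=diagnostic)
plt.title('smoothness_se by Diagnosis')
plt.show()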
"""This output allows us to make several observations about this value and its relation to whether the tumor is malignant or benign. According to this data set, if a patient's tumor's smoothness_se ( the standard error of the tumor's cells' smoothness) value is between 0.03-0.035 then the tumor is highly likley to be malignant. If a patient's tumor's smoothness_se value is between 0.015-0.02 then the tumor is highly likley to be benign. As the smoothness_se value increases from 0.001 to 0.02, the larger the value, the more likley it is for the tumor to be benign. This output shows us that in more analysis, we may be able to use the smoothness_se value to predict whether an individual's tumor is malignant or begning and inturn predict the patient's survival
# **Hypothesis Testing and Model Building**
For this portion of our tutorial, we will now focus on understanding what models best fit our data, and generating a predictive model to determine whether a patient has a malignant or benign tumor.
We wil first start by generating a linear regression model to understand whether or not there is any linear correlation between the diagnosis of a patient's tumor, and factors observed from their screenings.
"""
diagnostic.replace(['M', 'B'], [1, 0], inplace = True)
yvar = diagnostic[['diagnosis']]
# use the 30 original tumor features as predictors (excluding the binned
# smoothness_se_range column added earlier)
xvar = diagnostic.drop(columns=['diagnosis', 'smoothness_se_range'])
xvar = sm.add_constant(xvar)
LinearModel = sm.OLS(yvar, xvar).fit()
print(LinearModel.summary())
"""From this table we can see the positive/negative correlations between the various tumor descriptors and the likelihood of a tumor being malignant or benign. However, the p-values for some of these coefficients is significantly high (let's use the threshold of 0.05). In order to get a better understanding of which factors hold statistically significance, let's only look at the coefficients that have a p value >0.05. To understand more about the interpretation of p-values, take a look at this article [here](https://statisticsbyjim.com/regression/interpret-coefficients-p-values-regression/)!"""
p_values = LinearModel.pvalues
params = LinearModel.params
coeffs = p_values[p_values <= 0.05]
coeffs = coeffs.index.tolist()
sig_coeffs = LinearModel.params[coeffs][1:]
fig, ax = plt.subplots(figsize = (12,12))
ax.barh(sig_coeffs.index.tolist(), width = sig_coeffs)
plt.title('Coefficient of each Feature')
ax.set_ylabel("Factor")
ax.set_xlabel("Coefficient")
"""This plot demonstrates the correlation each statistically significant feature has in determining whether a tumor is malignant or benign. We can see here that smoothness_se (the standard error of the variation in radius lengths along each cell nucleus examined) has the highest correlation. The compactness_mean (the (perimeter^2 / area - 1.0) of each cell nucleus examined) has the next highest correlation. This suggests that when women get their screenings, doctors should pay close attention to the compactness and the radi of each cell nucleus within the tumor examined. Ofcourse this may be hard to do by eye, but an image classifier may be able to compute these values for evaluation.
However, based on this Linear Regression, we can see that it's not a strong model for predicition. The Linear Regression Model has an R-squared value of 0.774, which suggests that only 77.4% of the data is explained through this model.
Also recall we previously generated a PCA model to eliminate possiblity of multicollinearity within our model.
In order to procude a more accurate prediction model, let's try using Logistic Regression with our previoulsy generate PCA.
"""
diagnosis = diagnostic[['diagnosis']]
new_df = principal_breast_Df.join(diagnosis, how = 'outer')
new_df.rename(columns={"principal component 1": "PC1", "principal component 2": "PC2", "principal component 3": "PC3"}, inplace = True)
yvar = new_df[['diagnosis']]
xvar = new_df.iloc[:, :3]
xvar = sm.add_constant(xvar)
LogisticModel = sm.Logit(yvar, xvar).fit()
print(LogisticModel.summary())
"""As it can be seen here, the Logistic Regression model has a better R-sqaured value than the Linear Regression Model (rouhgly 84%). It can also be seen that all the principal components here are significant."""
p_values = LogisticModel.pvalues
params = LogisticModel.params
coeffs = p_values[p_values <= 0.05]
coeffs = coeffs.index.tolist()
sig_coeffs = LogisticModel.params[coeffs][1:]
fig, ax = plt.subplots(figsize = (8,8))
ax.barh(sig_coeffs.index.tolist(), width = sig_coeffs)
plt.title('Coefficient of each Feature')
ax.set_ylabel("Factor")
ax.set_xlabel("Coefficient")
"""This aligns with our principal component analysis. We can see that Principal Component 1 explains the highest amount of variance when identifying whether or not the tumor is malignant. To identify which specific features contribute the greatest weight to each principal component, we can go back to look at the linear coefficients of our PCA."""
# rebuild the scaled feature table with the 30 original feature names
df = pd.DataFrame(data=X, columns=diagnostic.columns[1:31])
pca_breast.fit_transform(df)
# loadings: how strongly each original feature contributes to each component
attr = pd.DataFrame(pca_breast.components_, columns=df.columns, index=['PC1', 'PC2', 'PC3'])
nrows, ncols = attr.shape
attr = attr.abs()
fig, axes = plt.subplots(nrows=nrows, ncols=1, figsize=(12, 8 * nrows))
for i, (index, row) in enumerate(attr.iterrows()):
    axes[i].barh(row.index, row)
    axes[i].set_title(index)
fig.tight_layout()
plt.show()
"""From this visualization we can see which principal component is mostly aligned with what feature. Note that the sign of each coefficient does not matter here (hence the absolute value), since the sign does not affect the amount of variance in each component. Because it was shown from our previous linear regression that PC1 has the highest coefficient, we will look at the top three factors in PC1 that hold the greatest variance:
1. Fractal Dimension (Worst)
2. Symmetry (worst)
3. Concave Points (worst)
Here, 'worst' refers to the mean of the three largest values observed for each cell nucleus within each category.
Fractal Dimension for a tumor is related to the structural porosity of the tissue. As carcinogenesis occurs, the fractal dimension of the tissue increases, thus permitting the diagnosis of malignancy.
Symmetry refers to the symmetrical properties of a cell. Morphologically, the cancerous cell is characterized by a large nucleus, having an irregular size and shape. Hence, symmetrial properties (or the lack there of) hold a great impact on identifying whether a cell is malignant or benign.
Lastly, concanve points refers to the number of concave proportions of the contour. This closely related to cancerous cells having asymmetrical structures.
Thus, it makes sense why these three factors hold high importance in identifying malignance.
We will now build our predictive model to guage whether or not a patient has a malignant or benign tumor using logistic regression and our PCA.
"""
# use only the three principal components as predictors; keeping the
# diagnosis column out of X avoids leaking the target into the model
X = sm.add_constant(new_df[['PC1', 'PC2', 'PC3']])
y = new_df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)
LogisticModel = sm.Logit(y_train, X_train).fit()
yhat = LogisticModel.predict(X_test)
prediction = list(map(round, yhat))
cfm = confusion_matrix(y_test, prediction)
true_negative = cfm[0][0]
false_positive = cfm[0][1]
false_negative = cfm[1][0]
true_positive = cfm[1][1]
print('Confusion Matrix: \n', cfm, '\n')
print('True Negative:', true_negative)
print('False Positive:', false_positive)
print('False Negative:', false_negative)
print('True Positive:', true_positive)
print('Correct Predictions = ', accuracy_score(y_test, prediction) * 100, '%')
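"""Because false negatives (missed malignancies) are especially costly in cancer screening, accuracy alone can be misleading. As a supplementary sketch, we can compute sensitivity and specificity from the confusion matrix above."""
# sensitivity: fraction of malignant tumors correctly flagged
sensitivity = true_positive / (true_positive + false_negative)
# specificity: fraction of benign tumors correctly cleared
specificity = true_negative / (true_negative + false_positive)
print('Sensitivity (true positive rate):', round(sensitivity, 3))
print('Specificity (true negative rate):', round(specificity, 3))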
"""From this correlation matrix we can see that our model does a pretty good job at predicing malignancy in tumors! The Logistic Regression model here has an accuracy rate of roughly 96.5%, which suggests that our Logistic Regression model was a much better fit for the data than a Linear Regression model.
# Insights and Conclusions
The aim of this project was to understand what factors may affect a certain individual's likelihood for being gravely affected by breast cancer. From our exploratory analysis, we were able to see that majority of the women who were already affected by breast cancer were around 54 years of age. Of those women, the death rates for women between the ages of 60-70 was the highest. This information may urge women who are in their early 30s-40s to get regular screenings for breast cancer in order to avoid late diagnosis. We also discovered that individuals in the Black community seem to be at a higher risk for death, and we may urge them to take earlier precautions. We were also able to conclude from our hypothesis testing that from screening data, certain factors such as the fractal dimension, the symmetry, and the concavity of tumor cells are key traits in identifying whether a tumor is malignant or benign.
We hope that this tutorial gave you a better insight as to what factors may influence patients' mortality in regards to breast cancer. Although a rather brutal topic, we hope that this preliminary analysiz may increase awareness on the importance of breast cancer screening for women, and may even urge women who have higher chances of malignancy to talk to their doctors for prevention techniques.
# Sources and Articles for Further Reading
[Risk Prediction Models for Breast Cancer](https://academic.oup.com/jbi/article/3/2/144/6144904)
[Tumor Detection with Machine Learning](https://www.sciencedirect.com/topics/computer-science/tumor-detection)
[Principal Component Analysis in Clinical Studies](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5599285/)
[Machine Learning in Oncology](https://ascopubs.org/doi/full/10.1200/CCI.20.00072)
"""