---
title: "Breast Cancer - Unsupervised learning"
output:
  html_document:
    df_print: paged
---
Human breast data:

- Ten features are measured for each cell nucleus
- Summary information is provided for each group of cells
- Each observation includes a diagnosis: benign (not cancerous) or malignant (cancerous)

## Preparing the data
```{r}
file_path <- "WisconsinCancer.csv"
# Read the data from the local CSV file: wisc.df
wisc.df <- read.csv(file_path)
# Convert the feature columns (3 to 32) to a numeric matrix: wisc.data
wisc.data <- as.matrix(wisc.df[, 3:32])
# Use the id column as the row names of wisc.data
row.names(wisc.data) <- wisc.df$id
# Create a numeric diagnosis vector: 1 = malignant ("M"), 0 = otherwise
diagnosis <- as.numeric(wisc.df$diagnosis == "M")
```
## Quick exploratory data analysis
```{r}
summary(wisc.data)
# How many observations are in this dataset?
nrow(wisc.data)
# How many of the observations have a malignant diagnosis?
sum(diagnosis)
```
## Performing PCA
Check if the data need to be scaled before performing PCA:
```{r}
# Check column means and standard deviations
print("Means:")
colMeans(wisc.data)
print("SD:")
apply(wisc.data, 2, sd)
```
The input variables use different units of measurement and have substantially different variances, so scaling is appropriate.
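
To illustrate why scaling matters, here is a minimal sketch on synthetic data (not the Wisconsin data). The names `demo_scale`, `pca_raw`, and `pca_scaled` are made up for this illustration. It compares the first principal component with and without scaling when one variable has a much larger variance than the other.

```{r}
# Synthetic illustration only: two independent variables on very different scales
set.seed(42)
demo_scale <- cbind(small = rnorm(200, sd = 1),
                    large = rnorm(200, sd = 1000))

# Without scaling, PC1 is dominated by the high-variance variable
pca_raw <- prcomp(demo_scale, center = TRUE, scale. = FALSE)
# With scaling, both variables get loadings of equal magnitude on PC1
pca_scaled <- prcomp(demo_scale, center = TRUE, scale. = TRUE)

# Absolute PC1 loadings for each case
round(abs(pca_raw$rotation[, 1]), 3)
round(abs(pca_scaled$rotation[, 1]), 3)
```

Without scaling, the variable with the largest variance determines PC1 almost entirely, regardless of whether it is the most informative one.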
```{r}
# Execute PCA, with scaling
wisc.pr <- prcomp(wisc.data, center = TRUE, scale. = TRUE)
# Look at summary of results
summary(wisc.pr)
```
## Interpreting PCA results
```{r}
# Create a biplot of wisc.pr
biplot(wisc.pr)
# Scatter plot observations by components 1 and 2
plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1),
     xlab = "PC1", ylab = "PC2")
# Scatter plot observations by components 1 and 3
plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1),
     xlab = "PC1", ylab = "PC3")
# Scatter plot observations by components 2 and 3
plot(wisc.pr$x[, c(2, 3)], col = (diagnosis + 1),
     xlab = "PC2", ylab = "PC3")
```
Because principal component 2 explains more variance in the original data than principal component 3, the first plot (PC1 vs. PC2) shows a cleaner separation between the two subgroups than the plot of PC1 vs. PC3.
## Variance explained
We will produce scree plots showing the proportion of variance explained as the number of principal components increases, in order to identify a natural number of principal components to keep.
```{r}
# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))
# Calculate variability of each component
pr.var <- wisc.pr$sdev^2
# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
```
There is no obvious elbow, but at least 5 principal components are needed to explain 80% of the variance in the data.
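
The number of components needed for a given threshold can also be computed rather than read off the plot. The following is a minimal sketch: `min_components` is a hypothetical helper, not part of the original analysis, and the toy standard deviations are invented for the demonstration.

```{r}
# Hypothetical helper: smallest number of components whose cumulative
# proportion of variance explained reaches `threshold`
min_components <- function(sdev, threshold = 0.80) {
  pve <- sdev^2 / sum(sdev^2)
  which(cumsum(pve) >= threshold)[1]
}

# Toy standard deviations (variances 4, 3, 2, 1), not from the Wisconsin data
toy_sdev <- sqrt(c(4, 3, 2, 1))
min_components(toy_sdev, 0.80)
```

In this document, `min_components(wisc.pr$sdev, 0.80)` should agree with the count read from the cumulative plot.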
## Hierarchical clustering of case data
```{r}
# Scale the wisc.data data: data.scaled
data.scaled <- scale(wisc.data)
# Calculate the (Euclidean) distances: data.dist
data.dist <- dist(data.scaled)
# Create a hierarchical clustering model: wisc.hclust
wisc.hclust <- hclust(data.dist, method = "complete")
# Plot the hierarchical clustering model as dendrogram
plot(wisc.hclust)
```
Cutting the dendrogram at a height of 20 yields 4 clusters.
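
A cut height and a cluster count are two ways of specifying the same kind of partition. As a small sketch on toy data (the objects `toy_points`, `toy_hclust`, `by_height`, and `by_count` are illustrative only), `cutree()` accepts either `h` or `k`:

```{r}
# Toy data only: three well-separated pairs of points on a line
toy_points <- matrix(c(0, 1, 10, 11, 30, 31), ncol = 1)
toy_hclust <- hclust(dist(toy_points), method = "complete")

# Cutting by height and cutting by cluster count can give the same partition
by_height <- cutree(toy_hclust, h = 5)
by_count  <- cutree(toy_hclust, k = 3)
table(by_height, by_count)
```

For the model above, `table(cutree(wisc.hclust, h = 20))` could be used to check the number of clusters at height 20.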
## Selecting number of clusters
We will now compare the outputs from the hierarchical clustering model to the actual diagnoses, in order to check the performance of the clustering model.
```{r}
# Cut tree so that it has 4 clusters: wisc.hclust.clusters
wisc.hclust.clusters <- cutree(wisc.hclust, k = 4)
# Compare cluster membership to actual diagnoses
table(wisc.hclust.clusters, diagnosis)
```
Four clusters were picked after some exploration. Before moving on, we may want to explore how different numbers of clusters affect the ability of the hierarchical clustering to separate the different diagnoses.
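
One way to carry out that exploration is to cross-tabulate the clusters against the diagnoses for several values of `k`. This is a sketch only: `compare_k` is a hypothetical helper introduced here, and the toy data stand in for the real model.

```{r}
# Hypothetical helper: cross-tabulate cluster membership against known labels
# for several candidate numbers of clusters
compare_k <- function(hc, labels, ks = 2:6) {
  tabs <- lapply(ks, function(k) table(cluster = cutree(hc, k = k), labels))
  names(tabs) <- paste0("k=", ks)
  tabs
}

# Toy data only: two groups of three points each
toy_x <- matrix(c(0, 1, 2, 10, 11, 12), ncol = 1)
toy_labels <- c(0, 0, 0, 1, 1, 1)
toy_tabs <- compare_k(hclust(dist(toy_x), method = "complete"),
                      toy_labels, ks = 2:3)
toy_tabs[["k=2"]]
```

Applied to this analysis, a call such as `compare_k(wisc.hclust, diagnosis, ks = 2:10)` would return one table per candidate number of clusters.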
## k-means clustering and comparing results
There are two main types of clustering: hierarchical and k-means.
We will now fit a k-means model to the Wisconsin breast cancer data and compare its clusters with the actual diagnoses and with the results of the hierarchical clustering model.
```{r}
# Create a k-means model on the scaled wisc.data: wisc.km
wisc.km <- kmeans(scale(wisc.data), centers = 2, nstart = 20)
# Compare k-means to actual diagnoses
table(wisc.km$cluster, diagnosis)
# Compare k-means to hierarchical clustering
table(wisc.km$cluster, wisc.hclust.clusters)
```
In the second table, clusters 1, 2, and 4 from the hierarchical clustering model correspond to cluster 1 from the k-means algorithm, and cluster 3 corresponds to k-means cluster 2. Note that k-means cluster labels are arbitrary and can swap between runs, because the algorithm starts from random centers.
## Clustering on PCA results
The PCA model needed far fewer dimensions than the original 30 features to describe 80% of the variability of the data. In addition to normalizing data and potentially avoiding overfitting, PCA also decorrelates the variables, which sometimes improves the performance of other modeling techniques.
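
The decorrelation property can be checked directly. Below is a minimal sketch on synthetic data (the names `toy_a`, `toy_corr`, and `toy_scores` are illustrative and not part of the original analysis): two strongly correlated inputs produce principal component scores with essentially zero correlation.

```{r}
# Toy data only: two strongly correlated variables
set.seed(1)
toy_a <- rnorm(100)
toy_corr <- cbind(a = toy_a, b = toy_a + rnorm(100, sd = 0.3))

# Correlation between the two input variables
cor(toy_corr)[1, 2]
# Correlation between the two principal component scores
toy_scores <- prcomp(toy_corr, scale. = TRUE)$x
round(cor(toy_scores)[1, 2], 10)
```

The same check could be run on this data with `round(cor(wisc.pr$x[, 1:7]), 3)`.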
Let's see if PCA improves or degrades the performance of hierarchical clustering.
```{r}
# Create a hierarchical clustering model on the first 7 principal components: wisc.pr.hclust
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")
# Cut model into 4 clusters: wisc.pr.hclust.clusters
wisc.pr.hclust.clusters <- cutree(wisc.pr.hclust, k = 4)
# Compare to actual diagnoses
table(diagnosis, wisc.pr.hclust.clusters)
# Compare to k-means and hierarchical
table(diagnosis, wisc.km$cluster)
table(diagnosis, wisc.hclust.clusters)
```
Comparing the tables above, we can conclude that, for this data, PCA improves the performance of hierarchical clustering.
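
To back the comparison with a single number, one option is cluster purity: the fraction of observations that belong to the majority diagnosis of their cluster. This is a sketch only; `cluster_purity` is a hypothetical helper, and the toy labels below are invented, not results from the Wisconsin data.

```{r}
# Hypothetical helper: cluster purity, i.e. the fraction of observations that
# fall in the majority class of their assigned cluster (1 = perfect)
cluster_purity <- function(clusters, truth) {
  tab <- table(clusters, truth)
  sum(apply(tab, 1, max)) / sum(tab)
}

# Toy labels only, not results from the Wisconsin data
toy_truth    <- c(0, 0, 0, 0, 1, 1, 1, 1)
toy_clusters <- c(1, 1, 1, 2, 2, 2, 2, 2)
cluster_purity(toy_clusters, toy_truth)
```

Purity tends to rise as the number of clusters grows, so it is only a fair comparison between models with the same `k`, as is the case for `cluster_purity(wisc.hclust.clusters, diagnosis)` and `cluster_purity(wisc.pr.hclust.clusters, diagnosis)`.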