---
title: "Breast Cancer - Unsupervised learning"
output:
  html_document:
    df_print: paged
---

Human breast cancer data:

- Ten features are measured for each cell nucleus
- Summary information is provided for each group of cells
- Includes diagnosis: benign (not cancerous) or malignant (cancerous)

## Preparing the data

```{r}
file_path <- "WisconsinCancer.csv"

# Load the data from the CSV file: wisc.df
wisc.df <- read.csv(file_path)

# Convert the feature columns (3 to 32) to a matrix: wisc.data
wisc.data <- as.matrix(wisc.df[, 3:32])

# Set the row names of wisc.data
row.names(wisc.data) <- wisc.df$id

# Create diagnosis vector: 1 = malignant ("M"), 0 = otherwise
diagnosis <- as.numeric(wisc.df$diagnosis == "M")
```

## Quick exploratory data analysis

```{r}
summary(wisc.data)

# How many observations are in this dataset?
nrow(wisc.data)

# How many of the observations have a malignant diagnosis?
sum(diagnosis)
```

## Performing PCA


Check whether the data need to be scaled before performing PCA:

```{r}
# Check column means and standard deviations
print("Means:")
colMeans(wisc.data)

print("SD:")
apply(wisc.data, 2, sd)
```
The input variables use different units of measurement and have substantially different variances, so scaling is appropriate.
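
To illustrate why differing variances matter, here is a minimal sketch on simulated data. The `toy` matrix and its column names are hypothetical and are not part of the Wisconsin data. Without scaling, the high-variance column dominates the first principal component; with scaling, both columns receive equal weight.

```{r}
# Toy data (hypothetical): two independent features on very different scales
set.seed(1)
toy <- cbind(area = rnorm(200, mean = 650, sd = 350),
             smoothness = rnorm(200, mean = 0.1, sd = 0.01))

# PCA without and with scaling
toy.pr.raw <- prcomp(toy, center = TRUE, scale. = FALSE)
toy.pr.scaled <- prcomp(toy, center = TRUE, scale. = TRUE)

# Absolute loadings of PC1: the unscaled PCA is dominated by 'area'
abs(toy.pr.raw$rotation[, 1])
abs(toy.pr.scaled$rotation[, 1])
```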


```{r}
# Execute PCA, with centering and scaling
wisc.pr <- prcomp(wisc.data, center = TRUE, scale. = TRUE)

# Look at summary of results
summary(wisc.pr)
```


## Interpreting PCA results


```{r}
# Create a biplot of wisc.pr
biplot(wisc.pr)

# Scatter plot observations by components 1 and 2
plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC2")

# Scatter plot observations by components 1 and 3
plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC3")

# Scatter plot observations by components 2 and 3
plot(wisc.pr$x[, c(2, 3)], col = (diagnosis + 1), 
     xlab = "PC2", ylab = "PC3")
```

Principal component 2 explains more of the variance in the original data than principal component 3, so the first plot (PC1 vs. PC2) shows a cleaner separation between the two subgroups than the plot of PC1 vs. PC3.


## Variance explained

We will produce scree plots showing the proportion of variance explained as the number of principal components increases, in order to identify a natural number of principal components to keep.
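
In symbols, writing $s_k$ for the standard deviation of the $k$-th principal component (the `sdev` values returned by `prcomp`) and $p$ for the number of components, the two plotted quantities are the proportion of variance explained and its running sum. The names $\mathrm{PVE}$ and $\mathrm{CPVE}$ are notation introduced here for illustration.

$$
\mathrm{PVE}_k = \frac{s_k^2}{\sum_{j=1}^{p} s_j^2}, \qquad \mathrm{CPVE}_m = \sum_{k=1}^{m} \mathrm{PVE}_k
$$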

```{r}
# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))

# Calculate variability of each component
pr.var <- wisc.pr$sdev^2

# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)

# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")
```
There is no obvious elbow, but at least 5 principal components are needed to explain 80% of the variance in the data.
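
The threshold lookup can also be done programmatically rather than read off the plot. The following is a minimal sketch that uses R's built-in `USArrests` data as a stand-in so the chunk is self-contained; the `usa.*`, `threshold`, and `n.keep` names are illustrative.

```{r}
# Stand-in data: built-in USArrests (4 numeric features)
usa.pr <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion and cumulative proportion of variance explained
usa.pve <- usa.pr$sdev^2 / sum(usa.pr$sdev^2)
usa.cum <- cumsum(usa.pve)

# Smallest number of components whose cumulative PVE reaches the threshold
threshold <- 0.8
n.keep <- min(which(usa.cum >= threshold))
n.keep
```

Applied to this analysis, the same expression with `pve` in place of `usa.pve` should reproduce the count read off the cumulative plot.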


## Hierarchical clustering of case data

```{r}
# Scale the wisc.data data: data.scaled
data.scaled <- scale(wisc.data)

# Calculate the (Euclidean) distances: data.dist
data.dist <- dist(data.scaled)

# Create a hierarchical clustering model: wisc.hclust
wisc.hclust <- hclust(data.dist, method = "complete")

# Plot the hierarchical clustering model as dendrogram
plot(wisc.hclust)
```

Cutting the dendrogram at a height of 20 yields 4 clusters.
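
The cluster count at a given height can be checked with `cutree()` and its `h` argument instead of being read off the dendrogram. This is a sketch on simulated data; the `toy.*` objects and the cut height of 3 are hypothetical.

```{r}
# Toy data (hypothetical): 60 observations, 5 features
set.seed(42)
toy.x <- scale(matrix(rnorm(60 * 5), ncol = 5))
toy.hclust <- hclust(dist(toy.x), method = "complete")

# Cut the tree at a chosen height and count the resulting clusters
toy.cut <- cutree(toy.hclust, h = 3)
length(unique(toy.cut))

# Equivalent count: number of merges above the height, plus one
sum(toy.hclust$height > 3) + 1
```

For this analysis, the analogous call would be `cutree(wisc.hclust, h = 20)`.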


## Selecting number of clusters

We will now compare the outputs from the hierarchical clustering model to the actual diagnoses, in order to check the performance of the clustering model.

```{r}
# Cut tree so that it has 4 clusters: wisc.hclust.clusters
wisc.hclust.clusters <- cutree(wisc.hclust, k = 4)

# Compare cluster membership to actual diagnoses
table(wisc.hclust.clusters, diagnosis)
```

Four clusters were picked after some exploration. Before moving on, we may want to explore how different numbers of clusters affect the ability of the hierarchical clustering to separate the different diagnoses.
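
One way to run that exploration is to loop over candidate values of `k` and cross-tabulate each cut against the known labels. This is a sketch on simulated two-group data; `toy.labels`, `toy.feat`, `toy.tree`, and the range 2 to 6 are assumptions made for illustration.

```{r}
# Toy data (hypothetical): two groups of 50 with shifted means
set.seed(7)
toy.labels <- rep(c(0, 1), each = 50)
toy.feat <- matrix(rnorm(100 * 4), ncol = 4) + 3 * toy.labels
toy.tree <- hclust(dist(scale(toy.feat)), method = "complete")

# Cross-tabulate cluster membership against the labels for several k
for (k in 2:6) {
  cat("k =", k, "\n")
  print(table(cluster = cutree(toy.tree, k = k), label = toy.labels))
}
```

Replacing `toy.tree` and `toy.labels` with `wisc.hclust` and `diagnosis` would produce the corresponding tables for the model above.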


## k-means clustering and comparing results

There are two main types of clustering: hierarchical and k-means.
We now fit a k-means clustering model to the Wisconsin breast cancer data and compare its results to the actual diagnoses and to the results of the hierarchical clustering model.

```{r}
# Create a k-means model on the scaled wisc.data: wisc.km
wisc.km <- kmeans(scale(wisc.data), centers = 2, nstart = 20)

# Compare k-means to actual diagnoses
table(wisc.km$cluster, diagnosis)

# Compare k-means to hierarchical clustering
table(wisc.km$cluster, wisc.hclust.clusters)
```

In the second table, clusters 1, 2, and 4 from the hierarchical clustering model correspond to cluster 1 from the k-means algorithm, and cluster 3 corresponds to cluster 2. Note that k-means cluster labels are arbitrary, so the numbering can swap between runs.
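
This correspondence can also be derived programmatically by taking, for each hierarchical cluster, the k-means cluster that holds most of its members. The sketch below uses made-up membership vectors; `km.toy` and `hc.toy` are hypothetical and are not the model outputs above.

```{r}
# Hypothetical cluster assignments for 10 observations
km.toy <- c(1, 1, 1, 2, 2, 2, 1, 1, 2, 1)
hc.toy <- c(1, 1, 2, 3, 3, 3, 4, 4, 3, 2)

# Rows: k-means clusters; columns: hierarchical clusters
toy.tab <- table(kmeans = km.toy, hclust = hc.toy)
toy.tab

# For each hierarchical cluster, the row index of the k-means cluster
# with the most members (row index equals the label here: 1 or 2)
toy.map <- apply(toy.tab, 2, which.max)
toy.map
```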

 
## Clustering on PCA results

The PCA model required significantly fewer features to describe 80% of the variability of the data. In addition to normalizing the data and potentially avoiding overfitting, PCA also decorrelates the variables, which sometimes improves the performance of other modeling techniques.

Let's see if PCA improves or degrades the performance of hierarchical clustering.

```{r}
# Create a hierarchical clustering model on the first 7 principal components: wisc.pr.hclust
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")

# Cut model into 4 clusters: wisc.pr.hclust.clusters
wisc.pr.hclust.clusters <- cutree(wisc.pr.hclust, k = 4)

# Compare to actual diagnoses
table(diagnosis, wisc.pr.hclust.clusters)

# Compare to k-means and hierarchical
table(diagnosis, wisc.km$cluster)
table(diagnosis, wisc.hclust.clusters)
```

We can conclude that PCA improves the performance of hierarchical clustering.
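
A comparison like this can be summarized with a single number per model. One simple option is cluster purity: assign each cluster its majority diagnosis and compute the fraction of observations that match. The helper `purity()` below is a hypothetical name, not a base R function, and the vectors are made up for illustration.

```{r}
# Purity: share of observations that fall in their cluster's majority class
purity <- function(clusters, labels) {
  tab <- table(clusters, labels)
  sum(apply(tab, 1, max)) / sum(tab)
}

# Hypothetical labels and two competing cluster assignments
toy.diag <- c(0, 0, 0, 0, 1, 1, 1, 1)
toy.a <- c(1, 1, 1, 2, 2, 2, 2, 2)  # one observation misplaced
toy.b <- c(1, 1, 2, 2, 1, 2, 2, 1)  # clusters unrelated to the labels

purity(toy.a, toy.diag)
purity(toy.b, toy.diag)
```

Applied here, `purity(wisc.hclust.clusters, diagnosis)` and `purity(wisc.pr.hclust.clusters, diagnosis)` would give comparable figures for the two hierarchical models. Purity tends to rise as the number of clusters grows, so the comparison is only fair at equal `k`, as is the case for the two 4-cluster cuts.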