Showing 2 changed files with 101 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
--- | ||
title: "The Adjusted Rand Index" | ||
date: "`r Sys.Date()`" | ||
output: | ||
workflowr::wflow_html: | ||
toc: false | ||
--- | ||
|
||
```{r setup, include=FALSE} | ||
library(tidyverse) | ||
knitr::opts_chunk$set(echo = TRUE) | ||
``` | ||
|
||
In my last post, I wrote about the [Rand index](rand_index.html). This post will be on the [Adjusted Rand index](https://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index) (ARI), which is the corrected-for-chance version of the Rand index: | ||
|
||
$$ AdjustedIndex = \frac{Index - ExpectedIndex}{MaxIndex - ExpectedIndex} $$ | ||
|
||
From Wikipedia: | ||
|
||
> The adjusted Rand index is the corrected-for-chance version of the Rand index. Such a correction for chance establishes a baseline by using the expected similarity of all pair-wise comparisons between clusterings specified by a random model. | ||
> | ||
> Traditionally, the Rand Index was corrected using the Permutation Model for clusterings (the number and size of clusters within a clustering are fixed, and all random clusterings are generated by shuffling the elements between the fixed clusters). However, the premises of the permutation model are frequently violated; in many clustering scenarios, either the number of clusters or the size distribution of those clusters vary drastically. For example, consider that in _k_-means the number of clusters is fixed by the practitioner, but the sizes of those clusters are inferred from the data. Variations of the adjusted Rand Index account for different models of random clusterings. | ||
> | ||
> Though the Rand Index may only yield a value between 0 and 1, the adjusted Rand index can yield negative values if the index is less than the expected index. | ||
|
||
Given a set $S$ of $n$ elements, and two groupings or partitions (e.g. clusterings) of these elements, namely $X = \{X_1, X_2, \ldots, X_r\}$ and $Y = \{Y_1, Y_2, \ldots, Y_s\}$, the overlap between $X$ and $Y$ can be summarised in a contingency table $[n_{ij}]$ where each entry $n_{ij}$ denotes the number of objects in common between $X_i$ and $Y_j : n_{ij} = | X_i \cap Y_j |.$ | ||
|
||
| |$Y_1$ | $Y_2$ | $\cdots$ | $Y_s$ | $Sums$ | | ||
|:- |:- |:- |:- |:- |:- | | ||
|$X_1$ | $n_{11}$ | $n_{12}$ | $\cdots$ | $n_{1s}$ | $a_1$ | | ||
|$X_2$ | $n_{21}$ | $n_{22}$ | $\cdots$ | $n_{2s}$ | $a_2$ | | ||
|$\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ | | ||
|$X_r$ | $n_{r1}$ | $n_{r2}$ | $\cdots$ | $n_{rs}$ | $a_r$ | | ||
|$Sums$ | $b_1$ | $b_2$ | $\cdots$ | $b_s$ || | ||
|
||
the adjusted index is: | ||
|
||
$$ARI = \frac{ \sum_{ij} { {n_{ij}}\choose{2} } - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } } { \frac{1}{2} [ \sum_{i} { a_{i}\choose{2} } + \sum_{j} { {b_{j}}\choose{2} } ] - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } } $$ | ||
|
||
where $n_{ij}$, $a_i$, $b_j$ are values from the contingency table. | ||
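
The formula above translates fairly directly into base R. What follows is a minimal sketch rather than the implementation used by any package: the function name `ari_sketch` is hypothetical, and it assumes two label vectors of equal length for which the denominator is non-zero.

```{r}
# Hypothetical helper: a direct transcription of the ARI formula.
# Assumes `x` and `y` are label vectors of equal length.
ari_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)                      # contingency table [n_ij]
  sum_nij <- sum(choose(tab, 2))          # sum over cells of choose(n_ij, 2)
  sum_a <- sum(choose(rowSums(tab), 2))   # sum over rows of choose(a_i, 2)
  sum_b <- sum(choose(colSums(tab), 2))   # sum over columns of choose(b_j, 2)
  expected <- sum_a * sum_b / choose(length(x), 2)
  (sum_nij - expected) / (0.5 * (sum_a + sum_b) - expected)
}

# Same partition with the labels renamed.
ari_same <- ari_sketch(c(1, 1, 2, 2), c(2, 2, 1, 1))

# Partitions that cut across each other.
ari_crossed <- ari_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))

c(same = ari_same, crossed = ari_crossed)
```

With these inputs the first value is 1 and the second is -0.5, which illustrates the quoted remark that the adjusted index can drop below zero when the index is less than its expected value.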
|
||
As per usual, it'll be easier to understand with an example. I'll use R to create two vectors of cluster labels for the same 12 elements, which represent two clustering results. | ||
|
||
```{r} | ||
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2) | ||
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1) | ||
|
||
which(x == 1 & x == y) | ||
``` | ||
|
||
In this example there are 3 clusters in both sets, so our contingency table will have three rows and three columns. We just need to count the co-occurrences to build the contingency table. $n_{11}$ would be the number of times an element occurs in cluster 1 of X and cluster 1 of Y; this occurs three times: the sixth, seventh, and tenth elements. | ||
|
||
Here's the full contingency table: | ||
| |$Y_1$ | $Y_2$ | $Y_3$ | Row sums | | ||
|:-|:- |:- |:- |:- | | ||
| $X_1$ | $3$ | $0$ | $1$ | $4$ | | ||
| $X_2$ | $1$ | $2$ | $1$ | $4$ | | ||
| $X_3$ | $0$ | $2$ | $2$ | $4$ | | ||
| Column sums | $4$ | $4$ | $4$ || | ||
```{r contingency_table} | ||
table(x, y) |> | ||
addmargins() | ||
``` | ||
If you look closely at the ARI formula, there are really just three different parts: | ||
|
||
* $\sum_{ij} { {n_{ij}}\choose{2} }$ | ||
* $\sum_{i} { {a_{i}}\choose{2} }$ | ||
* $\sum_{j} { {b_{j}}\choose{2} }$ | ||
|
||
$\sum$ means the sum, $i$ refers to the row number, $j$ refers to the column number, $a$ refers to the row sum, and $b$ refers to the column sum. Now let's work out each part. | ||
* $\sum_{ij} { {n_{ij}}\choose{2} } = { {3}\choose{2} } + { {0}\choose{2} } + { {1}\choose{2} } + { {1}\choose{2} } + { {2}\choose{2} } + { {1}\choose{2} } + { {0}\choose{2} } + { {2}\choose{2} } + { {2}\choose{2} } = 3 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 6$ | ||
* $\sum_{i} { {a_{i}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18$ | ||
* $\sum_{j} { {b_{j}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18$ | ||
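
The three sums worked out above can be checked in R with `choose()`, which is vectorised. This is a small sketch that re-declares the example vectors so that it runs on its own; the variable names `tab`, `sum_nij`, `sum_a` and `sum_b` are my own and not part of any package.

```{r}
# Example vectors repeated here so the chunk is self-contained.
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)
tab <- table(x, y)

sum_nij <- sum(choose(tab, 2))          # pairs that share a cell
sum_a <- sum(choose(rowSums(tab), 2))   # pairs that share a row (cluster of X)
sum_b <- sum(choose(colSums(tab), 2))   # pairs that share a column (cluster of Y)

c(sum_nij = sum_nij, sum_a = sum_a, sum_b = sum_b)
```

The three values are 6, 18, and 18, matching the hand calculation.
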
Substituting the values into the ARI formula we get: | ||
$$ARI = \frac{6 - [18 \times 18] / { {12}\choose{2} } } {\frac{1}{2} [18+18] - [18 \times 18] / { {12}\choose{2} }} = \frac{6 - 4.909091}{18 - 4.909091} = 0.08333333$$ | ||
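
The substitution can also be reproduced step by step in R. The sketch below simply plugs in the three sums from the worked example; the names `expected`, `max_index` and `ari` are my own, chosen to mirror the $ExpectedIndex$ and $MaxIndex$ terms of the general formula at the top of the post.

```{r}
sum_nij <- 6    # Index: sum of choose(n_ij, 2)
sum_a <- 18     # sum of choose(a_i, 2)
sum_b <- 18     # sum of choose(b_j, 2)
n <- 12         # number of elements

expected <- sum_a * sum_b / choose(n, 2)   # ExpectedIndex: 324 / 66
max_index <- 0.5 * (sum_a + sum_b)         # MaxIndex: 18

ari <- (sum_nij - expected) / (max_index - expected)
ari
```

This evaluates to 1/12, i.e. 0.08333333, in agreement with the result above.
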
## Using mclust::adjustedRandIndex() | ||
The [{mclust} package](https://cran.r-project.org/web/packages/mclust/index.html) contains the `adjustedRandIndex()` function that can calculate the Adjusted Rand index. | ||
```{r message=FALSE, warning=FALSE} | ||
if(!require("mclust")){ | ||
install.packages("mclust") | ||
} | ||
set.seed(1) | ||
x <- sample(x = rep(1:3, 4), 12) | ||
set.seed(2) | ||
y <- sample(x = rep(1:3, 4), 12) | ||
suppressMessages( | ||
mclust::adjustedRandIndex(x, y) | ||
) | ||
``` |