Adjusted Rand Index
davetang committed Jun 19, 2024
1 parent 8379f6f commit 0c163e0
Showing 2 changed files with 101 additions and 0 deletions.
100 changes: 100 additions & 0 deletions analysis/adjusted_rand_index.Rmd
@@ -0,0 +1,100 @@
---
title: "The Adjusted Rand Index"
date: "`r Sys.Date()`"
output:
  workflowr::wflow_html:
    toc: false
---

```{r setup, include=FALSE}
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)
```

In my last post, I wrote about the [Rand index](rand_index.html). This post will be on the [Adjusted Rand index](https://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index) (ARI), which is the corrected-for-chance version of the Rand index:

$$ \text{AdjustedIndex} = \frac{\text{Index} - \text{ExpectedIndex}}{\text{MaxIndex} - \text{ExpectedIndex}} $$

From Wikipedia:

> The adjusted Rand index is the corrected-for-chance version of the Rand index. Such a correction for chance establishes a baseline by using the expected similarity of all pair-wise comparisons between clusterings specified by a random model.
>
> Traditionally, the Rand Index was corrected using the Permutation Model for clusterings (the number and size of clusters within a clustering are fixed, and all random clusterings are generated by shuffling the elements between the fixed clusters). However, the premises of the permutation model are frequently violated; in many clustering scenarios, either the number of clusters or the size distribution of those clusters vary drastically. For example, consider that in _k_-means the number of clusters is fixed by the practitioner, but the sizes of those clusters are inferred from the data. Variations of the adjusted Rand Index account for different models of random clusterings.
>
> Though the Rand Index may only yield a value between 0 and 1, the adjusted Rand index can yield negative values if the index is less than the expected index.

Given a set $S$ of $n$ elements, and two groupings or partitions (e.g. clusterings) of these elements, namely $X = \{X_1, X_2, \ldots, X_r\}$ and $Y = \{Y_1, Y_2, \ldots, Y_s\}$, the overlap between $X$ and $Y$ can be summarised in a contingency table $[n_{ij}]$ where each entry $n_{ij}$ denotes the number of objects in common between $X_i$ and $Y_j : n_{ij} = | X_i \cap Y_j |.$

| | $Y_1$ | $Y_2$ | $\cdots$ | $Y_s$ | Sums |
|:- |:- |:- |:- |:- |:- |
| $X_1$ | $n_{11}$ | $n_{12}$ | $\cdots$ | $n_{1s}$ | $a_1$ |
| $X_2$ | $n_{21}$ | $n_{22}$ | $\cdots$ | $n_{2s}$ | $a_2$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| $X_r$ | $n_{r1}$ | $n_{r2}$ | $\cdots$ | $n_{rs}$ | $a_r$ |
| Sums | $b_1$ | $b_2$ | $\cdots$ | $b_s$ | |

Using the entries and margins of this table, the adjusted Rand index is:

$$ARI = \frac{ \sum_{ij} { {n_{ij}}\choose{2} } - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } } { \frac{1}{2} [ \sum_{i} { a_{i}\choose{2} } + \sum_{j} { {b_{j}}\choose{2} } ] - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } } $$

where $n_{ij}$, $a_i$, and $b_j$ are values from the contingency table, and $n$ is the total number of elements.
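
Here's a minimal sketch of the formula above as an R function. The name `ari_from_labels()` is my own (it is not from any package), and the sketch assumes two label vectors of equal length and does not handle the degenerate case where the denominator is zero.

```{r}
# Hypothetical helper: ARI computed directly from the contingency table.
# Assumes x and y are cluster labels for the same elements, in the same order.
ari_from_labels <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)

  # The three sums of "n choose 2" terms in the formula
  sum_nij <- sum(choose(as.vector(tab), 2))
  sum_a <- sum(choose(rowSums(tab), 2))
  sum_b <- sum(choose(colSums(tab), 2))

  # Expected index under the permutation model, and the maximum index
  expected <- sum_a * sum_b / choose(length(x), 2)
  max_index <- (sum_a + sum_b) / 2

  (sum_nij - expected) / (max_index - expected)
}

# Same partition with swapped labels: perfect agreement
ari_from_labels(c(1, 1, 2, 2), c(2, 2, 1, 1)) # 1

# Every pair grouped together in one clustering is split in the other
ari_from_labels(c(1, 1, 2, 2), c(1, 2, 1, 2)) # -0.5
```

The second call shows the point from the Wikipedia quote: the index (0) is lower than the expected index (about 0.67), so the adjusted value is negative.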

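As usual, this is easier to understand with an example. I'll use R to define two vectors of cluster labels, which represent two clustering results for the same 12 elements. The `which()` call returns the positions of the elements that are in cluster 1 in both vectors.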

```{r}
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

which(x == 1 & x == y)
```

In this example there are 3 clusters in both sets, so our contingency table will have three rows and three columns. We just need to count the co-occurrences to build the contingency table. $n_{11}$ would be the number of times an element occurs in cluster 1 of X and cluster 1 of Y; this occurs three times: the sixth, seventh, and tenth elements.

Here's the full contingency table:

| | $Y_1$ | $Y_2$ | $Y_3$ | Row sums |
|:- |:- |:- |:- |:- |
| $X_1$ | $3$ | $0$ | $1$ | $4$ |
| $X_2$ | $1$ | $2$ | $1$ | $4$ |
| $X_3$ | $0$ | $2$ | $2$ | $4$ |
| Column sums | $4$ | $4$ | $4$ | |

```{r contingency_table}
table(x, y) |>
  addmargins()
```

If you look closely at the ARI formula, there are really just three distinct parts:

* $\sum_{ij} { {n_{ij}}\choose{2} }$
* $\sum_{i} { {a_{i}}\choose{2} }$
* $\sum_{j} { {b_{j}}\choose{2} }$

$\sum$ means the sum, $i$ refers to the row number, $j$ refers to the column number, $a$ refers to the row sum, and $b$ refers to the column sum. Now let's work out each part.

* $\sum_{ij} { {n_{ij}}\choose{2} } = { {3}\choose{2} } + { {0}\choose{2} } + { {1}\choose{2} } + { {1}\choose{2} } + { {2}\choose{2} } + { {1}\choose{2} } + { {0}\choose{2} } + { {2}\choose{2} } + { {2}\choose{2} } = 3 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 6$
* $\sum_{i} { {a_{i}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18$
* $\sum_{j} { {b_{j}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18$

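
The three parts worked out above can be checked in R with `choose()`. This is just a sketch that rebuilds the contingency table from the two example vectors; the variable names `sum_nij`, `sum_a`, and `sum_b` are my own.

```{r}
# The two example clusterings from earlier in the post
x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)
tab <- table(x, y)

# Sum of "n choose 2" over every cell, the row sums, and the column sums
sum_nij <- sum(choose(as.vector(tab), 2)) # 6
sum_a <- sum(choose(rowSums(tab), 2))     # 18
sum_b <- sum(choose(colSums(tab), 2))     # 18

c(sum_nij = sum_nij, sum_a = sum_a, sum_b = sum_b)
```
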

Substituting the values into the ARI formula we get:

$$ARI = \frac{6 - [18 \times 18] / { {12}\choose{2} } } {\frac{1}{2} [18+18] - [18 \times 18] / { {12}\choose{2} }} = \frac{6 - 4.909091}{18 - 4.909091} = 0.08333333$$
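
The same substitution can be done in R as a quick arithmetic check. The numbers are the three sums from the worked example and $n = 12$; the variable names are only for illustration.

```{r}
# Values from the worked example
sum_nij <- 6
sum_a <- 18
sum_b <- 18
n <- 12

expected <- sum_a * sum_b / choose(n, 2) # 324 / 66 = 4.909091
max_index <- (sum_a + sum_b) / 2         # 18

ari <- (sum_nij - expected) / (max_index - expected)
ari # 0.08333333, i.e. 1/12
```
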

## Using mclust::adjustedRandIndex()

The [{mclust} package](https://cran.r-project.org/web/packages/mclust/index.html) provides the `adjustedRandIndex()` function, which calculates the Adjusted Rand index.

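```{r message=FALSE, warning=FALSE}
# Install {mclust} if it is not already available
if (!require("mclust")) {
  install.packages("mclust")
}

# Generate two random clusterings of 12 elements,
# each with three clusters of four elements
set.seed(1)
x <- sample(x = rep(1:3, 4), 12)
set.seed(2)
y <- sample(x = rep(1:3, 4), 12)

suppressMessages(
  mclust::adjustedRandIndex(x, y)
)
```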
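
As a sanity check, `adjustedRandIndex()` can also be run on the two hard-coded vectors from the worked example, which should agree with the hand calculation of 1/12. This is a sketch that assumes {mclust} is installed (it is skipped otherwise); `x_example` and `y_example` are names I've made up so that the sampled `x` and `y` are not overwritten.

```{r}
# The two clusterings used in the hand calculation
x_example <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
y_example <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)

# Only run the comparison if {mclust} is available
if (requireNamespace("mclust", quietly = TRUE)) {
  ari_mclust <- mclust::adjustedRandIndex(x_example, y_example)
  print(ari_mclust)
  # Compare against the value worked out by hand
  print(all.equal(ari_mclust, 1 / 12))
}
```
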
1 change: 1 addition & 0 deletions analysis/index.Rmd
@@ -74,6 +74,7 @@ I refer to [my blog](https://davetang.org/muse/) often to look up notes and code
* Euclidean vs. Cosine Distance - original / [updated](cosine.html)
* Linear algebra basics - original / [updated](linear_algebra.html)
* The Rand index - [original](https://davetang.org/muse/2017/09/21/the-rand-index/) / [updated](rand_index.html)
* The Adjusted Rand index - [original](https://davetang.org/muse/2017/09/21/adjusted-rand-index/) / [updated](adjusted_rand_index.html)

### Misc
