Adjusted Rand Index

davetang · Jun 19, 2024 · 0c163e0 · 0c163e0
1 parent 8379f6f
commit 0c163e0

Showing 2 changed files with 101 additions and 0 deletions.
diff --git a/analysis/adjusted_rand_index.Rmd b/analysis/adjusted_rand_index.Rmd
@@ -0,0 +1,100 @@
+---
+title: "The Adjusted Rand Index"
+date: "`r Sys.Date()`"
+output:
+  workflowr::wflow_html:
+    toc: false
+---
+
+```{r setup, include=FALSE}
+library(tidyverse)
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+In my last post, I wrote about the [Rand index](rand_index.html). This post will be on the [Adjusted Rand index](https://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index) (ARI), which is the corrected-for-chance version of the Rand index:
+
+$$ \text{AdjustedIndex} = \frac{\text{Index} - \text{ExpectedIndex}}{\text{MaxIndex} - \text{ExpectedIndex}} $$
+
+From Wikipedia:
+
+> The adjusted Rand index is the corrected-for-chance version of the Rand index. Such a correction for chance establishes a baseline by using the expected similarity of all pair-wise comparisons between clusterings specified by a random model.
+>
+> Traditionally, the Rand Index was corrected using the Permutation Model for clusterings (the number and size of clusters within a clustering are fixed, and all random clusterings are generated by shuffling the elements between the fixed clusters). However, the premises of the permutation model are frequently violated; in many clustering scenarios, either the number of clusters or the size distribution of those clusters vary drastically. For example, consider that in _k_-means the number of clusters is fixed by the practitioner, but the sizes of those clusters are inferred from the data. Variations of the adjusted Rand Index account for different models of random clusterings.
+>
+> Though the Rand Index may only yield a value between 0 and 1, the adjusted Rand index can yield negative values if the index is less than the expected index.
+
+Given a set $S$ of $n$ elements, and two groupings or partitions (e.g. clusterings) of these elements, namely $X = \{X_1, X_2, \ldots, X_r\}$ and $Y = \{Y_1, Y_2, \ldots, Y_s\}$, the overlap between $X$ and $Y$ can be summarised in a contingency table $[n_{ij}]$ where each entry $n_{ij}$ denotes the number of objects in common between $X_i$ and $Y_j : n_{ij} = | X_i \cap Y_j |.$
+
+|   |$Y_1$ | $Y_2$ | $\cdots$ | $Y_s$ | $Sums$ |
+|:- |:-    |:-     |:-        |:-     |:-      |
+|$X_1$ | $n_{11}$ | $n_{12}$ | $\cdots$ | $n_{1s}$ | $a_1$ |
+|$X_2$ | $n_{21}$ | $n_{22}$ | $\cdots$ | $n_{2s}$ | $a_2$ |
+|$\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
+|$X_r$ | $n_{r1}$ | $n_{r2}$ | $\cdots$ | $n_{rs}$ | $a_r$ | 
+|$Sums$ | $b_1$ | $b_2$ | $\cdots$ | $b_s$ ||
+
+With this table, the adjusted Rand index is:
+
+$$ARI = \frac{ \sum_{ij} { {n_{ij}}\choose{2} } - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } } { \frac{1}{2} [ \sum_{i} { a_{i}\choose{2} } + \sum_{j} { {b_{j}}\choose{2} } ] - [ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} } } $$
+
+where $n_{ij}$ are the cell counts, $a_i$ the row sums, and $b_j$ the column sums of the contingency table, and $n$ is the total number of elements.
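+
+The formula maps onto the general form above: the index is $\sum_{ij} { {n_{ij}}\choose{2} }$, the expected index is $[ \sum_{i} { {a_{i}}\choose{2} } \sum_{j} { {b_{j}}\choose{2} } ] / { {n}\choose{2} }$, and the maximum index is $\frac{1}{2} [ \sum_{i} { {a_{i}}\choose{2} } + \sum_{j} { {b_{j}}\choose{2} } ]$. Here is a minimal sketch of that mapping in R. The helper `ari_sketch()` and its variable names are hypothetical (they are not from any package), and it assumes two label vectors of equal length.
+
+```{r}
+# ari_sketch() is a hypothetical helper for illustration only
+ari_sketch <- function(x, y) {
+  stopifnot(length(x) == length(y))
+  tab <- table(x, y)                                # contingency table [n_ij]
+  index <- sum(choose(tab, 2))                      # pairs grouped together in both
+  sum_a <- sum(choose(rowSums(tab), 2))             # from the row sums a_i
+  sum_b <- sum(choose(colSums(tab), 2))             # from the column sums b_j
+  expected <- sum_a * sum_b / choose(length(x), 2)  # expected index
+  maximum <- (sum_a + sum_b) / 2                    # maximum index
+  (index - expected) / (maximum - expected)
+}
+
+# identical partitions
+ari_sketch(c(1, 1, 2, 2), c(1, 1, 2, 2))
+# partitions that agree less than expected by chance
+ari_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))
+```
+
+The first call should return 1 and the second -0.5, a small case where the index falls below the expected index and the adjusted value is negative.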
+
+As per usual, it'll be easier to understand with an example. I'll use R to define two sets of randomly assigned cluster labels for the same 12 elements, which represent two clustering results.
+
+```{r}
+x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
+y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)
+
+# positions of the elements assigned to cluster 1 in both x and y
+which(x == 1 & x == y)
+```
+
+In this example there are three clusters in both sets, so our contingency table will have three rows and three columns. We just need to count the co-occurrences to build the contingency table. For example, $n_{11}$ is the number of elements assigned to cluster 1 in both $X$ and $Y$; there are three of them: the sixth, seventh, and tenth elements.
+
+Here's the full contingency table:
+
+|  |$Y_1$ | $Y_2$ | $Y_3$ | Row sums |
+|:-|:-    |:-     |:-     |:-        |
+| $X_1$ | $3$ | $0$ | $1$ | $4$ |
+| $X_2$ | $1$ | $2$ | $1$ | $4$ |
+| $X_3$ | $0$ | $2$ | $2$ | $4$ |
+| Column sums | $4$ | $4$ | $4$ ||
+
+```{r contingency_table}
+table(x, y) |>
+  addmargins()
+```
+
+If you look closely at the ARI formula, there are really just three distinct parts:
+
+* $\sum_{ij} { {n_{ij}}\choose{2} }$
+* $\sum_{i} { {a_{i}}\choose{2} }$
+* $\sum_{j} { {b_{j}}\choose{2} }$
+
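+$\sum$ means the sum, $i$ refers to the row number, $j$ refers to the column number, $a$ refers to the row sum, and $b$ refers to the column sum. Each ${ {k}\choose{2} } = k(k-1)/2$ counts the pairs that can be formed from $k$ elements, so it is 0 when $k$ is 0 or 1. Now let's work out each part.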
+
+* $\sum_{ij} { {n_{ij}}\choose{2} } = { {3}\choose{2} } + { {0}\choose{2} } + { {1}\choose{2} } + { {1}\choose{2} } + { {2}\choose{2} } + { {1}\choose{2} } + { {0}\choose{2} } + { {2}\choose{2} } + { {2}\choose{2} } = 3 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 6$
+* $\sum_{i} { {a_{i}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18$
+* $\sum_{j} { {b_{j}}\choose{2} } = { {4}\choose{2} } + { {4}\choose{2} } + { {4}\choose{2} } = 6 + 6 + 6 = 18$
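+
+The three parts can also be computed with `choose()`. This is a small sketch that rebuilds the contingency table from the same `x` and `y`; the variable names are just for illustration, and it should reproduce the values 6, 18, and 18 worked out by hand.
+
+```{r}
+x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
+y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)
+
+tab <- table(x, y)
+
+# choose() is vectorised, so it can be applied to every cell or margin at once
+sum_nij <- sum(choose(tab, 2))
+sum_a <- sum(choose(rowSums(tab), 2))
+sum_b <- sum(choose(colSums(tab), 2))
+
+c(sum_nij = sum_nij, sum_a = sum_a, sum_b = sum_b)
+```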
+
+Substituting the values into the ARI formula we get:
+
+$$ARI = \frac{6 - [18 \times 18] / { {12}\choose{2} } } {\frac{1}{2} [18+18] - [18 \times 18] / { {12}\choose{2} }} \approx \frac{6 - 4.909091}{18 - 4.909091} \approx 0.08333333$$
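+
+The 4.909091 in this calculation is the expected index under the permutation model described in the Wikipedia excerpt: shuffle the labels while keeping the cluster sizes fixed and count, on average, how many pairs end up grouped together in both clusterings. The following is a rough sketch of that idea and not a definitive implementation; the number of shuffles (10,000) and the seed are arbitrary choices of mine.
+
+```{r}
+x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
+y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)
+
+set.seed(1984) # arbitrary seed, only for reproducibility
+
+# sample(y) permutes the labels, so the cluster sizes stay fixed
+sim <- replicate(10000, sum(choose(table(x, sample(y)), 2)))
+
+# expected index from the formula
+expected <- 18 * 18 / choose(12, 2)
+c(simulated = mean(sim), analytical = expected)
+
+# the substitution above, done in R
+ari <- (6 - expected) / ((18 + 18) / 2 - expected)
+ari
+```
+
+The simulated mean should land close to the analytical value, and the last line should agree with the ARI calculated by hand.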
+
+## Using mclust::adjustedRandIndex()
+
+The [{mclust} package](https://cran.r-project.org/web/packages/mclust/index.html) contains the `adjustedRandIndex()` function that can calculate the Adjusted Rand index.
+
+```{r message=FALSE, warning=FALSE}
+if(!require("mclust")){
+  install.packages("mclust")
+}
+
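+# same cluster labels as in the worked example above, written out explicitly
+# because sample() with a fixed seed may not regenerate them across R versions
+x <- c(1, 2, 3, 3, 2, 1, 1, 3, 3, 1, 2, 2)
+y <- c(3, 2, 3, 2, 2, 1, 1, 2, 3, 1, 3, 1)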
+
+suppressMessages(
+  mclust::adjustedRandIndex(x, y)
+)
+```
diff --git a/analysis/index.Rmd b/analysis/index.Rmd
@@ -74,6 +74,7 @@ I refer to [my blog](https://davetang.org/muse/) often to look up notes and code
 * Euclidean vs. Cosine Distance - original / [updated](cosine.html)
 * Linear algebra basics - original / [updated](linear_algebra.html)
 * The Rand index - [original](https://davetang.org/muse/2017/09/21/the-rand-index/) / [updated](rand_index.html)
+* The Adjusted Rand index - [original](https://davetang.org/muse/2017/09/21/adjusted-rand-index/) / [updated](adjusted_rand_index.html)
 
 ### Misc