Introduction
The empirical quantile-quantile plot (QQ plot) is probably one of the most underused and least appreciated plots in univariate analysis. It is used to compare two distributions across their full range of values. It is a generalization of the boxplot in that it does not limit the comparison to just the median and the upper and lower quartiles. Instead, it compares all values by matching each value in one batch to its corresponding quantile in the other batch. The two batches need not be the same size. If they differ, the larger batch is interpolated down to the smaller batch’s set of quantiles.
A QQ plot not only helps visualize the differences between distributions, it can also be used to model the relationship between the two batches. This is not to be confused with modeling the relationship in a bivariate dataset: a bivariate analysis pairs up the points by observational unit, whereas a QQ plot pairs up the values by their “matching quantiles”.
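The quantile matching described above can be sketched in base R. This is a minimal illustration and not the eda_qq implementation: the helper name match_quantiles is hypothetical, and the choice of quantile type 5 with probabilities (i - 0.5) / n is an assumption made so that the smaller batch's sorted values are reproduced exactly.

```r
# Hypothetical helper (not part of any package): pair two batches by quantile.
match_quantiles <- function(x, y, type = 5) {
  n <- min(length(x), length(y))
  # Probabilities tied to a batch of size n under type 5: (i - 0.5) / n
  p <- (seq_len(n) - 0.5) / n
  # The batch of size n returns its own sorted values; the larger batch
  # is interpolated down to the same n quantiles.
  data.frame(x = quantile(x, probs = p, type = type, names = FALSE),
             y = quantile(y, probs = p, type = type, names = FALSE))
}

set.seed(1)
small <- rnorm(10)
large <- rnorm(25)
qq <- match_quantiles(small, large)
nrow(qq)  # one row per quantile of the smaller batch
```

Plotting qq$y against qq$x would give the dots of an empirical QQ plot.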
Anatomy of the eda_qq plot
#> [1] "Suggested offsets:y = x * 1.4574 + (0.9914)"
- Each “dot” represents matching quantiles from each batch.
- The shaded boxes represent each batch’s interquartile range (mid 50% of values).
- The dashed lines inside the shaded boxes represent each batch’s median.
- The lightly shaded dashed dots represent each batch’s 12.5th and 87.5th percentiles (i.e. they show the ends of the mid 75% of values).
- The upper right-hand text indicates the power transformation applied to both batches (the default is a power of 1, which is the original measurement scale). If a formula is applied to one or both batches, it too will appear in the upper right-hand text.
- eda_qq will also output the suggested relationship between the y variable and the x variable. It bases this on each batch’s interquartile values.
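To make the last point concrete, here is one way a linear relationship could be derived from interquartile values: pass a line through the paired lower quartiles and the paired upper quartiles. This is an assumption about the general idea, not a description of what eda_qq does internally, and iqr_line is a hypothetical helper name.

```r
# Hypothetical helper: line through the paired lower and upper quartiles.
iqr_line <- function(x, y, q.type = 5) {
  qx <- quantile(x, c(0.25, 0.75), type = q.type, names = FALSE)
  qy <- quantile(y, c(0.25, 0.75), type = q.type, names = FALSE)
  slope <- (qy[2] - qy[1]) / (qx[2] - qx[1])
  intercept <- qy[1] - slope * qx[1]
  c(slope = slope, intercept = intercept)
}

# Toy batches with a known relationship: y = 2 * x + 3
x <- 1:20
y <- 2 * x + 3
fit <- iqr_line(x, y)
fit
```

Because only the quartiles are used, values in the tails have no influence on the resulting slope and intercept.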
An overview of some of the function arguments
Data type
The function accepts either a dataframe with a values column and a group column, or two separate vector objects. For example, to pass two separate vector objects, x and y, type:

eda_qq(x, y)

If the data are in a dataframe, type:

dat <- data.frame(val = c(x, y), cat = rep(c("x", "y"), each = 30))
eda_qq(dat, val, cat)
Suppressing plot
You can suppress the plot and have the x and y values output to a list. If the batches do not match in size, the output will show their interpolated values so that the output batches match in size. The output also includes the power parameter applied to both batches as well as any formula applied to one or both batches (fx is the formula applied to the x variable and fy is the formula applied to the y variable).

out <- eda_qq(x, y, plot = FALSE)
#> [1] "Suggested offsets:y = x * 1.1201 + (0.5618)"
out
#> $x
#> [1] -2.0207122 -1.6048333 -1.5620907 -1.5128732 -1.3126378 -1.1770882
#> [7] -1.0871906 -0.9258832 -0.8896555 -0.6152073 -0.3140113 -0.2996734
#> [13] -0.2954234 -0.2199849 -0.2108781 0.1202060 0.2608893 0.2680445
#> [19] 0.2910663 0.4239690 0.4262605 0.4301416 0.5176361 0.6085180
#> [25] 0.6880919 0.6929772 0.7640838 0.9037644 1.0124869 1.0503544
#>
#> $y
#> [1] -2.10710669 -1.30465821 -1.17618932 -1.17253191 -0.86423268 -0.40162859
#> [7] -0.37002087 -0.19629536 -0.07210822 -0.01829722 0.05826287 0.09105884
#> [13] 0.09371398 0.13698992 0.22318119 0.43006689 0.52597363 0.72665767
#> [19] 0.81351407 1.00612388 1.01831440 1.06713353 1.23708449 1.24360530
#> [25] 1.33232007 1.43973056 1.59125312 1.64852115 1.77464625 1.93823390
#>
#> $p
#> [1] 1
#>
#> $fx
#> NULL
#>
#> $fy
#> NULL
Setting the grey box and dashed line parameters
The grey box highlights the interquartile ranges for both batches. Its boundary can be modified via the b.val argument. Likewise, the lightly shaded dashed dots that highlight the mid 75% of values can be modified via the l.val argument.
For example, to highlight the mid 68% of values using the grey boxes and the mid 95% of values using the lightly shaded dashed dots, type:
You can suppress the plotting of the grey box and the lightly shaded dashed dots by setting q = FALSE. This does not affect the median dashed lines.
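The 68% and 95% ranges mentioned above translate into pairs of lower and upper probabilities. The sketch below derives them in base R. The helper mid_range is hypothetical, and the commented call assumes that b.val and l.val each accept such a pair, which the text above does not state explicitly.

```r
# Hypothetical helper: lower/upper probabilities bounding the middle `frac`.
mid_range <- function(frac) c((1 - frac) / 2, (1 + frac) / 2)

b <- mid_range(0.68)  # bounds for the mid 68% of values
l <- mid_range(0.95)  # bounds for the mid 95% of values

# Assumed usage (requires the package that provides eda_qq):
# eda_qq(x, y, b.val = b, l.val = l)
```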
Applying a formula to one of the batches
You can apply a formula to a batch via the fx argument for the x-variable and the fy argument for the y-variable. The formula is passed as a text string. For example, to add 0.5 to the x values, type:

eda_qq(x, y, fx = "x + 0.5")
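To see what passing a formula as a text string amounts to, the following base R sketch evaluates such a string with x bound to the batch values. The helper apply_formula is hypothetical and only illustrates the idea; how eda_qq parses the string internally is not covered here.

```r
# Hypothetical helper: evaluate a text formula with `x` bound to the batch.
apply_formula <- function(x, fx) {
  if (is.null(fx)) return(x)  # no formula: values are left untouched
  eval(parse(text = fx), envir = list(x = x), enclos = baseenv())
}

x <- c(1, 2, 3)
shifted <- apply_formula(x, "x + 0.5")              # additive offset
scaled  <- apply_formula(x, "x * 0.8501 + 0.3902")  # multiplicative and additive
```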
Quantile type
R offers nine different quantile algorithms. To see the full list of quantile types, refer to the quantile help page: ?quantile. By default, eda_qq() adopts q.type = 5. In general, the choice of quantile type matters little, especially for large datasets. If you want to adopt R’s default type, set q.type = 7.
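A quick base R comparison shows how types 5 and 7 can differ on a small batch and how the difference shrinks on a large one. The data here are made up for illustration.

```r
# Small batch: the two types give visibly different quartiles
x <- c(1, 2, 3, 4, 10)
probs <- c(0.25, 0.75)
q5 <- quantile(x, probs, type = 5, names = FALSE)
q7 <- quantile(x, probs, type = 7, names = FALSE)
rbind(type5 = q5, type7 = q7)

# Large batch: the two types are nearly indistinguishable
set.seed(1)
big <- rnorm(1e4)
gap <- abs(quantile(big, 0.25, type = 5, names = FALSE) -
           quantile(big, 0.25, type = 7, names = FALSE))
gap
```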
Point symbols
The point symbol type, color and size can be modified via the pch, p.col (and/or p.fill) and size arguments. The color can be specified either as a built-in color name (you can see the full list by typing colors()) or via the rgb() function. If you define the color using one of the built-in color names, you can adjust its transparency via the alpha argument. A value of 0 renders the point completely transparent and a value of 1 renders the point completely opaque. The point symbol can take on two color parameters depending on point type. If pch is any number between 21 and 25, p.fill will define its fill color and p.col will define its border color. For any other point symbol type, the p.fill argument is ignored.
Here are a few examples:

eda_qq(x, y, p.fill = "bisque", p.col = "red", size = 1.2)
eda_qq(x, y, pch = 16, p.col = "tomato2", size = 1.5, alpha = 0.5)
eda_qq(x, y, pch = 3, p.col = "tomato2", size = 1.5)
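For readers curious about what combining a built-in color name with a transparency value looks like in base R, adjustcolor() from grDevices offers a close analogue. That the alpha argument above behaves like alpha.f is an assumption made for this sketch.

```r
# Base R analogue of a named color with 50% transparency
col_half <- adjustcolor("tomato2", alpha.f = 0.5)

# Inspect the red, green, blue and alpha channels (0 to 255 scale)
channels <- col2rgb(col_half, alpha = TRUE)
channels
```

The red, green and blue channels are unchanged from "tomato2"; only the alpha channel is scaled.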
Interpreting a QQ plot
To help interpret the following QQ plots, we’ll compare each plot to matching kernel density plots. We’ll start with a simple example.
Power transformation

s1 <- subset(Indometh, Subject == 1, select = conc, drop = TRUE)
s2 <- subset(Indometh, Subject == 2, select = conc, drop = TRUE)
eda_qq(s1, s2, p = 1)
eda_qq(s1, s2, p = 0)
eda_qq(s1, s2, p = 0, fx = "x * 0.8501 + 0.3902")
eda_dens(s1, s2)
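The calls above vary the power parameter p. As a rough guide to what a power transformation does, the sketch below assumes the Box-Cox form, in which p = 1 leaves the shape of the data unchanged and p = 0 corresponds to a log transformation. Whether eda_qq uses this exact form is not stated above, and box_cox is a hypothetical helper.

```r
# Hypothetical helper: Box-Cox power transformation (an assumed form).
box_cox <- function(x, p = 1) {
  if (p == 0) log(x) else (x^p - 1) / p
}

x <- c(0.5, 1, 2, 4)
bc1 <- box_cox(x, p = 1)  # shifts values by -1, shape unchanged
bc0 <- box_cox(x, p = 0)  # log transformation
```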
The Tukey mean-difference plot
This is simply an extension of the QQ plot whereby the plot is rotated such that the 45° line (the 1:1 slope) becomes horizontal. This can be useful in helping identify subtle differences between batches. The plot is rotated 45° by mapping the difference between both batches to the y-axis, and mapping the mean of both batches to the x-axis. For example, the following figure on the left (the QQ plot) shows the additive offset between both batches but it fails to show the multiplicative offset. The latter can be clearly seen in the Tukey mean-difference plot (on the right), which can be invoked by setting the argument md = TRUE.
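The rotation described above can be reproduced by hand in base R. The sketch uses made-up matched quantiles with a small multiplicative offset, and it assumes the difference is taken as y minus x, since the direction is not specified above.

```r
# Matched quantiles with an additive (2) and a multiplicative (1.05) offset
x <- seq(10, 80, length.out = 30)
y <- 1.05 * x + 2

# Mean-difference coordinates: mean on the x-axis, difference on the y-axis
md <- data.frame(mean = (x + y) / 2, diff = y - x)

# A non-zero slope in diff vs. mean reveals the multiplicative offset
md_slope <- unname(coef(lm(diff ~ mean, data = md))[2])

if (interactive()) {
  plot(md$mean, md$diff, xlab = "Mean", ylab = "Difference (y - x)")
  abline(h = 0, lty = 2)
}
```

On a QQ plot a slope of 1.05 is hard to tell apart from the 1:1 line, whereas here the points trend visibly away from a horizontal line.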
A working example

old <- wat95$avg # legacy temperature normals
new <- wat05$avg # current temperature normals
out <- eda_qq(old, new)
out <- eda_qq(old, new, md = TRUE)
labs <- c("low", "mid", "high")
out$avg <- (out$x + out$y) / 2
out <- as.data.frame(out[c(1:2, 6)]) # keep x, y and avg
out2 <- split(out, cut(out$avg, c(min(out$avg), 35, 50, max(out$avg)),
                       labels = labs, include.lowest = TRUE))
sapply(labs, FUN = \(x) {eda_qq(out2[[x]]$x, out2[[x]]$y,
                                xlab = "old", ylab = "new", md = TRUE)
                         title(x, line = 3, col.main = "orange")})
#> [1] "Suggested offsets:y = x * 0.9506 + (1.661)"
#> [1] "Suggested offsets:y = x * 1.0469 + (-1.6469)"
#> [1] "Suggested offsets:y = x * 0.9938 + (0.9268)"
What not to do:

labs <- c("low", "mid", "high")
old.split <- split(old, cut(old, c(min(old), 35, 50, max(old)),
                            labels = labs, include.lowest = TRUE))
new.split <- split(new, cut(new, c(min(new), 35, 50, max(new)),
                            labels = labs, include.lowest = TRUE))
sapply(labs, FUN = \(x) {eda_qq(old.split[[x]], new.split[[x]],
                                xlab = "old", ylab = "new", md = TRUE)
                         title(x, line = 3, col.main = "orange")})
xform <- c("x * 0.9636 + 1.4411",
           "x * 0.9838 + 0.7478",
           "x * 1.0365 - 2.0333")
names(xform) <- labs
sapply(labs, FUN = \(x) {eda_qq(old.split[[x]], new.split[[x]],
                                fx = xform[x],
                                xlab = "old", ylab = "new", md = TRUE)
                         title(x, line = 3, col.main = "orange")})
The differences in normal temperatures between the old and new sets of normals can be formalized as follows:
\[
new = \begin{cases}
old \times 0.9636 + 1.4411, & T_{avg} < 35 \\
old \times 0.9838 + 0.7478, & 35 \le T_{avg} < 50 \\
old \times 1.0365 - 2.0333, & T_{avg} \ge 50
\end{cases}
\]
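The piecewise relationship above can be written as a small R function. This is only a sketch: predict_new is a hypothetical name, and the average temperature T_avg is passed in explicitly as t_avg because it depends on both the old and new values.

```r
# Hypothetical helper: apply the piecewise old-to-new relationship.
# `t_avg` is the average temperature that selects the segment.
predict_new <- function(old, t_avg) {
  ifelse(t_avg < 35, old * 0.9636 + 1.4411,
  ifelse(t_avg < 50, old * 0.9838 + 0.7478,
                     old * 1.0365 - 2.0333))
}

# One value from each of the low, mid and high segments
est <- predict_new(old = c(30, 40, 60), t_avg = c(30, 40, 60))
est
```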
The key takeaways from this analysis can be summarized as follows: