Gin is a collection of pure R tools for generating and manipulating Gaussian Process (GP) models. It is also a nice spirit.
The current version includes the top-level functions:

- gp_sim - generate N pseudo-random GP realisations.
- gp_conditional - compute the mean and covariance of a GP given data points.
- gp_fit - fit for the (hyper-)parameters of the autocovariance function (ACV).
- plot_ts - plot a time series data.frame.
- plot_snake - plot a 'snake' representing the mean and std.dev of a GP.
- ts_load - load a plain text data file into an array ready for analysis.
And some lower-level functions that do the heavy lifting:

- gp_logLikelihood - compute the Gaussian log likelihood given data and model.
- gp_logPosterior - compute the log posterior, given log likelihood and prior.
- gp_lagMatrix - compute the matrix of lags tau_ij = |t_i - t_j| (see the sketch below).
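As an aside, the lag matrix computed by gp_lagMatrix is simple to picture. Here is a minimal sketch using base R's outer (an illustration of the idea, not gin's actual implementation):

```R
# lag matrix tau_ij = |t_i - t_j| for a small vector of times
# (t.demo is a throwaway example vector, not the t used later)
t.demo <- c(0.0, 1.0, 3.0)
abs(outer(t.demo, t.demo, "-"))
#      [,1] [,2] [,3]
# [1,]    0    1    3
# [2,]    1    0    2
# [3,]    3    2    0
```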
Gin is an R package, but is still in development. To install it from GitHub, first install (if you haven't already) Hadley Wickham's devtools package:
```R
install.packages("devtools")
```
Now you can install gin straight from GitHub:
```R
devtools::install_github("svdataman/gin")
```
Now you're good to go.
First we must define an ACV function. This is the exponential ACV, acv(tau) = A * exp(-|tau| / l), for a Damped Random Walk (DRW) process:
```R
# define an ACV function
acv <- function(theta, tau) {
  A <- abs(theta[1])  # amplitude
  l <- abs(theta[2])  # decay time-scale
  acov <- A * exp(-(tau / l))
  return(acov)
}
```
This accepts a vector of parameters (theta) and a matrix or vector of lags (tau), and returns the ACV value at each lag.
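For example, a quick sanity check with illustrative values A = 1 and l = 2:

```R
# evaluate the ACV at a few lags, with A = 1 and l = 2
acv(c(1.0, 2.0), tau = c(0, 1, 2))
# [1] 1.0000000 0.6065307 0.3678794
```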
Now we define some initial parameters. Note the convention: the first two elements of theta are the process mean (mu) and an error scale factor (nu), which gin handles itself; only the remaining elements (theta[3:p]) are passed on to the ACV function. That is why acv above indexes its amplitude and time-scale as theta[1] and theta[2].
```R
# define parameters of ACV
# theta[1] = mu (mean)
# theta[2] = nu (error scale factor)
# theta[3:p] = parameters of ACV
theta <- c(0.0, 1.0, 1.0, 1.0)
```
Next, we define a vector of times at which to compute the simulated data:
```R
# define vector of times for reconstruction
m <- 1000
t <- seq(-0.5, 10.5, length = m)
```
Using the ACV model and parameters (theta) as above, and the grid of times (t), we can generate a random realisation of the GP:
```R
# produce Gaussian vector (simulation)
y <- gin::gp_sim(theta, acv.model = acv, t.star = t)
# gp_sim returns a matrix (one column per realisation); flatten to a vector
y <- as.vector(y)
```
```R
# plot the 'true' curve
dat <- data.frame(t = t, y = y)
gin::plot_ts(dat, type = "l")
```
The package comes with a dataset called drw, containing 40 observations of a random process. The drw data.frame has 3 columns: t (time), y (value), dy (error).
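To load the dataset and take a quick look at its structure (data() is base R's standard mechanism for bundled datasets; if gin lazy-loads its data, gin::drw works too):

```R
# load the bundled dataset and inspect the first few rows
data("drw", package = "gin")
head(drw)  # columns: t (time), y (value), dy (error)
```

Let's plot the data: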
```R
gin::plot_ts(drw, col = "black", cex.lab = 1.4)
```
Using the ACV model above, and the values for theta as our starting guess, we fit:
```R
# fit the model to the data, find Max.Like parameter values
result <- gin::gp_fit(theta, acv.model = acv, dat = drw)
```
This should find the maximum likelihood solution. The best-fitting ACV parameters (four of them in this case) are stored in result$par.
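For instance, to label and inspect the estimates (names as in the parameter convention above):

```R
# attach the parameter names and print the maximum likelihood estimates
setNames(result$par, c("mu", "nu", "A", "l"))
```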
We can use the gp_conditional function to compute the mean and covariance of the GP at the simulation times t (above), with the specified ACV model, conditional on the given data.
```R
# reconstruct process: compute 'conditional' mean and covariance
gp <- gin::gp_conditional(result$par, acv.model = acv, dat = drw, t.star = t)
```
Now we can overlay the conditional model:
```R
# plot a 'snake' showing mean +/- std.dev
gin::plot_snake(gp, add = TRUE, col.line = 3)
```
We can also add pseudo-random realisations of this process (conditional on the data):
```R
# draw 5 realisations conditional on the data, then overlay them
y.sim <- gin::gp_sim(result$par, dat = drw, acv.model = acv, t.star = t,
                     N.sim = 5, plot = FALSE)
for (i in 1:5) lines(t, y.sim[, i], col = i)
```
Rather than maximum likelihood, we could specify priors on the (hyper-)parameters of the ACV and perform Bayesian inference.
```R
# define the 'priors' for the parameter values
logPrior <- function(theta) {
  mu.d <- dnorm(theta[1], sd = 5, log = TRUE)
  nu.d <- dlnorm(theta[2], meanlog = 0, sdlog = 1, log = TRUE)
  A.d <- dlnorm(theta[3], meanlog = 0, sdlog = 1, log = TRUE)
  l.d <- dlnorm(theta[4], meanlog = 0, sdlog = 1, log = TRUE)
  return(mu.d + nu.d + A.d + l.d)
}
```
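A quick sanity check that the prior density is finite at the starting guess (the value follows directly from the densities chosen above):

```R
# log prior density at the initial parameter guess
logPrior(theta)
# [1] -5.285192
```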
In general, to do this we need an MCMC tool to sample from the posterior. Here I use the GW (Goodman-Weare ensemble) sampler from the tonic package:
```R
# use gw_sampler to generate parameter samples
chain <- tonic::gw_sampler(gin::gp_logPosterior, theta.0 = theta,
                           acv.model = acv, logPrior = logPrior,
                           dat = drw, burn.in = 1e4,
                           nsteps = 40e4,
                           chatter = 1, thin = 10)
```
This may take a few minutes to run. It should produce 40,000 samples after 'thinning' by a factor of 10 (it generates 400,000 samples but keeps only 1 in 10). First, we inspect the traces and autocorrelation of the chains:
```R
# plot MCMC diagnostics
tonic::chain_diagnosis(chain)
```
Now we can visualise the posterior by plotting the one- and two-parameter marginal distributions.
```R
# name the parameters
colnames(chain$theta) <- c("mu", "nu", "A", "l")

# plot scatter diagrams
tonic::contour_matrix(chain$theta, prob.levels = c(1, 2, 3), sigma = TRUE,
                      prob1d = 0)
```
The contours represent the 1-, 2- and 3-sigma levels (in the sense of enclosing 68.3%, 95.4% and 99.7% of the probability mass).
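Numerical summaries come straight from the sample matrix using base R; for example:

```R
# posterior medians and central 90% intervals
# (rows: quantiles; columns: parameters)
apply(chain$theta, 2, quantile, probs = c(0.05, 0.50, 0.95))
```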
Planned improvements include:

- Include a range of ready-to-use ACV functions
- Compute ACFs based on the Fourier transform of a specified power spectrum
- Allow for time binning (not just point samples in time)
- Allow for lognormal processes
- Use matrix rather than data.frame format for data (speed)
For more on GPs, the best reference is:

- C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, [online here](http://www.gaussianprocess.org/gpml/)
Excellent online resources include:

- [A Visual Exploration of Gaussian Processes](https://distill.pub/2019/visual-exploration-gaussian-processes/)