This vignette demonstrates the use of the pengls package for high-dimensional data with spatial or temporal autocorrelation. It consists of an iterative loop around the nlme and glmnet packages. Currently, only continuous outcomes and \(R^2\) and MSE as performance measure are implemented.
The pengls package is available from BioConductor, and can be installed as follows:
library(BiocManager)
install("pengls")
Once installed, it can be loaded and version info printed.
suppressPackageStartupMessages(library(pengls))
cat("pengls package version", as.character(packageVersion("pengls")), "\n")
## pengls package version 1.4.0
We first create a toy dataset with spatial coordinates.
library(nlme)
<- 25 #Sample size
n <- 50 #Number of features
p <- 15 #Size of the grid
g #Generate grid
<- expand.grid("x" = seq_len(g), "y" = seq_len(g))
Grid # Sample points from grid without replacement
<- Grid[sample(nrow(Grid), n, replace = FALSE),]
GridSample #Generate outcome and regressors
<- matrix(rnorm(p*n), n , p)
b <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
a #Compile to a matrix
<- data.frame("a" = a, "b" = b, GridSample) df
The pengls method requires prespecification of a functional form for the autocorrelation. This is done through the corStruct objects defined by the nlme package. We specify a correlation decaying as a Gaussian curve with distance, and with a nugget parameter. The nugget parameter is a proportion that indicates how much of the correlation structure explained by independent errors; the rest is attributed to spatial autocorrelation. The starting values are chosen as reasonable guesses; they will be overwritten in the fitting process.
# Define the correlation structure (see ?nlme::gls), with initial nugget 0.5 and range 5
<- corGaus(form = ~ x + y, nugget = TRUE, value = c("range" = 5, "nugget" = 0.5)) corStruct
Finally the model is fitted with a single outcome variable and large number of regressors, with the chosen covariance structure and for a prespecified penalty parameter \(\lambda=0.2\).
#Fit the pengls model, for simplicity for a simple lambda
<- pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE), glsSt = corStruct, lambda = 0.2, verbose = TRUE) penglsFit
## Starting iterations...
## Iteration 1
## Iteration 2
## Iteration 3
## Iteration 4
## Iteration 5
## Iteration 6
## Iteration 7
## Iteration 8
## Iteration 9
## Iteration 10
## Iteration 11
## Iteration 12
## Iteration 13
## Iteration 14
## Iteration 15
## Iteration 16
## Iteration 17
## Iteration 18
## Iteration 19
## Iteration 20
## Iteration 21
## Iteration 22
## Iteration 23
## Iteration 24
## Iteration 25
## Iteration 26
Standard extraction functions like print(), coef() and predict() are defined for the new “pengls” object.
penglsFit
## pengls model with correlation structure: corGaus
## and 18 non-zero coefficients
<- coef(penglsFit)
penglsCoef <- predict(penglsFit) penglsPred
The method can also account for temporal autocorrelation by defining another correlation structure from the nlme package, e.g. autocorrelation structure of order 1:
set.seed(354509)
<- 100 #Sample size
n <- 10 #Number of features
p #Generate outcome and regressors
<- matrix(rnorm(p*n), n , p)
b <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
a #Compile to a matrix
<- data.frame("a" = a, "b" = b, "t" = seq_len(n))
dfTime <- corAR1(form = ~ t, value = 0.5) corStructTime
The fitting command is similar, this time the \(\lambda\) parameter is found through cross-validation of the naive glmnet (for full cross-validation , see below). We choose \(\alpha=0.5\) this time, fitting an elastic net model.
<- pengls(data = dfTime, outVar = "a", verbose = TRUE,
penglsFitTime xNames = grep(names(dfTime), pattern = "b", value =TRUE),
glsSt = corStructTime, nfolds = 5, alpha = 0.5)
## Fitting naieve model...
## Starting iterations...
## Iteration 1
## Iteration 2
Show the output
penglsFitTime
## pengls model with correlation structure: corAR1
## and 2 non-zero coefficients
The pengls package also provides cross-validation for finding the optimal \(\lambda\) value. If the tuning parameter \(\lambda\) is not supplied, the optimal \(\lambda\) according to cross-validation with the naive glmnet function (the one that ignores dependence) is used. Hence we recommend to use the following function to use cross-validation. Multithreading is supported through the BiocParallel package :
library(BiocParallel)
register(MulticoreParam(3)) #Prepare multithereading
<- 3 #Number of cross-validation folds nfolds
The function is called similarly to cv.glmnet:
<- cv.pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE), glsSt = corStruct, nfolds = nfolds) penglsFitCV
Check the result:
penglsFitCV
## Cross-validated pengls model with correlation structure: corGaus
## and 19 non-zero coefficients.
## 3 fold cross-validation yielded an estimated R2 of -0.7773331 .
By default, the 1 standard error is used to determine the optimal value of \(\lambda\) :
$lambda.1se #Lambda for 1 standard error rule penglsFitCV
## [1] 0.1607606
$cvOpt #Corresponding R2 penglsFitCV
## [1] -0.7773331
Extract coefficients and fold IDs.
head(coef(penglsFitCV))
## [1] 0.765596399 0.000000000 0.002013268 0.000000000 -0.821955840
## [6] -0.204677170
$foldid #The folds used penglsFitCV
## 80 106 149 171 94 89 181 170 83 10 18 225 59 224 98 49 33 3 46 128
## 1 3 2 3 3 2 3 2 1 1 1 2 1 2 1 1 1 1 3 2
## 221 1 217 165 222
## 2 1 2 2 2
By default, blocked cross-validation is used, but random cross-validation is also available (but not recommended for timecourse or spatial data). First we illustrate the different ways graphically, again using the timecourse example:
set.seed(5657)
<- makeFolds(nfolds = nfolds, dfTime, "random", "t")
randomFolds <- makeFolds(nfolds = nfolds, dfTime, "blocked", "t")
blockedFolds plot(dfTime$t, randomFolds, xlab ="Time", ylab ="Fold")
points(dfTime$t, blockedFolds, col = "red")
legend("topleft", legend = c("random", "blocked"), pch = 1, col = c("black", "red"))
To perform random cross-validation
<- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, cvType = "random") penglsFitCVtime
To negate baseline differences at different timepoints, it may be useful to center or scale the outcomes in the cross validation. For instance for centering only:
<- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, cvType = "blocked", transFun = function(x) x-mean(x))
penglsFitCVtimeCenter $cvOpt #Better performance penglsFitCVtimeCenter
## [1] 0.9949131
Alternatively, the mean squared error (MSE) can be used as loss function, rather than the default \(R^2\):
<- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, loss = "MSE") penglsFitCVtime
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.0
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.32.1 nlme_3.1-158 pengls_1.4.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 knitr_1.39 magrittr_2.0.3 splines_4.2.1
## [5] lattice_0.20-45 R6_2.5.1 rlang_1.0.4 fastmap_1.1.0
## [9] foreach_1.5.2 highr_0.9 stringr_1.4.0 tools_4.2.1
## [13] parallel_4.2.1 grid_4.2.1 glmnet_4.1-4 xfun_0.31
## [17] cli_3.3.0 jquerylib_0.1.4 htmltools_0.5.2 iterators_1.0.14
## [21] survival_3.3-1 yaml_2.3.5 digest_0.6.29 Matrix_1.4-1
## [25] sass_0.4.1 codetools_0.2-18 shape_1.4.6 evaluate_0.15
## [29] rmarkdown_2.14 stringi_1.7.8 compiler_4.2.1 bslib_0.3.1
## [33] jsonlite_1.8.0