Vignette of the pengls package

Stijn Hawinkel

1 Introduction

This vignette demonstrates the use of the pengls package for high-dimensional data with spatial or temporal autocorrelation. The method consists of an iterative loop around the nlme and glmnet packages. Currently, only continuous outcomes are supported, with \(R^2\) as the only implemented performance measure.

2 Installation instructions

The pengls package is available from Bioconductor, and can be installed as follows:

library(BiocManager)
install("pengls")

Once installed, it can be loaded and version info printed.

suppressPackageStartupMessages(library(pengls))
cat("pengls package version", as.character(packageVersion("pengls")), "\n")
## pengls package version 0.99.7

3 Illustration

3.1 Spatial autocorrelation

We first create a toy dataset with spatial coordinates.

library(nlme)
n <- 75 #Sample size
p <- 100 #Number of features
g <- 10 #Size of the grid
#Generate grid
Grid <- expand.grid("x" = seq_len(g), "y" = seq_len(g))
# Sample points from grid without replacement
GridSample <- Grid[sample(nrow(Grid), n, replace = FALSE),]
#Generate outcome and regressors
b <- matrix(rnorm(p*n), n , p)
a <- rnorm(n, mean = b %*% rbinom(p, size = 1, prob = 0.25), sd = 0.1) #25% signal
#Compile to a data frame
df <- data.frame("a" = a, "b" = b, GridSample)

The pengls method requires prespecification of a functional form for the autocorrelation. This is done through the corStruct objects defined by the nlme package. We specify a correlation decaying as a Gaussian curve with distance, and with a nugget parameter. The nugget parameter is a proportion that indicates how much of the correlation structure is explained by independent errors; the rest is attributed to spatial autocorrelation. The starting values are chosen as reasonable guesses; they will be overwritten in the fitting process.

# Define the correlation structure (see ?nlme::gls), with initial nugget 0.5 and range 5
corStruct <- corGaus(form = ~ x + y, nugget = TRUE, value = c("range" = 5, "nugget" = 0.5))
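
To make the roles of the range and the nugget more concrete, the sketch below evaluates a Gaussian correlation function by hand. It assumes the parametrisation described in the nlme documentation for corGaus, namely a correlation of (1 - nugget) * exp(-(d/range)^2) at distance d > 0 and of 1 at distance 0. The helper gausCor is a hypothetical name introduced only for this illustration; it is not part of nlme or pengls.

```r
# Hypothetical helper (not part of nlme or pengls):
# Gaussian correlation as a function of distance, with a nugget effect
gausCor <- function(d, range, nugget) {
  # Correlation is 1 at distance zero and at most (1 - nugget) elsewhere
  ifelse(d == 0, 1, (1 - nugget) * exp(-(d / range)^2))
}
# Correlation at a few distances, for the starting values used above
dists <- c(0, 1, 5, 10)
round(gausCor(dists, range = 5, nugget = 0.5), 3)
## [1] 1.000 0.480 0.184 0.009
```

With a nugget of 0.5, even very close neighbours have a correlation of at most 0.5, and the correlation has almost vanished at twice the range.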

Finally, the model is fitted with a single outcome variable and a large number of regressors, using the chosen correlation structure and a prespecified penalty parameter \(\lambda=0.2\).

#Fit the pengls model, for simplicity with a single lambda
penglsFit <- pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value = TRUE),
    glsSt = corStruct, lambda = 0.2, verbose = TRUE)
## Starting iterations...
## Iteration 1 
## Iteration 2 
## Iteration 3
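
The iterations reported here belong to the loop around nlme and glmnet. One way such a loop can combine the two steps is to decorrelate ("whiten") the data with the Cholesky factor of the current correlation matrix estimate before each penalised fit. The base R sketch below illustrates only this whitening step on a toy grid. It is an assumption about the general approach, not a transcript of the package internals, and the correlation parameters are fixed by hand rather than estimated.

```r
set.seed(1)
# Toy 4 x 4 grid with a Gaussian spatial correlation and a nugget (values assumed)
coords <- expand.grid(x = 1:4, y = 1:4)
d <- as.matrix(dist(coords))                 # pairwise Euclidean distances
C <- 0.5 * exp(-(d / 5)^2)                   # (1 - nugget) * Gaussian decay
diag(C) <- 1                                 # correlation 1 on the diagonal
R <- chol(C)                                 # upper triangular, t(R) %*% R equals C
# Simulate errors with correlation C, then decorrelate them again
e <- drop(t(R) %*% rnorm(nrow(C)))
eWhite <- backsolve(R, e, transpose = TRUE)  # solves t(R) %*% z = e
# The whitening matrix maps the correlation matrix C to the identity
Winv <- backsolve(R, diag(nrow(C)), transpose = TRUE)  # equals solve(t(R))
max(abs(Winv %*% C %*% t(Winv) - diag(nrow(C))))       # numerically zero
```

Applying the same transformation to the outcome and the design matrix yields data with approximately independent errors, on which an ordinary penalised regression can then be fitted.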

Standard extraction functions like print(), coef() and predict() are defined for the new “pengls” object.

penglsFit
## pengls model with correlation structure: corGaus 
##  and 40 non-zero coefficients
penglsCoef <- coef(penglsFit)
penglsPred <- predict(penglsFit)

3.2 Temporal autocorrelation

The method can also account for temporal autocorrelation by defining another correlation structure from the nlme package, e.g. an autoregressive structure of order 1:

dfTime <- data.frame("a" = a, "b" = b, "t" = seq_len(n))
corStructTime <- corAR1(form = ~ t, value = 0.5)
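
Under an autoregressive structure of order 1 with parameter phi, the correlation between two observations k time steps apart is phi^k. As a small illustration (the object names are chosen here and appear nowhere in the package), the implied correlation matrix for five consecutive time points and the starting value 0.5 can be written out in base R:

```r
phi <- 0.5                             # starting value used above
tt <- 1:5                              # five consecutive time points
ar1Mat <- phi^abs(outer(tt, tt, "-"))  # correlation phi^|t_i - t_j|
ar1Mat
```

Neighbouring time points have correlation 0.5, and the correlation halves with every additional time step.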

The fitting command is similar, but this time the \(\lambda\) parameter is found through cross-validation of the naive glmnet (for full cross-validation, see below). We choose \(\alpha=0.5\) this time, fitting an elastic net model.

penglsFitTime <- pengls(data = dfTime, outVar = "a", verbose = TRUE,
    xNames = grep(names(dfTime), pattern = "b", value = TRUE),
    glsSt = corStructTime, nfolds = 5, alpha = 0.5)
## Fitting naieve model...
## Starting iterations...
## Iteration 1 
## Iteration 2 
## Iteration 3

Show the output:

penglsFitTime
## pengls model with correlation structure: corAR1 
##  and 50 non-zero coefficients

3.3 Penalty parameter and cross-validation

The pengls package also provides cross-validation for finding the optimal \(\lambda\) value. If the tuning parameter \(\lambda\) is not supplied to pengls(), the optimal \(\lambda\) according to cross-validation with the naive glmnet function (the one that ignores dependence) is used. Because that choice ignores the autocorrelation, we recommend using the dedicated cross-validation function cv.pengls() instead. Multithreading is supported through the BiocParallel package:

library(BiocParallel)
register(MulticoreParam(3)) #Prepare multithreading
nfolds <- 10 #Number of cross-validation folds

The function is called similarly to cv.glmnet:

penglsFitCV <- cv.pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value = TRUE),
    glsSt = corStruct, nfolds = nfolds)

Check the result:

penglsFitCV
## Cross-validated pengls model with correlation structure: corGaus 
##  and 50 non-zero coefficients.
##  10 fold cross-validation yielded an estimated R2 of -0.1683754 .
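
A negative cross-validated \(R^2\) may look surprising. Assuming the usual out-of-sample definition, one minus the ratio of the prediction error sum of squares to the total sum of squares (the package may compute it slightly differently), \(R^2\) drops below zero whenever the model predicts worse than the mean of the observed outcomes. The helper r2 below is a hypothetical name used only for this sketch.

```r
# Hypothetical helper: R2 under the usual definition (an assumption)
r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
obs <- c(1, 2, 3, 4, 5)
r2(obs, pred = c(1.1, 1.9, 3.2, 3.8, 5.1))  # close predictions: R2 near 1
r2(obs, pred = rep(mean(obs), 5))           # predicting the mean: R2 of 0
r2(obs, pred = c(5, 4, 3, 2, 1))            # worse than the mean: negative R2
```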

By default, the 1 standard error rule is used to determine the optimal value of \(\lambda\):

penglsFitCV$lambda.1se #Lambda for 1 standard error rule
## [1] 0.02722127
penglsFitCV$cvOpt #Corresponding R2
## [1] -0.1683754
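
As an illustration of the 1 standard error rule, the sketch below applies it to made-up cross-validation results. It assumes the convention known from glmnet, adapted to a measure where higher is better: among all \(\lambda\) values whose mean \(R^2\) lies within one standard error of the best one, the largest (most penalised) is chosen. The numbers are invented for the example and whether pengls follows exactly this convention is an assumption.

```r
# Made-up cross-validation summary for five lambda values (illustration only)
lambdas <- c(0.01, 0.05, 0.10, 0.20, 0.40)
cvR2    <- c(0.60, 0.66, 0.64, 0.61, 0.40)  # mean R2 over the folds
cvSE    <- c(0.04, 0.03, 0.03, 0.04, 0.05)  # standard error over the folds
best      <- which.max(cvR2)                 # lambda with the highest mean R2
threshold <- cvR2[best] - cvSE[best]         # one standard error below the best
# Largest lambda whose mean R2 is still within one standard error of the best
lambda1se <- max(lambdas[cvR2 >= threshold])
lambda1se
## [1] 0.1
```

The rule trades a small loss in estimated performance for a sparser, more strongly regularised model.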

Extract coefficients and fold IDs.

head(coef(penglsFitCV))
## [1]  0.01605160  0.00000000  0.00000000 -0.03678754  0.91237366  0.00000000
penglsFitCV$foldid #The folds used
##  89  20  93   9   1  25  55  57  66  35  50  13  71  18  36  77  63   7  12  37 
##   5   5   2  10   9  10  10   5   3   3   1   8   7  10  10   2   4   9   8   5 
##  22  47  72  51  94  53  16  44  60   4   8  26  11  23  21  96  73  64  87  52 
##   8  10   4   3   2  10   9   7   5   8  10   9   8   8   8   7   2   3   6   7 
##  34  65  33  80  84  14 100  68  48  86  49  75  69  82  76  59  79  70  74  41 
##   8   3  10   5   3   8   5   5   9   3  10   2   6   4   7   6   1   6   2   3 
##  81  45  67  78  91  99  28  61  31  92  83  54  88  43  90 
##   4   5   1   6   4   6   9   7   4   7   2   3   1   3   6

By default, blocked cross-validation is used, but random cross-validation is also available (though not recommended for time course or spatial data). First we illustrate the two schemes graphically, again using the time course example:

set.seed(5657)
randomFolds <- makeFolds(nfolds = nfolds, dfTime, "random", "t")
blockedFolds <- makeFolds(nfolds = nfolds, dfTime, "blocked", "t")
plot(dfTime$t, randomFolds, xlab ="Time", ylab ="Fold")
points(dfTime$t, blockedFolds, col = "red")
legend("topleft", legend = c("random", "blocked"), pch = 1, col = c("black", "red"))
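
The difference between the two schemes can also be sketched in a few lines of base R. This is not the implementation of makeFolds, only an illustration of the idea for a single time axis: random folds scatter the fold labels over the observations, whereas blocked folds assign contiguous stretches of time to the same fold. All object names are chosen for this example.

```r
set.seed(1)
nObs <- 20                          # number of time points (toy value)
k <- 4                              # number of folds (toy value)
timePoints <- seq_len(nObs)
# Random folds: shuffle balanced fold labels over the observations
randomId <- sample(rep(seq_len(k), length.out = nObs))
# Blocked folds: cut the time axis into k contiguous blocks of equal width
blockedId <- cut(timePoints, breaks = k, labels = FALSE)
table(randomId)                     # both schemes yield folds of size 5
table(blockedId)
blockedId                           # fold labels never decrease over time
```

With blocked folds, a left-out observation has fewer strongly correlated neighbours in the training set, which is why this scheme is preferred for autocorrelated data.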

To perform random cross-validation:

penglsFitCVtime <- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value = TRUE),
    glsSt = corStructTime, nfolds = nfolds, cvType = "random")

To negate baseline differences at different timepoints, it may be useful to center or scale the outcomes in the cross-validation. For instance, for centering only:

penglsFitCVtimeCenter <- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value = TRUE),
    glsSt = corStructTime, nfolds = nfolds, cvType = "blocked", transFun = function(x) x - mean(x))
penglsFitCVtimeCenter$cvOpt #Better performance
## [1] 0.9213925
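
A transformation that both centers and scales can be written in the same style. The function centerScale below is a hypothetical example, and passing it as transFun rests on the assumption that this argument accepts any function mapping a numeric vector to a vector of the same length, as the centering function above does.

```r
# Hypothetical transformation: center and scale to unit standard deviation
centerScale <- function(x) (x - mean(x)) / sd(x)
z <- centerScale(c(2, 4, 6, 8))
mean(z)  # approximately 0 after centering
sd(z)    # 1 after scaling
```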

4 Session info

sessionInfo()
## R version 4.1.1 Patched (2021-08-22 r80813)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocParallel_1.27.17 nlme_3.1-153         pengls_0.99.7       
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.36       magrittr_2.0.1   splines_4.1.1    lattice_0.20-45 
##  [5] R6_2.5.1         rlang_0.4.12     fastmap_1.1.0    foreach_1.5.1   
##  [9] highr_0.9        stringr_1.4.0    tools_4.1.1      parallel_4.1.1  
## [13] grid_4.1.1       glmnet_4.1-2     xfun_0.27        jquerylib_0.1.4 
## [17] htmltools_0.5.2  iterators_1.0.13 yaml_2.2.1       survival_3.2-13 
## [21] digest_0.6.28    Matrix_1.3-4     sass_0.4.0       codetools_0.2-18
## [25] shape_1.4.6      evaluate_0.14    rmarkdown_2.11   stringi_1.7.5   
## [29] compiler_4.1.1   bslib_0.3.1      jsonlite_1.7.2