Contents

1 Overview

1.1 Motivation

The identification of reproducible biological patterns from high-dimensional omics data is a key factor in understanding the biology of complex disease or traits. Incorporating prior biological knowledge into machine learning is an important step in advancing such research.

1.2 Deliverables

We have implemented a biologically informed multi-stage machine learing framework termed BioMM [1] specifically for phenotype prediction using omics-scale data based on biological prior information, for example, gene ontological pathways.

Features of BioMM in a nutshell:

  1. Applicability for various omics data modalities.
  2. Prioritizing outcome-associated functional patterns.
  3. End-to-end prediction at the individual level based on biological pathway stratification.
  4. Possibility for extension to machine learning models of interest.
  5. Parallel computing.

2 Getting started

2.1 Installation

  • Development version from Github (R 3.5):
install.packages("devtools")
library("devtools")
install_github("transbioZI/BioMM")
  • Install BioMM from Bioconductor (R 3.6):
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# The following initializes usage of Bioc devel
BiocManager::install(version='devel')
BiocManager::install("BioMM")

Note that, two different parallel computing strategies are implemented: For BioMM package available on Github, the parallel R package is implemented. The BiocParallel R package is employed in Bioconductor BioMM package which allows the use of parallel computation across any system.

  • Load required libraries
library(BioMM)
library(BiocParallel)
library(parallel)
library(ranger)
library(rms)
library(glmnet)
library(e1071)
library(pROC)
library(vioplot)
library(variancePartition)
library(CMplot)

3 Omics data

A wide range of genome-wide omics data is supported for the use of BioMM including whole-genome DNA methylation, gene expression and genome-wide SNP data. Other types of omics data that can map into pathways are also encouraging.
For better understanding of the framework, we used a preprocessed genome-wide DNA methylation data with 26486 CpGs and 40 samples consisting of 20 controls and 20 patients. (0: healthy control and 1: patient) for demonstration.

## Get DNA methylation data 
studyData <- readRDS(system.file("extdata", "/methylData.rds", 
                     package="BioMM"))
head(studyData[,1:5])
##           label cg00000292 cg00002426 cg00003994 cg00005847
## GSM951223     1     0.0274     0.0029    -0.0027    -0.0196
## GSM951231     1     0.0644     0.0181    -0.0088     0.0057
## GSM951249     1    -0.0304    -0.0013    -0.0083    -0.0116
## GSM951273     1     0.0252     0.0039    -0.0091     0.0030
## GSM951214     1     0.0289    -0.0011    -0.0129     0.0173
## GSM951270     1     0.0635     0.0329     0.0184    -0.0023
dim(studyData)
## [1]    40 26487

4 Feature stratification

Features like CpGs, genes or SNPs can be mapped into pathways based on genomic location and pathway annotation, as implemented in the function omics2pathlist(). The examples of pathway databases are gene ontology (GO), Reactome and KEGG, which are widely used public pathway repositories. Gene ontological pathway is used in this tutorial.

## Load annotation data
featureAnno <- readRDS(system.file("extdata", "cpgAnno.rds", package="BioMM")) 
pathlistDB <- readRDS(system.file("extdata", "goDB.rds", package="BioMM")) 
head(featureAnno)
##           ID chr entrezID symbol
## 1 cg00000292  16      487 ATP2A1
## 2 cg00002426   3     7871  SLMAP
## 3 cg00003994   7     4223  MEOX2
## 4 cg00005847   2     3232  HOXD3
## 5 cg00006414   7    57541 ZNF398
## 6 cg00007981  11    24145  PANX1
str(pathlistDB[1:3])
## List of 3
##  $ GO:0000002: Named chr [1:12] "291" "1890" "4205" "4358" ...
##   ..- attr(*, "names")= chr [1:12] "TAS" "IMP" "ISS" "IMP" ...
##  $ GO:0000012: Named chr [1:11] "3981" "7141" "7515" "23411" ...
##   ..- attr(*, "names")= chr [1:11] "IDA" "IDA" "IEA" "IMP" ...
##  $ GO:0000027: Named chr [1:31] "4839" "6122" "6123" "6125" ...
##   ..- attr(*, "names")= chr [1:31] "IMP" "IBA" "IBA" "IBA" ...
## Map to pathways (only 100 pathways in order to reduce the runtime)
pathlistDBsub <- pathlistDB[1:100]
pathlist <- omics2pathlist(data=studyData, pathlistDBsub, featureAnno, 
                           restrictUp=100, restrictDown=20, minPathSize=10) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   38.00   50.50   60.84   74.25  164.00

5 BioMM framework

5.1 Recapitulation

Briefly, the BioMM framework consists of two learning stages [1]. During the first stage, biological meta-information is used to ‘compress’ the variables of the original dataset into pathway-level ‘latent variables’ (henceforth called stage-2 data) using either supervised or unsupervised learning models. In the second stage, a supervised model is built using the stage-2 data with non-negative outcome-associated features for prediction.

5.2 End-to-end prediction modules

5.2.1 Interface to machine learning models

The end-to-end prediction is performed using BioMM() function. Both supervised and unsupervised learning are implemented in the BioMM framework, which are indicated by the argument supervisedStage1=TRUE or supervisedStage1=FALSE. Commonly used supervised classifiers: generalized regression models with lasso, ridge or elastic net regularization (GLM) [4], support vector machine (SVM) [3] and random forest [2] are included. For the unsupervised method, regular or sparse constrained principal component analysis (PCA) [5] is used. Generic resampling methods include cross-validation (CV) and bootstrapping (BS) procedures as the argument resample1="CV" or resample1="BS". Stage-2 data is reconstructed using either resampling methods during machine learning prediction or independent test set prediction if the argument testData is provided.

5.2.1.1 Example

To apply random forest model, we use the argument classifier=randForest in BioMM() with the classification mode at both stages. predMode indicate the prediction type, here we use classification for binary outcome prediction. A set of model hyper-parameters are supplied by the argument paramlist at both stages. Pathway-based stratification is carried out in this example. We focused on the autosomal region to limit the potential influence of sex on machine learning due to the phenomenon of X chromosome inactivation or the existence of an additional X chromosome in female samples. Therefore it’s suggested to exclude sex chromosome in the user-supplied featureAnno input file.

## Parameters
supervisedStage1=TRUE
classifier <- "randForest"
predMode <- "classification"
paramlist <- list(ntree=300, nthreads=10)   
param1 <- MulticoreParam(workers = 1)
param2 <- MulticoreParam(workers = 10)
## If BioMM is installed from Github, please use the following assignment:
## param1 <- 1
## param2 <- 10

studyDataSub <- studyData[,1:2000] ## to reduce the runtime
set.seed(123)
result <- BioMM(trainData=studyDataSub, testData=NULL, pathlistDB, featureAnno, 
                restrictUp=100, restrictDown=10, minPathSize=10, 
                supervisedStage1, typePCA="regular", 
                resample1="BS", resample2="CV", dataMode="allTrain",
                repeatA1=50, repeatA2=1, repeatB1=20, repeatB2=1, 
                nfolds=10, FSmethod1=NULL, FSmethod2=NULL, 
                cutP1=0.1, cutP2=0.1, fdr2=NULL, FScore=param1, 
                classifier, predMode, 
                paramlist, innerCore=param2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   10.00   11.00   11.98   13.00   22.00 
## 
##  Levels of predicted Y = 2 
## 
##    AUC ACC    R2
## R2 0.9 0.9 0.695
print(result)
##    AUC ACC    R2
## R2 0.9 0.9 0.695

Other machine learning models can be employed with the following respective parameter settings. For the classifier "SVM", parameters can be tuned using an internal cross validation if tuneP=TRUE. For generalized regression model glmnet, elastic net is specified by the input argument alpha=0.5. Alternatively, alpha=1 is for the lasso and alpha=0 is the ridge. For the unsupervised learning supervisedStage1=FALSE, regular PCA typePCA="regular" is applied and followed with random forest classification classifier2=TRUE.

## SVM 
supervisedStage1=TRUE
classifier <- "SVM"
predMode <- "classification"
paramlist <- list(tuneP=FALSE, kernel="radial", 
                  gamma=10^(-3:-1), cost=10^(-3:1))

## GLM with elastic-net
supervisedStage1=TRUE
classifier <- "glmnet"
predMode <- "classification" 
paramlist <- list(family="binomial", alpha=0.5, 
                   ypeMeasure="mse", typePred="class")

## PCA + random forest
supervisedStage1=FALSE
classifier <- "randForest"
predMode <- "classification"
paramlist <- list(ntree=300, nthreads=10)  

5.2.2 Interface to biological stratification strategies

For stratification of predictors using biological information, various strategies can be applied. Currently, BioMM() integrates gene ontological pathway based stratification, which not only accounts for epistasis between first level variables within the functional category, but also considers the interaction between pathways. Therefore, this may provide better information on functional insight.

5.2.2.1 Example

End-to-end prediction based on pathway-wide stratification on genome-wide DNA methylation data is demonstrated below. PCA is used at stage-1 to reconstruct pathway level data, then the random forest model with 10-fold cross validation is applied on stage-2 data to estimate the prediction performance.

## Parameters
supervisedStage1=FALSE
classifier <- "randForest"
predMode <- "classification"
paramlist <- list(ntree=300, nthreads=10)   
param1 <- MulticoreParam(workers = 1)
param2 <- MulticoreParam(workers = 10)
## If BioMM is installed from Github, please use the following assignment:
## param1 <- 1
## param2 <- 10

set.seed(123)
result <- BioMM(trainData=studyData, testData=NULL, 
                pathlistDBsub, featureAnno, 
                restrictUp=100, restrictDown=10, minPathSize=10, 
                supervisedStage1, typePCA="regular", 
                resample1="BS", resample2="CV", dataMode="allTrain",
                repeatA1=40, repeatA2=1, repeatB1=40, repeatB2=1, 
                nfolds=10, FSmethod1=NULL, FSmethod2=NULL, 
                cutP1=0.1, cutP2=0.1, fdr2=NULL, FScore=param1, 
                classifier, predMode, paramlist, innerCore=param2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   20.00   37.00   45.57   55.00  164.00 
## 
##  Levels of predicted Y = 2 
## 
##      AUC   ACC    R2
## R2 0.725 0.725 0.259
print(result)
##      AUC   ACC    R2
## R2 0.725 0.725 0.259

5.3 Stage-2 data exploration

5.3.1 Generation of stage-2 data

Here we demonstrate using supervised random forest method on genome-wide DNA methylation. Gene ontological pathways are used for the generation of stage-2 data.

## Pathway level data or stage-2 data prepared by reconBySupervised()
stage2dataA <- readRDS(system.file("extdata", "/stage2dataA.rds", 
                       package="BioMM"))

head(stage2dataA[,1:5])
##           label GO:0000027 GO:0000045 GO:0000050 GO:0000060
## GSM951223     1      0.663      0.557      0.707      0.653
## GSM951231     1      0.655      0.710      0.737      0.686
## GSM951249     1      0.776      0.568      0.757      0.533
## GSM951273     1      0.664      0.510      0.741      0.642
## GSM951214     1      0.662      0.632      0.530      0.582
## GSM951270     1      0.419      0.366      0.474      0.415
dim(stage2dataA)
## [1] 40 51
#### Alternatively, 'stage2dataA' can be created by the following code:
## Parameters  
classifier <- "randForest" 
predMode <- "probability" 
paramlist <- list(ntree=300, nthreads=40)  
param1 <- MulticoreParam(workers = 1)
param2 <- MulticoreParam(workers = 10) 
## If BioMM is installed from Github, please use the following assignment:
## param1 <- 1
## param2 <- 10
set.seed(123)
## This will take a bit longer to run
stage2dataA <- reconBySupervised(trainDataList=pathlist, testDataList=NULL,
                            resample="BS", dataMode="allTrain",
                            repeatA=25, repeatB=1, nfolds=10,
                            FSmethod=NULL, cutP=0.1, fdr=NULL, FScore=param1,
                            classifier, predMode, paramlist,
                            innerCore=param2, outFileA=NULL, outFileB=NULL)

5.3.2 Visualization

5.3.2.1 Explained variation of stage-2 data

The distribution of the proportion of variance explained for the individual generated feature of stage-2 data for the classification task is illustrated plotVarExplained() below. Nagelkerke pseudo R-squared measure is used to compute the explained variance. The argument posF=TRUE indicates that only positively outcome-associated features are plotted, since negative associations likely reflect random effects in the underlying data [6].

param <- MulticoreParam(workers = 1) 
## If BioMM is installed from Github, please use the following assignment:
## param <- 1 
plotVarExplained(data=stage2dataA, posF=TRUE, 
                 core=param, horizontal=FALSE, fileName=NULL)
## png 
##   2

5.3.2.2 Prioritization of outcome-associated functional patterns

plotRankedFeature() is employed to rank and visualize the outcome-associated features from stage-2 data. The argument topF=10 and posF=TRUE are used to define the top 10 positively outcome-associated features. Nagelkerke pseudo R-squared measure is utilized to evaluate the importance of the ranked features as indicated by the argument rankMetric="R2". The size of the investigated pathway is pictured as the argument colorMetric="size".

param <- MulticoreParam(workers = 10) 
## If BioMM is installed from Github, please use the following assignment:
## param <- 1 
topPath <- plotRankedFeature(data=stage2dataA, 
                             posF=TRUE, topF=10, 
                             blocklist=pathlist,
                             rankMetric="R2", 
                             colorMetric="size", 
                             core=param, fileName=NULL)

5.3.2.3 The significance of CpGs in pathways of interest

cirPlot4pathway() illustrates the significance of the individual CpGs falling into a set of pathways. Here the top 10 outcome-associated pathways are investigated. Negative log P value is used to define the significance of each CpG within these pathways.

core <- MulticoreParam(workers = 10)  
## If BioMM is installed from Github, use the following assignment:
## core <- 10
pathID <- gsub("\\.", ":", rownames(topPath))
pathSet <- pathlist[is.element(names(pathlist), pathID)]
pathMatch <- pathSet[match(pathID, names(pathSet))]

cirPlot4pathway(datalist=pathMatch, 
                topPathID=names(pathMatch), core, fileName=NULL)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
##  Circular_Manhattan Plotting pval...
##  Plots are stored in: /tmp/Rtmp7yftvq/Rbuild766472049e0d/BioMM/vignettes
## png 
##   3

5.4 Computational consideration

BioMM with supervised models at both stages incorporating pathway based stratification method will take longer to run than unsupervised approaches. But the prediction is more powerful. Therefore, we suggest the former even if the computation is more demanding, as the adoption of 5G is pushing advances in computational storage and speed. Parallel computing is implemented and recommended for such scenario. In this vignette, due to the runtime, we only showcased the smaller examples and models with less computation.

6 Session information

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] CMplot_3.4.0             variancePartition_1.16.0 Biobase_2.46.0          
##  [4] BiocGenerics_0.32.0      scales_1.0.0             limma_3.42.0            
##  [7] vioplot_0.3.2            zoo_1.8-6                sm_2.2-5.6              
## [10] pROC_1.15.3              e1071_1.7-2              glmnet_2.0-18           
## [13] foreach_1.4.7            Matrix_1.2-17            rms_5.1-3.1             
## [16] SparseM_1.77             Hmisc_4.2-0              ggplot2_3.2.1           
## [19] Formula_1.2-3            survival_2.44-1.1        lattice_0.20-38         
## [22] ranger_0.11.2            BiocParallel_1.20.0      BioMM_1.2.0             
## [25] BiocStyle_2.14.0        
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-141        bitops_1.0-6        pbkrtest_0.4-7      doParallel_1.0.15  
##  [5] RColorBrewer_1.1-2  progress_1.2.2      tools_3.6.1         backports_1.1.5    
##  [9] R6_2.4.0            rpart_4.1-15        KernSmooth_2.23-16  lazyeval_0.2.2     
## [13] colorspace_1.4-1    nnet_7.3-12         withr_2.1.2         prettyunits_1.0.2  
## [17] tidyselect_0.2.5    gridExtra_2.3       compiler_3.6.1      quantreg_5.51      
## [21] htmlTable_1.13.2    sandwich_2.5-1      labeling_0.3        bookdown_0.14      
## [25] caTools_1.17.1.2    checkmate_1.9.4     polspline_1.1.16    mvtnorm_1.0-11     
## [29] stringr_1.4.0       digest_0.6.22       foreign_0.8-72      minqa_1.2.4        
## [33] rmarkdown_1.16      colorRamps_2.3      base64enc_0.1-3     pkgconfig_2.0.3    
## [37] htmltools_0.4.0     lme4_1.1-21         htmlwidgets_1.5.1   rlang_0.4.1        
## [41] rstudioapi_0.10     gtools_3.8.1        acepack_1.4.1       dplyr_0.8.3        
## [45] magrittr_1.5        Rcpp_1.0.2          munsell_0.5.0       stringi_1.4.3      
## [49] multcomp_1.4-10     yaml_2.2.0          MASS_7.3-51.4       plyr_1.8.4         
## [53] gplots_3.0.1.1      grid_3.6.1          gdata_2.18.0        crayon_1.3.4       
## [57] splines_3.6.1       hms_0.5.1           zeallot_0.1.0       knitr_1.25         
## [61] pillar_1.4.2        tcltk_3.6.1         boot_1.3-23         reshape2_1.4.3     
## [65] nsprcomp_0.5.1-2    codetools_0.2-16    glue_1.3.1          evaluate_0.14      
## [69] latticeExtra_0.6-28 data.table_1.12.6   BiocManager_1.30.9  vctrs_0.2.0        
## [73] nloptr_1.2.1        MatrixModels_0.4-1  gtable_0.3.0        purrr_0.3.3        
## [77] assertthat_0.2.1    xfun_0.10           class_7.3-15        tibble_2.1.3       
## [81] iterators_1.0.12    cluster_2.1.0       TH.data_1.0-10

7 References

[1] NIPS ML4H submission: Chen, J. and Schwarz, E., 2017. BioMM: Biologically-informed Multi-stage Machine learning for identification of epigenetic fingerprints. arXiv preprint arXiv:1712.00336.

[2] Breiman, L. (2001). “Random forests.” Machine learning 45(1): 5-32.

[3] Cortes, C., & Vapnik, V. (1995). “Support-vector networks.” Machine learning 20(3): 273-297.

[4] Friedman, J., Hastie, T., & Tibshirani, R. (2010). “Regularization paths for generalized linear models via coordinate descent.” Journal of statistical software 33(1): 1.

[5] Wold, S., Esbensen, K., & Geladi, P. (1987). “Principal component analysis.” Chemometrics and intelligent laboratory systems 2(1-3): 37-52.

[6] Claudia Perlich and Grzegorz Swirszcz. On cross-validation and stacking: Building seemingly predictive models on random data. ACM SIGKDD Explorations Newsletter, 12(2):11-15, 2011.

8 Questions & Comments

If you have any questions or comments ?

Contact: junfang.chen@zi-mannheim.de