Cancer is an umbrella term that includes a range of disorders, from those that are fast-growing and lethal to indolent lesions with low or delayed potential for progression to death. One critical unmet challenge is that molecular disease subtypes characterized by relevant clinical differences, such as survival, are difficult to differentiate. With the advancement of multi-omics technologies, subtyping methods have shifted toward data integration in order to differentiate among subtypes from a holistic perspective that takes into consideration phenomena at multiple levels. However, these integrative methods are still limited by their statistical assumption and their sensitivity to noise. In addition, they are unable to predict the risk scores of patients using multi-omics data.
To address this problem, we introduce Subtyping via Consensus Factor Analysis (SCFA), a novel method for cancer subtyping and risk prediction using consensus factor analysis. SCFA follows a three-stage hierarchical process to ensure the robustness of the discovered subtypes. First, the method uses an autoencoder to filter out genes with an insignificant contribution in characterizing each patient. Second, it applies a modified factor analysis to generate a collection of factor representations of the high-dimensional multi-omics data. Finally, it utilizes a consensus ensemble to find subtypes that are shared across all factor representations.
To install SCFA
, you need to install the R pacakge from Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("SCFA")
SCFA depends on the torch
package to build and train the autoencoders. When SCFA package is loaded, it will check for the availability of C++ libtorch
. torch
package can be used to install C++ libtorch
, which is necessary for neural network computation.
library(SCFA)
# If libtorch is not automatically installed by torch, it can be installed manually using:
torch::install_torch()
Load the example data GBM
. GBM
is the Glioblastoma cancer dataset.
#Load required library
library(SCFA)
## libtorch is not installed. Use `torch::install_torch()` to download and install libtorch
library(survival)
# Load example data (GBM dataset), for other dataset, download the rds file from the Data folder at https://bioinformatics.cse.unr.edu/software/scfa/Data/ and load the rds object
data("GBM")
# List of one matrix of microRNA data, other examples would have 3 matrices of 3 data types
dataList <- GBM$data
# Survival information
survival <- GBM$survival
We can use the main funtion SCFA
to generate subtypes from multi-omics data. The input of this function is a list of matrices from different data types. Each matrix has rows as samples and columns as features. The output of this function is subtype assignment for each patient. We can perform survival analysis to determine the significance in survival differences between discovered subtypes.
# Generating subtyping result
set.seed(1)
subtype <- SCFA(dataList, seed = 1, ncores = 4L)
# Perform survival analysis on the result
coxFit <- coxph(Surv(time = Survival, event = Death) ~ as.factor(subtype), data = survival, ties="exact")
coxP <- round(summary(coxFit)$sctest[3],digits = 20)
print(coxP)
## pvalue
## 0.01213664
We can use the function SCFA.class
to predict risk score of patients using available survival information from training data. We need to provide the function with training data with survival information, and testing data. The output is the risk score of each patient. Patient with higher risk scores have higher probablity to experience event before the other patient. Concordance index is use to confirm the correlation between predicted risk scores and survival information.
# Split data to train and test
set.seed(1)
idx <- sample.int(nrow(dataList[[1]]), round(nrow(dataList[[1]])/2) )
survival$Survival <- survival$Survival - min(survival$Survival) + 1 # Survival time must be positive
trainList <- lapply(dataList, function(x) x[idx, ] )
trainSurvival <- Surv(time = survival[idx,]$Survival, event = survival[idx,]$Death)
testList <- lapply(dataList, function(x) x[-idx, ] )
testSurvival <- Surv(time = survival[-idx,]$Survival, event = survival[-idx,]$Death)
# Perform risk prediction
result <- SCFA.class(trainList, trainSurvival, testList, seed = 1, ncores = 4L)
# Validation using concordance index
c.index <- survival::concordance(coxph(testSurvival ~ result))$concordance
print(c.index)
## [1] 0.5783241
sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] survival_3.5-5 SCFA_1.8.1 knitr_1.42 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 bslib_0.4.2 compiler_4.2.3
## [4] BiocManager_1.30.20 jquerylib_0.1.4 iterators_1.0.14
## [7] tools_4.2.3 digest_0.6.31 bit_4.0.5
## [10] nlme_3.1-162 jsonlite_1.8.4 evaluate_0.20
## [13] lattice_0.20-45 pkgconfig_2.0.3 rlang_1.1.0
## [16] psych_2.3.3 igraph_1.4.1 foreach_1.5.2
## [19] Matrix_1.5-3 cli_3.6.1 yaml_2.3.7
## [22] parallel_4.2.3 RhpcBLASctl_0.23-42 xfun_0.38
## [25] coro_1.0.3 fastmap_1.1.1 cluster_2.1.4
## [28] sass_0.4.5 glmnet_4.1-7 bit64_4.0.5
## [31] grid_4.2.3 R6_2.5.1 snow_0.4-4
## [34] processx_3.8.0 BiocParallel_1.32.6 rmarkdown_2.21
## [37] bookdown_0.33 callr_3.7.3 magrittr_2.0.3
## [40] matrixStats_0.63.0 splines_4.2.3 ps_1.7.3
## [43] codetools_0.2-19 htmltools_0.5.5 mnormt_2.1.1
## [46] shape_1.4.6 torch_0.9.1 cachem_1.0.7