1 Introduction

Cancer is an umbrella term that includes a range of disorders, from those that are fast-growing and lethal to indolent lesions with low or delayed potential for progression to death. One critical unmet challenge is that molecular disease subtypes characterized by relevant clinical differences, such as survival, are difficult to differentiate. With the advancement of multi-omics technologies, subtyping methods have shifted toward data integration in order to differentiate among subtypes from a holistic perspective that takes into consideration phenomena at multiple levels. However, these integrative methods are still limited by their statistical assumptions and their sensitivity to noise. In addition, they are unable to predict the risk scores of patients using multi-omics data.

To address this problem, we introduce Subtyping via Consensus Factor Analysis (SCFA), a novel method for cancer subtyping and risk prediction using consensus factor analysis. SCFA follows a three-stage hierarchical process to ensure the robustness of the discovered subtypes. First, the method uses an autoencoder to filter out genes with an insignificant contribution in characterizing each patient. Second, it applies a modified factor analysis to generate a collection of factor representations of the high-dimensional multi-omics data. Finally, it utilizes a consensus ensemble to find subtypes that are shared across all factor representations.
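
The consensus idea in the final stage can be illustrated with a small base-R sketch. This is not SCFA's implementation: the data are simulated, PCA stands in for the modified factor analysis, k-means stands in for the clustering step, and the autoencoder filter is omitted. It only shows how a partition shared across several low-dimensional representations might be extracted.

```r
set.seed(1)

# Toy data (simulated, illustration only): 40 samples x 20 features, two groups.
truth <- rep(c(1, 2), each = 20)
x <- matrix(rnorm(40 * 20), nrow = 40)
x[truth == 2, ] <- x[truth == 2, ] + 4

# Several low-dimensional representations; PCA is an assumed stand-in
# for the factor representations described above.
pcs <- prcomp(x)$x
reps <- lapply(2:4, function(k) pcs[, 1:k, drop = FALSE])

# Cluster each representation separately.
labels <- lapply(reps, function(r) kmeans(r, centers = 2, nstart = 10)$cluster)

# Co-association matrix: fraction of representations in which two samples co-cluster.
coassoc <- Reduce(`+`, lapply(labels, function(cl) outer(cl, cl, "==") * 1)) / length(labels)

# Consensus partition: hierarchical clustering on 1 - co-association.
consensus <- cutree(hclust(as.dist(1 - coassoc), method = "average"), k = 2)
print(table(consensus, truth))
```

Samples that fall in the same cluster in every representation have a co-association of 1, so the consensus partition keeps only the groupings on which the representations agree.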

2 Installation

To install SCFA, install the R package from Bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("SCFA")

SCFA depends on the torch package to build and train the autoencoders. When the SCFA package is loaded, it checks whether the C++ libtorch library is available. The torch package can install libtorch, which is required for neural network computation.

library(SCFA)

# If libtorch is not automatically installed by torch, it can be installed manually using:
torch::install_torch()

3 Using SCFA

3.1 Preparing data

Load the example dataset GBM, a glioblastoma cancer dataset.

# Load required libraries
library(SCFA)
## libtorch is not installed. Use `torch::install_torch()` to download and install libtorch
library(survival)

# Load example data (GBM dataset). For other datasets, download the rds file from the Data folder at https://bioinformatics.cse.unr.edu/software/scfa/Data/ and load the rds object
data("GBM")
# List containing one matrix of microRNA data; other examples have 3 matrices for 3 data types
dataList <- GBM$data
# Survival information
survival <- GBM$survival
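
Before subtyping, it can help to confirm that the input list has the expected layout: one numeric matrix per data type, with samples in rows and features in columns. The sketch below uses a hypothetical helper, check_omics_list, which is not part of SCFA, and simulated toy matrices. It also assumes that all matrices list the same samples in the same row order, which this vignette does not state explicitly.

```r
# Hypothetical helper (not part of SCFA): basic sanity checks for a multi-omics list.
check_omics_list <- function(dataList) {
  stopifnot(is.list(dataList), length(dataList) >= 1)
  for (m in dataList) {
    # Each data type should be a numeric matrix with named rows (samples).
    stopifnot(is.matrix(m), is.numeric(m), !is.null(rownames(m)))
  }
  # Assumption: every data type describes the same samples in the same order.
  ref <- rownames(dataList[[1]])
  same_samples <- vapply(dataList, function(m) identical(rownames(m), ref), logical(1))
  stopifnot(all(same_samples))
  invisible(TRUE)
}

# Toy data standing in for two data types measured on the same five samples.
set.seed(1)
samples <- paste0("patient", 1:5)
toyList <- list(
  mRNA  = matrix(rnorm(5 * 4), nrow = 5, dimnames = list(samples, paste0("gene", 1:4))),
  miRNA = matrix(rnorm(5 * 3), nrow = 5, dimnames = list(samples, paste0("mir", 1:3)))
)
check_omics_list(toyList)
```

The same call could be applied to dataList after loading GBM; the helper stops with an error as soon as one check fails.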

3.2 Subtyping

We can use the main function SCFA to generate subtypes from multi-omics data. The input of this function is a list of matrices, one per data type. Each matrix has samples as rows and features as columns. The output is the subtype assignment for each patient. We can then perform survival analysis to test whether survival differs significantly between the discovered subtypes.

# Generating subtyping result
set.seed(1)
subtype <- SCFA(dataList, seed = 1, ncores = 4L)

# Perform survival analysis on the result
coxFit <- coxph(Surv(time = Survival, event = Death) ~ as.factor(subtype), data = survival, ties="exact")
coxP <- round(summary(coxFit)$sctest[3],digits = 20)
print(coxP)
##     pvalue 
## 0.01235006
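
As a complementary check, a log-rank test and Kaplan-Meier summaries per subtype can be computed with the survival package. The sketch below runs on simulated toy data (the column names mirror those used above, but the values are invented for illustration), so its numbers say nothing about GBM. The log-rank test is closely related to the Cox score test used above.

```r
library(survival)

# Simulated toy data (illustration only): two subtypes with different hazards.
set.seed(1)
toy <- data.frame(
  Survival = c(rexp(30, rate = 1 / 300), rexp(30, rate = 1 / 900)),
  Death    = rbinom(60, size = 1, prob = 0.8),
  subtype  = rep(c(1, 2), each = 30)
)

# Log-rank test for a difference in survival between subtypes.
lr <- survdiff(Surv(Survival, Death) ~ subtype, data = toy)
lrP <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
print(lrP)

# Kaplan-Meier estimates per subtype (sample sizes, events, median survival).
km <- survfit(Surv(Survival, Death) ~ subtype, data = toy)
print(km)
```

With real results, toy would be replaced by the survival data frame with the subtype vector added as a column.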

3.3 Predicting risk score

We can use the function SCFA.class to predict the risk scores of patients using the survival information available in the training data. The function takes the training data with its survival information, and the testing data. The output is the risk score of each patient in the testing data. Patients with higher risk scores have a higher probability of experiencing the event earlier than other patients. The concordance index is used to measure the agreement between predicted risk scores and survival information.

# Split data into training and test sets
set.seed(1)
idx <- sample.int(nrow(dataList[[1]]), round(nrow(dataList[[1]])/2) )

survival$Survival <- survival$Survival - min(survival$Survival) + 1 # Survival time must be positive

trainList <- lapply(dataList, function(x) x[idx, ] )
trainSurvival <- Surv(time = survival[idx,]$Survival, event =  survival[idx,]$Death)

testList <- lapply(dataList, function(x) x[-idx, ] )
testSurvival <- Surv(time = survival[-idx,]$Survival, event =  survival[-idx,]$Death)

# Perform risk prediction
result <- SCFA.class(trainList, trainSurvival, testList, seed = 1, ncores = 4L)

# Validation using concordance index
c.index <- survival::concordance(coxph(testSurvival ~ result))$concordance
print(c.index)
## [1] 0.5783241
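
To make the concordance index more concrete, the sketch below computes Harrell's C by hand on a tiny invented example. The helper c_index is hypothetical (it is not part of SCFA or survival), the data are made up, and tied survival times are not treated specially; survival::concordance remains the function to use in practice.

```r
# Hypothetical helper (not part of SCFA or survival): Harrell's C for a risk score.
c_index <- function(time, event, risk) {
  concordant <- 0
  tied <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # A pair is comparable when patient i has an observed event before time[j].
      if (event[i] == 1 && time[i] < time[j]) {
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1   # earlier event has the higher risk score
        } else if (risk[i] == risk[j]) {
          tied <- tied + 1               # tied risk scores count as half
        }
      }
    }
  }
  (concordant + 0.5 * tied) / comparable
}

# Invented example: five patients, the third one is censored.
time  <- c(5, 10, 15, 20, 25)
event <- c(1, 1, 0, 1, 1)
risk  <- c(0.9, 0.7, 0.8, 0.3, 0.1)
print(c_index(time, event, risk))
## [1] 0.875
```

Of the 8 comparable pairs, 7 are ordered correctly by the risk score. A value of 0.5 corresponds to random ordering and 1 to perfect agreement.
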
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] survival_3.4-0   SCFA_1.8.0       knitr_1.40       BiocStyle_2.26.0
## 
## loaded via a namespace (and not attached):
##  [1] shape_1.4.6            xfun_0.34              bslib_0.4.0           
##  [4] clusterCrit_1.2.8      splines_4.2.1          lattice_0.20-45       
##  [7] snow_0.4-4             htmltools_0.5.3        yaml_2.3.6            
## [10] rlang_1.0.6            jquerylib_0.1.4        BiocParallel_1.32.0   
## [13] bit64_4.0.5            matrixStats_0.62.0     foreach_1.5.2         
## [16] stringr_1.4.1          codetools_0.2-18       psych_2.2.9           
## [19] evaluate_0.17          callr_3.7.2            fastmap_1.1.0         
## [22] ps_1.7.2               parallel_4.2.1         Rcpp_1.0.9            
## [25] torch_0.9.0            BiocManager_1.30.19    cachem_1.0.6          
## [28] coro_1.0.3             jsonlite_1.8.3         bit_4.0.4             
## [31] mnormt_2.1.1           RhpcBLASctl_0.21-247.1 digest_0.6.30         
## [34] stringi_1.7.8          bookdown_0.29          processx_3.8.0        
## [37] grid_4.2.1             cli_3.4.1              tools_4.2.1           
## [40] magrittr_2.0.3         sass_0.4.2             glmnet_4.1-4          
## [43] cluster_2.1.4          pkgconfig_2.0.3        Matrix_1.5-1          
## [46] rmarkdown_2.17         iterators_1.0.14       R6_2.5.1              
## [49] nlme_3.1-160           igraph_1.3.5           compiler_4.2.1