The consICa package: User’s manual

Petr V. Nazarov, Tony Kaoma, Maryna Chepeleva

2022-11-01

Abstract

This vignette describes the Bioconductor package consICA. We developed a fast parallelized algorithm for deconvolution of omics data, based on the independent component analysis. Additionally, we implemented convenient functionality for the analysis of decomposed signals including ANOVA-based linking to experimental factors, enrichment analysis, and survival analysis. A PDF report can be generated automatically by the package. The package was previously applied to single-cell and bulk-sample transcriptomics data in the field of cancer research. We showed the ability of the approach to (i) separate biologically relevant molecular signals from technical biases, (ii) identify molecular signals specific to cell types and cellular processes, (iii) provide a feature engineering mechanism improving the classification of the samples [1,2,3].

Package

consICA 1.0.0

1 Introduction
2 Installing and loading the package
3 Example dataset
4 Consensus independent component analysis
5 Enrichment analysis
6 Survival analysis
7 Automatic report generation
8 Session info
9 References

This document describes the usage of the functions integrated into the package and is meant to be a reference document for the end-user.

1 Introduction

Independent component analysis (ICA) of omics data can be used for deconvolution of biological signals and technical biases, correction of batch effects, feature engineering for classification and integration of the data [4]. The consICA package allows building robust consensus ICA-based deconvolution of the data as well as running post-hoc biological interpretation of the results. The implementation of parallel computing in the package ensures efficient analysis using modern multicore systems.

2 Installing and loading the package

Load the package with the following command:

if (!requireNamespace("BiocManager", quietly = TRUE)) { 
    install.packages("BiocManager")
}
BiocManager::install("consICA")

library('consICA')
#> 
#>

3 Example dataset

The usage of the package functions for an independent component deconvolution is shown for an example set of 472 samples and 2454 most variable genes from the skin cutaneous melanoma (SKCM) TCGA cohort. It stored as a SummarizedExperiment object. In addition the data includes metadata such as age, gender, cancer subtype etc. and survival time-to-event data. Use data as

library(SummarizedExperiment, verbose = FALSE)
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
data("samples_data")

samples_data
#> class: SummarizedExperiment 
#> dim: 2454 472 
#> metadata(0):
#> assays(1): X
#> rownames(2454): A2ML1 AACSP1 ... ZNF883 ZP1
#> rowData names(0):
#> colnames(472): 3N.A9WB.Metastatic 3N.A9WC.Metastatic ...
#>   Z2.AA3S.Metastatic Z2.AA3V.Metastatic
#> colData names(103): age_at_initial_pathologic_diagnosis gender ... time
#>   event
# According to our terminology expression matrix X of 2454 genes and 472 samples
X <- data.frame(assay(samples_data))
dim(X)
#> [1] 2454  472
# variables described each sample
Var <- data.frame(colData(samples_data))
dim(Var)
#> [1] 472 103
colnames(Var)[1:20] # print first 20
#>  [1] "age_at_initial_pathologic_diagnosis" "gender"                             
#>  [3] "race"                                "Weight"                             
#>  [5] "Height"                              "BMI"                                
#>  [7] "Ethinicity"                          "Cancer.Type.Detailed"               
#>  [9] "Tumor.location.site"                 "ajcc_pathologic_tumor_stage"        
#> [11] "Cancer.stage.M"                      "Cancer.stage.N"                     
#> [13] "Cancer.stage.T"                      "initial_pathologic_dx_year"         
#> [15] "birth_days_to_initial_diagnosis"     "Year.of.Birth"                      
#> [17] "age.at.last.contact"                 "Year.of.last.contact"               
#> [19] "last_contact_days_to"                "age.at.death"
# survival time and event for each sample
Sur <- data.frame(colData(samples_data))[,c("time", "event")]
Var <- Var[,-which(colnames(Var) != "time" & colnames(Var) != "event")]

4 Consensus independent component analysis

Independent component analysis (ICA) is an unsupervised method of signal deconvolution [3]. ICA decomposes combined gene expression matrix from multiple samples into meaningful signals in the space of genes (metagenes, S) and weights in the space of samples (weight matrix, M). Figure 1. ICA decomposes combined gene expression matrix from multiple samples into meaningful signals in the space of genes (metagenes, S) and weights in the space of samples (weight matrix, M). Biological processes and signatures of cell subtypes can be found in S, while M could be linked to patient groups and patient survival (Cox regression and Eq.1).

The consICA function calculates the consensus ICA for an expression matrix: X = 𝑆 × 𝑀, where S is a matrix of statistically independent and biologically meaningful signals (metagenes) and M – their weights (metasamples). You can set the number of components, the number of consensus runs. The function will print logs every show.every run. Use ncores for parallel calculation. To filter out genes (rows) with values lower than a threshold set the filter.thr argument.

set.seed(2022)
cica <- consICA(samples_data, ncomp=40, ntry=10, show.every=0)

The output of consensus ICA is a list with input data X (assays of SummarizedExperiment samples_data object) and it dimensions, consensus metagene S and weight matrix M, number of components, mean R2 between rows of M (mr2), stability as mean R2 between consistent columns of S in multiple tries (stab) and number of best iteration (ibest). For compact output use reduced = TRUE.

str(cica, max.level=1)
#> List of 9
#>  $ X        :Formal class 'SummarizedExperiment' [package "SummarizedExperiment"] with 5 slots
#>  $ S        : num [1:2454, 1:40] -0.0634 -2.2511 0.4664 -0.1083 0.2724 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ M        : num [1:40, 1:472] 0.2433 -0.392 0.0627 -0.3946 -0.0956 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ mr2      : num [1:10] 0.0261 0.0265 0.0277 0.0268 0.0286 ...
#>  $ i.best   : int 10
#>  $ stab     : num [1:10, 1:40] 0.988 0.986 0.987 0.988 0.985 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ ncomp    : num 40
#>  $ nsamples : int 472
#>  $ nfeatures: int 2454

Now we can extract the names of features (rows in X and S matrices) and their false discovery rates values for positive/negative contribution to each component. Use getFeatures to extract them.

features <- getFeatures(cica, alpha = 0.05, sort = TRUE)
#Positive and negative affecting features for first components are
head(features$ic.1$pos)
#>          features          fdr
#> SOX1         SOX1 9.034615e-09
#> USP43       USP43 4.493211e-07
#> DMRT2       DMRT2 3.726509e-06
#> TRIM51CP TRIM51CP 4.229424e-04
#> FAM19A5   FAM19A5 9.126773e-04
#> GABRA5     GABRA5 1.175731e-03
head(features$ic.1$neg)
#>         features           fdr
#> MAGEA6    MAGEA6 1.927747e-140
#> MAGEA3    MAGEA3 1.394084e-138
#> MAGEA12  MAGEA12 2.833615e-135
#> CSAG1      CSAG1 1.510751e-118
#> MAGEC1    MAGEC1 9.601315e-109
#> MAGEC2    MAGEC2 2.041645e-104

Two lists of top-contributing genes resulted from the getFeatures – positively and negatively involved. The plot of genes contribution to the first component (metagene) is shown below.

icomp <- 1
plot(sort(cica$S[,icomp]),col = "#0000FF", type="l", ylab=("involvement"),xlab="genes",las=2,cex.axis=0.4, main=paste0("Metagene #", icomp, "\n(involvement of features)"),cex.main=0.6)

Estimate the variance explained by the independent components with estimateVarianceExplained. Use plotICVarianceExplained to plot it.

var_ic <- estimateVarianceExplained(cica)
p <- plotICVarianceExplained(cica)

5 Enrichment analysis

For each component we can carry out an enrichment analysis. The genes with expression above the selected threshold in at least one sample, were used as a background gene list and significantly overrepresented(adj.p-value < alpha) GO terms. You can select the db for search: biological process (“BP”), molecular function (“MF”) and/or cellular component (“CC”).

## To save time for this example load result of getGO(cica, alpha = 0.05, db = c("BP", "MF", "CC"))
if(!file.exists("GOs_40_s2022.rds")){
  GOs <- readRDS(url("http://edu.modas.lu/data/consICA/GOs_40_s2022.rds", "r"))
  saveRDS(GOs, "GOs_40_s2022.rds")
} else{
    GOs <- readRDS("GOs_40_s2022.rds")
}
## Search GO (biological process)
# GOs <- getGO(cica, alpha = 0.05, db = "BP")
## Search GO (biological process, molecular function, cellular component)
# GOs <- getGO(cica, alpha = 0.05, db = c("BP", "MF", "CC"))

6 Survival analysis

Weights of the components (rows of matrix M) can be statistically linked to patient survival using Cox partial hazard regression [4]. In survivalAnalysis function adjusted p-values of the log-rank test are used to select significant components. However, the prognostic power of each individual component might not have been high enough to be applied to the patients from the new cohort. Therefore, survivalAnalysis integrates weights of several components, calculating the risk score (RS) with improved prognostic power. For each sample, its RS is the sum of the products of significant log-hazard ratios (LHR) of the univariable Cox regression, the component stability R2 and the standardized row of weight matrix M.

RS <- survivalAnalysis(cica, surv = Sur)

cat("Hazard score for significant components")
#> Hazard score for significant components
RS$hazard.score
#>       ic.3       ic.6       ic.7      ic.11      ic.12      ic.19      ic.20 
#> -0.2075170 -0.3608106  0.3660896 -0.3016671 -0.1889754 -0.1950913  0.3077952 
#>      ic.24      ic.28      ic.29      ic.32      ic.33      ic.39 
#>  0.2092469 -0.1685342 -0.2196184 -0.1920771 -0.2922595  0.2689534

7 Automatic report generation

The best way to get a full description of extracted components is using our automatic report. We union all analyses of independent components into a single function-generated report.

# Generate report with independent components description
if(FALSE){
  saveReport(cica, GO=GOs, Var=Var, surv = Sur)
}

The copy of this report you can find at http://edu.modas.lu/data/consICA/report_ICA_40.pdf

# delete loaded file after vignette run
unlink("GOs_40_s2022.rds")

8 Session info

sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.28.0 Biobase_2.58.0             
#>  [3] GenomicRanges_1.50.0        GenomeInfoDb_1.34.0        
#>  [5] IRanges_2.32.0              S4Vectors_0.36.0           
#>  [7] BiocGenerics_0.44.0         MatrixGenerics_1.10.0      
#>  [9] matrixStats_0.62.0          consICA_1.0.0              
#> [11] BiocStyle_2.26.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] httr_1.4.4             sass_0.4.2             splines_4.2.1         
#>  [4] bit64_4.0.5            jsonlite_1.8.3         topGO_2.50.0          
#>  [7] bslib_0.4.0            sm_2.2-5.7.1           highr_0.9             
#> [10] BiocManager_1.30.19    blob_1.2.3             GenomeInfoDbData_1.2.9
#> [13] yaml_2.3.6             RSQLite_2.2.18         lattice_0.20-45       
#> [16] digest_0.6.30          RColorBrewer_1.1-3     XVector_0.38.0        
#> [19] colorspace_2.0-3       fastICA_1.2-3          htmltools_0.5.3       
#> [22] Matrix_1.5-1           pkgconfig_2.0.3        pheatmap_1.0.12       
#> [25] SparseM_1.81           magick_2.7.3           bookdown_0.29         
#> [28] zlibbioc_1.44.0        GO.db_3.16.0           scales_1.2.1          
#> [31] BiocParallel_1.32.0    KEGGREST_1.38.0        cachem_1.0.6          
#> [34] cli_3.4.1              survival_3.4-0         magrittr_2.0.3        
#> [37] crayon_1.5.2           memoise_2.0.1          evaluate_0.17         
#> [40] graph_1.76.0           tools_4.2.1            org.Hs.eg.db_3.16.0   
#> [43] lifecycle_1.0.3        stringr_1.4.1          munsell_0.5.0         
#> [46] DelayedArray_0.24.0    AnnotationDbi_1.60.0   Biostrings_2.66.0     
#> [49] compiler_4.2.1         jquerylib_0.1.4        rlang_1.0.6           
#> [52] grid_4.2.1             RCurl_1.98-1.9         bitops_1.0-7          
#> [55] rmarkdown_2.17         gtable_0.3.1           codetools_0.2-18      
#> [58] DBI_1.1.3              R6_2.5.1               knitr_1.40            
#> [61] fastmap_1.1.0          bit_4.0.4              stringi_1.7.8         
#> [64] parallel_4.2.1         Rcpp_1.0.9             vctrs_0.5.0           
#> [67] png_0.1-7              xfun_0.34

9 References

Golebiewska, A., et al. Patient-derived organoids and orthotopic xenografts of primary and recurrent gliomas represent relevant patient avatars for precision oncology. Acta Neuropathol 2020;140(6):919-949.
Scherer, M., et al. Reference-free deconvolution, visualization and interpretation of complex DNA methylation data using DecompPipeline, MeDeCom and FactorViz. Nat Protoc 2020;15(10):3240-3263.
Sompairac, N.; Nazarov, P.V.; Czerwinska, U.; Cantini, L.; Biton, A,; Molkenov, A,; Zhumadilov, Z.; Barillot, E.; Radvanyi, F.; Gorban, A.; Kairov, U.; Zinovyev, A. Independent component analysis for unravelling complexity of cancer omics datasets. International Journal of Molecular Sciences 20, 18 (2019). https://doi.org/10.3390/ijms20184414

4.Nazarov, P.V., Wienecke-Baldacchino, A.K., Zinovyev, A. et al. Deconvolution of transcriptomes and miRNomes by independent component analysis provides insights into biological processes and clinical outcomes of melanoma patients. BMC Med Genomics 12, 132 (2019). https://doi.org/10.1186/s12920-019-0578-4