1 Introduction

For a more detailed explanation of the VSClust function and the workflow, please take a look at the vignette on running the VSClust workflow.

Here, we present an example script that integrates the clustering with data objects from Bioconductor, such as QFeatures, SummarizedExperiment and MultiAssayExperiment.

2 Installation and additional packages

Use the common Bioconductor commands for installation:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("vsclust")

The full functionality can be obtained by additionally installing and loading the packages yaml, shiny, clusterProfiler, and matrixStats.
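
Before running the workflow, it can help to check which of these optional packages are already available. The following is a minimal sketch using only base R; the variable names are chosen for illustration and are not part of vsclust.

```r
# Optional packages named in the text above
optional_pkgs <- c("yaml", "shiny", "clusterProfiler", "matrixStats")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it
installed <- vapply(optional_pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the optional packages that still need to be installed
missing_pkgs <- optional_pkgs[!installed]
print(missing_pkgs)
```

Any packages listed in missing_pkgs can then be installed with BiocManager::install(missing_pkgs).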

3 Initialization

Here, we define the different parameters for the data set RNASeq2GeneNorm from the miniACC object.

The number of replicates and experimental conditions will be retrieved automatically by specifying the metadata for the grouping.

#### Input parameters, only read when no parameter file was provided #####
## All principal parameters for running VSClust can be defined as in the 
## shiny app at computproteomics.bmb.sdu.dk/Apps/VSClust
# name of study
Experiment <- "miniACC" 
# Paired or unpaired statistical tests when carrying out LIMMA for 
# statistical testing
isPaired <- FALSE
# Number of threads to accelerate the calculation (use 1 if in doubt)
cores <- 1 

# If 0 (default), then automatically estimate the cluster number for the 
# vsclust run from the Minimum Centroid Distance
PreSetNumClustVSClust <- 0 
# If 0 (default), then automatically estimate the cluster number for the 
# original fuzzy c-means from the Minimum Centroid Distance
PreSetNumClustStand <- 0 

# max. number of clusters when estimating the number of clusters. 
# Higher numbers can drastically extend the computation time.
maxClust <- 10

4 Statistics and data preprocessing

First, we log-transform the original data and normalize it to the median. Statistical testing is then applied to the resulting object. After estimating the standard deviations, the resulting matrix consists of the averaged quantitative feature values and a last column holding the standard deviations of the features.

We will separate the samples according to their OncoSign.

library(vsclust)
library(MultiAssayExperiment)
data(miniACC, package="MultiAssayExperiment")
# log-transformation and replacement of non-finite values (e.g. -Inf) with NA
logminiACC <- log2(assays(miniACC)$RNASeq2GeneNorm)
logminiACC[!is.finite(logminiACC)] <- NA
# normalize to median
logminiACC <- t(t(logminiACC) - apply(logminiACC, 2, median, na.rm=TRUE))

miniACC2 <- c(miniACC, log2rnaseq = logminiACC, mapFrom=1L)

## Warning: Assuming column order in the data provided 
##  matches the order in 'mapFrom' experiment(s) colnames

boxplot(logminiACC)
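
The effect of the log-transformation and median normalization above can be checked on a small toy matrix. This is a sketch with made-up values in base R only; it mirrors the same expressions but is not part of the vsclust workflow.

```r
# Toy matrix: 3 features (rows) x 2 samples (columns); values are made up
x <- matrix(c(1, 2, 4, 0, 8, 16), nrow = 3,
            dimnames = list(c("f1", "f2", "f3"), c("s1", "s2")))

logx <- log2(x)                     # log2(0) yields -Inf
logx[!is.finite(logx)] <- NA        # replace non-finite values with NA

# Column-wise medians, ignoring NA
col_medians <- apply(logx, 2, median, na.rm = TRUE)

# Transposing lets the vector of medians be recycled along the samples
normx <- t(t(logx) - col_medians)

# After normalization, each sample (column) is centered on a median of 0
print(apply(normx, 2, median, na.rm = TRUE))
```

The double transpose is needed because R recycles vectors down the columns of a matrix, so the per-sample medians have to be subtracted while the samples are in rows.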

#### running statistical analysis and estimation of individual variances
statOut <- PrepareSEForVSClust(miniACC2, "log2rnaseq", 
                               coldatname = "OncoSign", 
                               isPaired=isPaired, isStat=TRUE)

## -- The following categories will be used as experimental 
##               conditions:
## CN2
## TP53/NF1
## TERT/ZNRF3
## CN1
## Unclassified
## NA
## CTNNB1

## -- Extracted NumReps: 21 and NumCond: 7
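
To get a feeling for how a number of conditions and replicates could follow from a grouping column, the sketch below tabulates a made-up grouping vector in base R. It is an assumption for illustration only: the grouping values are hypothetical, and the actual logic inside PrepareSEForVSClust may differ.

```r
# Hypothetical grouping vector, one entry per sample (NA = unannotated sample)
grouping <- c("CN1", "CN1", "CN2", "CN2", "CN2", NA)

# Count samples per category, keeping NA as its own category
counts <- table(grouping, useNA = "ifany")

# Assumption: conditions = distinct categories, replicates = largest group size
num_cond <- length(counts)
num_reps <- max(counts)
print(counts)
```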

We can see that there is no good separation of cancer signatures on the PCA plot.

5 Estimation of cluster number

There is no simple way to find the optimal number of clusters in a data set. To obtain this number, we run the clustering for different cluster numbers and evaluate the results via so-called validity indices, which provide information about suitable cluster numbers. VSClust mainly uses the “Minimum Centroid Distance”, which denotes the shortest distance between any two of the centroids. Alternatively, one can inspect the Xie Beni index.
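
The idea of a minimum centroid distance can be illustrated with a few made-up centroids. This is a minimal sketch in base R that assumes Euclidean distance; it is not the implementation used inside vsclust.

```r
# Three hypothetical cluster centroids in two dimensions (one row per centroid)
centroids <- rbind(c1 = c(0, 0),
                   c2 = c(3, 4),
                   c3 = c(1, 0))

# dist() returns all pairwise Euclidean distances between the rows
pairwise <- dist(centroids)

# The minimum centroid distance is the smallest of these pairwise distances
min_centroid_dist <- min(pairwise)
print(min_centroid_dist)
```

Here the closest pair is c1 and c3, so the value is 1. Adding more clusters tends to place centroids closer together, which is why large drops in this value are informative when comparing cluster numbers.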

The output of estimClustNum contains the suggestion for the number of clusters.

We further visualize the outcome.

#### Estimate number of clusters with maxClust as maximum number clusters to run 
#### the estimation with
ClustInd <- estimClustNum(statOut$dat, maxClust=maxClust, cores=cores)

## Running cluster number 3

## Running cluster number 4

## Running cluster number 5

## Running cluster number 6

## Running cluster number 7

## Running cluster number 8

## Running cluster number 9

## Running cluster number 10

#### Use the estimated cluster number unless a preset one was given
if (PreSetNumClustVSClust == 0)
  PreSetNumClustVSClust <- optimalClustNum(ClustInd)
if (PreSetNumClustStand == 0)
  PreSetNumClustStand <- optimalClustNum(ClustInd, method="FCM")
#### Visualize
estimClust.plot(ClustInd)

Both validity indices agree with each other and suggest 7 as the most reasonable estimate for the cluster number. However, we can also see that this decreases the number of clustered features quite drastically from over 150 to about 90.
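
Why the number of clustered features can drop is easier to see with a toy membership matrix. In fuzzy clustering, every feature receives a membership value for each cluster, and features without a clearly dominant membership can be left unassigned. The sketch below uses made-up values and assumes a cutoff of 0.5 purely for illustration; neither is taken from the vsclust output.

```r
# Hypothetical membership values: 4 features (rows) x 3 clusters (columns)
membership <- rbind(
  geneA = c(0.90, 0.05, 0.05),
  geneB = c(0.40, 0.35, 0.25),
  geneC = c(0.10, 0.70, 0.20),
  geneD = c(0.34, 0.33, 0.33)
)

threshold <- 0.5  # assumed cutoff for counting a feature as clustered

# Cluster with the highest membership for each feature
best_cluster <- apply(membership, 1, which.max)

# A feature counts as clustered only if its top membership exceeds the cutoff
is_clustered <- apply(membership, 1, max) > threshold
assigned <- ifelse(is_clustered, best_cluster, NA)

print(assigned)
print(sum(is_clustered))
```

With more clusters, memberships are spread over more groups, so fewer features exceed such a cutoff.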

6 Run final clustering

Now we run the clustering again with the optimal parameters from the estimation. One can take alternative numbers of clusters corresponding to large decays in the Minimum Centroid Distance or low values of the Xie Beni index.

First, we carry out the variance-sensitive method:

#### Run variance-sensitive clustering (VSClust)
ClustOut <- runClustWrapper(statOut$dat, 
                            PreSetNumClustVSClust, NULL, 
                            VSClust=TRUE,
                            cores=cores)
Bestcl <- ClustOut$Bestcl
VSClust_cl <- Bestcl

We see how different groups of genes show distinctive expression patterns across the different oncological signatures.

sessionInfo()

## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] MultiAssayExperiment_1.30.0 SummarizedExperiment_1.34.0
##  [3] Biobase_2.64.0              GenomicRanges_1.56.0       
##  [5] GenomeInfoDb_1.40.0         IRanges_2.38.0             
##  [7] S4Vectors_0.42.0            BiocGenerics_0.50.0        
##  [9] MatrixGenerics_1.16.0       matrixStats_1.3.0          
## [11] vsclust_1.6.0               BiocStyle_2.32.0           
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5            xfun_0.43               bslib_0.7.0            
##  [4] ggplot2_3.5.1           lattice_0.22-6          vctrs_0.6.5            
##  [7] tools_4.4.0             generics_0.1.3          parallel_4.4.0         
## [10] tibble_3.2.1            fansi_1.0.6             highr_0.10             
## [13] BiocBaseUtils_1.6.0     pkgconfig_2.0.3         Matrix_1.7-0           
## [16] lifecycle_1.0.4         GenomeInfoDbData_1.2.12 stringr_1.5.1          
## [19] compiler_4.4.0          tinytex_0.50            statmod_1.5.0          
## [22] munsell_0.5.1           httpuv_1.6.15           htmltools_0.5.8.1      
## [25] sass_0.4.9              yaml_2.3.8              later_1.3.2            
## [28] pillar_1.9.0            crayon_1.5.2            jquerylib_0.1.4        
## [31] DelayedArray_0.30.0     cachem_1.0.8            limma_3.60.0           
## [34] magick_2.8.3            abind_1.4-5             mime_0.12              
## [37] tidyselect_1.2.1        digest_0.6.35           stringi_1.8.3          
## [40] dplyr_1.1.4             reshape2_1.4.4          bookdown_0.39          
## [43] splines_4.4.0           fastmap_1.1.1           grid_4.4.0             
## [46] colorspace_2.1-0        cli_3.6.2               SparseArray_1.4.0      
## [49] magrittr_2.0.3          S4Arrays_1.4.0          utf8_1.2.4             
## [52] promises_1.3.0          UCSC.utils_1.0.0        scales_1.3.0           
## [55] rmarkdown_2.26          XVector_0.44.0          httr_1.4.7             
## [58] qvalue_2.36.0           shiny_1.8.1.1           evaluate_0.23          
## [61] knitr_1.46              rlang_1.1.3             Rcpp_1.0.12            
## [64] xtable_1.8-4            glue_1.7.0              BiocManager_1.30.22    
## [67] jsonlite_1.8.8          plyr_1.8.9              R6_2.5.1               
## [70] zlibbioc_1.50.0

VSClust with Bioconductor objects

1 May 2024
