signeRFlow is a Shiny app that allows users to explore mutational signatures and exposures to related mutational processes. With the available modules, users can analyze their own data using different approaches, such as de novo and fitting. There is also a module to explore public datasets from TCGA.
Start the app using either RStudio or a terminal:
The app will open in a new window or in a browser tab.
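The launch command itself is not reproduced in this text. As a minimal sketch, assuming the signeR package is installed and that its launcher function is `signeRFlow()`, the app could be started from an interactive R session like this:

```r
# Sketch only: start signeRFlow from RStudio or from R running in a terminal.
# Assumption: signeR is installed and exports signeRFlow().
can_launch <- interactive() && requireNamespace("signeR", quietly = TRUE)

if (can_launch) {
  # Opens the Shiny app and blocks the session until the app is closed.
  signeR::signeRFlow()
}
```

The guard keeps the snippet harmless in non-interactive scripts, where a Shiny app cannot be shown.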
There are three available modules in the app: signeR de novo, signeR fitting and TCGA Explorer.
You can go through the modules independently by using the app sidebar.
In this module, you can upload an SNV matrix with counts of mutations and execute the signeR de novo algorithm, which applies a Bayesian approach to the non-negative matrix factorization (NMF) of the mutation counts into a matrix product of mutational signatures and exposures to mutational processes.
You can also provide a file with opportunities that are used as weights for the factorization. Further analysis parameters can be set, results can be visualized in different plots, and the signatures found can be compared interactively to the ones in the COSMIC database.
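To make the factorization idea concrete, here is a toy simulation in base R. It is only a sketch: the Poisson form with elementwise opportunity weights is an assumption about the model, all object names (`P`, `E`, `W`, `M`) are illustrative, and nothing here reproduces the signeR sampler. Note that this sketch stores mutation types in rows, whereas the files uploaded to the app store samples in rows.

```r
# Toy simulation of "counts = signatures x exposures, weighted by opportunities".
set.seed(1)
n_types <- 96    # mutation types (base change plus trinucleotide context)
n_samples <- 5   # genomes
n_sig <- 3       # mutational signatures

# P: signatures, one column per signature, each column sums to 1.
P <- matrix(rgamma(n_types * n_sig, shape = 1), nrow = n_types, ncol = n_sig)
P <- sweep(P, 2, colSums(P), "/")

# E: exposures, one row per signature and one column per sample.
E <- matrix(rgamma(n_sig * n_samples, shape = 2, rate = 0.01),
            nrow = n_sig, ncol = n_samples)

# W: opportunities used as elementwise weights (assumed form).
W <- matrix(runif(n_types * n_samples, min = 0.5, max = 1.5),
            nrow = n_types, ncol = n_samples)

# Expected counts, then simulated observed counts.
expected_counts <- (P %*% E) * W
M <- matrix(rpois(length(expected_counts), lambda = expected_counts),
            nrow = n_types, ncol = n_samples)
```

The de novo analysis goes in the opposite direction: it starts from the observed counts (and, optionally, the opportunities) and estimates the signatures and exposures.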
You can upload a VCF file or an SNV matrix file (mandatory) with your own samples to use in the signeR de novo module. You can also upload an opportunity file or use an already built genome opportunity. Alternatively, you can upload a BED file to build an opportunity matrix.
You can upload a VCF file or an SNV matrix file from your computer by clicking the Browse button.
The SNV matrix is a tab-delimited text file with a matrix of SNV counts found in the analyzed genomes. It must contain one row for each genome sample and 97 columns: the first one with sample ids and, after that, one column for each mutation type. Mutations should be specified in the column names (headers), by both the base change and the trinucleotide context where it occurs (for example: C>A:ACA). The table below shows an example of the SNV matrix structure.
| | C>A:ACA | C>A:ACC | C>A:ACG | C>A:ACT | C>A:CCA | … | T>G:TTT |
|---|---|---|---|---|---|---|---|
| PD3851a | 31 | 34 | 9 | 21 | 24 | … | 21 |
| PD3904a | 110 | 91 | 9 | 87 | 108 | … | 77 |
| … | … | … | … | … | … | … | … |
| PD3890a | 122 | 112 | 13 | 107 | 99 | … | 50 |
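Before uploading, it can help to check a file against this layout. The following base R sketch writes a toy SNV matrix to a temporary file and validates it. The helper name `read_snv_matrix` and the toy counts are illustrative, and the assumption is that the 96 mutation types are the six base changes combined with the 16 flanking contexts, labelled as in the table above.

```r
# Build the 96 expected column names, e.g. "C>A:ACA" ... "T>G:TTT".
bases <- c("A", "C", "G", "T")
changes <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
types <- unlist(lapply(changes, function(ch) {
  ref <- substr(ch, 1, 1)
  paste0(ch, ":", rep(bases, each = 4), ref, rep(bases, times = 4))
}))

# Hypothetical helper: read a tab-delimited SNV matrix and check its shape.
read_snv_matrix <- function(path) {
  snv <- read.delim(path, header = TRUE, row.names = 1, check.names = FALSE)
  stopifnot(ncol(snv) == 96,
            setequal(colnames(snv), types),
            all(as.matrix(snv) >= 0))
  snv
}

# Toy data (random counts, for illustration only) written to a temporary file.
set.seed(10)
toy <- matrix(rpois(2 * 96, lambda = 20), nrow = 2,
              dimnames = list(c("sampleA", "sampleB"), types))
snv_file <- tempfile(fileext = ".txt")
write.table(toy, snv_file, sep = "\t", quote = FALSE, col.names = NA)

snv <- read_snv_matrix(snv_file)
```

`check.names = FALSE` matters here: without it, R would rewrite headers such as `C>A:ACA` into syntactically valid names and the comparison would fail.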
If you want to upload a VCF file, you must select the genome build used in your variant calling analysis so that signeR can generate an SNV matrix of counts. You can also generate an SNV matrix from a VCF file using the genCountMatrixFromVcf method from the signeR package. See the documentation for more details.
Warning: You must have installed the genomes BSgenome.Hsapiens.UCSC.hg19 or BSgenome.Hsapiens.UCSC.hg38 from the BSgenome package in order to use the VCF upload.
Columns:
The first column needs to contain the sample ID and other columns contain the 96 trinucleotide contexts.
Rows:
Each row contains the sample ID and the counts for each trinucleotide context.
Example file:
You can upload an Opportunity matrix file or a BED file from your computer by clicking the Browse button. You can also use an already built genome opportunity for human reference genomes. This is an optional file.
The Opportunity matrix is a tab-delimited text file with a matrix of counts of trinucleotide contexts found in the studied genomes. It must be structured as the SNV matrix, with mutations specified on the head line (for each SNV count, the Opportunity matrix shows the total number of genomic loci where the referred mutation could have occurred). The table below shows an example of the opportunity matrix structure.
366199887 | 211452373 | 45626142 | 292410567 | 335391892 | 239339768 | … | 50233875 |
---|---|---|---|---|---|---|---|
202227618 | 116207171 | 25138239 | 161279580 | 184193767 | 131051208 | … | 177385805 |
225505378 | 130255706 | 28152934 | 179996700 | 206678032 | 147634427 | … | 199062504 |
425545790 | 245523433 | 53437284 | 339065644 | 389386002 | 278770926 | … | 375075216 |
452332390 | 259934779 | 55862550 | 361010972 | 412168035 | 292805460 | … | 396657807 |
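As a rough check of an opportunity file, the sketch below reads a header-less, tab-delimited matrix and compares its shape with the number of samples. The helper name `read_opportunity` and the toy values are illustrative. It assumes 96 columns and one row per sample, matching the SNV matrix.

```r
# Hypothetical helper: read an opportunity matrix without a header line.
read_opportunity <- function(path, n_samples) {
  opp <- as.matrix(read.table(path, header = FALSE, sep = "\t"))
  stopifnot(is.numeric(opp),
            ncol(opp) == 96,          # one column per mutation type
            nrow(opp) == n_samples,   # one row per sample (assumed)
            all(opp >= 0))
  opp
}

# Toy opportunities (random values, for illustration only).
set.seed(2)
toy_opp <- matrix(round(runif(2 * 96, min = 1e6, max = 5e6)), nrow = 2)
opp_file <- tempfile(fileext = ".txt")
write.table(toy_opp, opp_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

opp <- read_opportunity(opp_file, n_samples = 2)
```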
If you want to upload a BED file, you must select the genome build used in your analysis so that signeR can generate the opportunities for your regions file. You can also generate an opportunity matrix from the reference genome using the genOpportunityFromGenome method from the signeR package. See the documentation for more details.
Warning: You must have installed the genomes BSgenome.Hsapiens.UCSC.hg19 or BSgenome.Hsapiens.UCSC.hg38 from the BSgenome package in order to use the BED upload.
Columns:
There is no header in this file and each column represents a trinucleotide context.
Rows:
Each row contains the count frequency of the trinucleotides in the whole analyzed region for each sample.
Example file:
There are some parameters that you can define before running the analysis, which you start by clicking the Start de novo analysis button:
Parameters:
Number of signatures:
define the minimum and maximum number of signatures you want signeR to estimate.
EM:
number of iterations performed to estimate the hyper-hyperparameters of the signeR model. Ignored if previously computed values are used for those parameters (fast option).
Warm-up:
number of Gibbs sampler iterations performed in the warm-up phase, before signeR assumes that the model has converged.
Final:
number of final Gibbs sampler iterations used to estimate signatures and exposures.
During the execution, a message will appear on the screen showing the progress. Afterwards, you can download the results by clicking the Download Rdata button, below the Start de novo analysis button, and interact with all available plots in the signeR package.
signeRFlow uses COSMIC v3.2 to calculate the cosine distance between the signatures found and those present in COSMIC. A heatmap is shown in the COSMIC Comparison section of the de novo tab.
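For readers unfamiliar with the measure, the sketch below computes a generic cosine similarity (and the corresponding distance, taken here as one minus the similarity) between two sets of signatures stored column-wise. This illustrates the measure only, not signeRFlow's internal code. The function name and the toy matrices are hypothetical, and no real COSMIC values are used.

```r
# Cosine similarity between every column of `a` and every column of `b`.
cosine_similarity <- function(a, b) {
  stopifnot(nrow(a) == nrow(b))
  dot <- crossprod(a, b)                                  # t(a) %*% b
  norms <- outer(sqrt(colSums(a^2)), sqrt(colSums(b^2)))  # products of lengths
  dot / norms
}

# Toy reference and "found" signatures over 96 mutation types.
set.seed(3)
reference <- matrix(runif(96 * 3), nrow = 96,
                    dimnames = list(NULL, c("RefA", "RefB", "RefC")))
found <- cbind(Found1 = reference[, "RefB"],  # identical to RefB on purpose
               Found2 = runif(96))

similarity <- cosine_similarity(found, reference)
distance <- 1 - similarity

# Closest reference signature for each found signature.
best_match <- colnames(reference)[apply(similarity, 1, which.max)]
```

A matrix such as `distance` is the kind of object a comparison heatmap would display, with found signatures on one axis and reference signatures on the other.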
In this module, you can upload a VCF file or an SNV matrix with counts of mutations, the same as used in the de novo module, and a previous signatures file with known signatures to execute the signeR fitting algorithm, which applies a Bayesian approach to the fitting of mutation counts to known mutational signatures, thus estimating exposures to mutational processes.
You can also provide a file with opportunities, or use an already built genome opportunity, to be used as weights for the factorization. Further analysis parameters can be set and the estimated exposures can be visualized interactively in different plots.
You can upload a VCF file or an SNV matrix file with your own samples to use in the signeR fitting module, together with previously known signatures (mandatory files). You can upload an opportunity file as well. The SNV or VCF file and the opportunity matrix are the same as used in the de novo module.
SNV matrix or VCF: this is the same file used in the de novo module.
Opportunity matrix: this is the same file used in the de novo module.
You can upload a Previous signatures matrix file from your computer by clicking the Browse button.
Previous signatures is a tab-delimited text file with a matrix of previously known signatures. It must contain one column for each signature and one row for each of the 96 SNV types (considering trinucleotide contexts). Mutation types should be contained in the first column, in the same form as the column names of the SNV matrix. The table below shows an example of the previous signatures matrix structure.
| | Signature 2 | Signature 3 | Signature 5 | Signature 6 | … | Signature 8 |
|---|---|---|---|---|---|---|
| C>A:ACA | 0.01110 | 0.00067 | 0.02218 | 0.01494 | … | 0.03672 |
| C>A:ACC | 0.00915 | 0.00062 | 0.01788 | 0.00896 | … | 0.03324 |
| C>A:ACG | 0.00150 | 0.00010 | 0.00213 | 0.00221 | … | 0.00252 |
| … | … | … | … | … | … | … |
| T>G:TTT | 0.00403 | 2.359E-05 | 0.0130 | 0.01337 | … | 0.00722 |
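Because the signatures file lists mutation types in rows while the SNV matrix lists them in columns, it can be worth confirming that the two use the same labels. The sketch below does this in memory with toy objects. The helper name `align_signatures` is hypothetical and the signature values are random, not real signatures.

```r
# The 96 mutation type labels, e.g. "C>A:ACA" ... "T>G:TTT".
bases <- c("A", "C", "G", "T")
changes <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
types <- unlist(lapply(changes, function(ch) {
  ref <- substr(ch, 1, 1)
  paste0(ch, ":", rep(bases, each = 4), ref, rep(bases, times = 4))
}))

# Hypothetical helper: check labels and reorder signature rows so that they
# follow the column order of the SNV matrix.
align_signatures <- function(signatures, snv) {
  stopifnot(nrow(signatures) == 96,
            setequal(rownames(signatures), colnames(snv)))
  signatures[colnames(snv), , drop = FALSE]
}

# Toy inputs: an empty SNV matrix and signatures with shuffled rows.
set.seed(5)
snv <- matrix(0L, nrow = 2, ncol = 96,
              dimnames = list(c("sampleA", "sampleB"), types))
signatures <- matrix(runif(96 * 2), nrow = 96,
                     dimnames = list(sample(types),
                                     c("Signature 2", "Signature 3")))

aligned <- align_signatures(signatures, snv)
```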
Columns:
The first column needs to contain the trinucleotide contexts and other columns contain the known signatures.
Rows:
Each row contains the expected frequency of the given mutation in the indicated trinucleotide context.
Example file:
There are some parameters that you can define before running the analysis, which you start by clicking the Start Fitting analysis button:
Parameters:
EM:
number of iterations performed to estimate the hyper-hyperparameters of the signeR model. Ignored if previously computed values are used for those parameters (fast option).
Warm-up:
number of Gibbs sampler iterations performed in the warm-up phase, before signeR assumes that the model has converged.
Final:
number of final Gibbs sampler iterations used to estimate signatures and exposures.
During the execution, a message will appear on the screen showing the progress. Afterwards, you can download the results by clicking the Download Rdata button, below the Start Fitting analysis button, and interact with all available plots in the signeR package.
In all modules, you can perform downstream analyses, using de novo or fitting results from your own data or from the TCGA Explorer module.
There are two main downstream analyses: Clustering and Covariate.
You can access these analyses in all modules using the tabs Clustering and Covariate.
Hierarchical Clustering
By using the Hierarchical clustering section, you can select different dist and hclust methods:
When you select a new dist or hclust method, the dendrogram plot is updated.
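The same kind of choice can be reproduced outside the app with the `dist` and `hclust` functions from base R's stats package. The sketch below clusters samples by toy exposure values. Whether the app clusters samples, signatures, or both is not stated here, so treat the orientation as an assumption.

```r
# Toy exposure matrix: 6 samples (rows) by 3 signatures (columns).
set.seed(42)
exposures <- matrix(rgamma(6 * 3, shape = 2), nrow = 6,
                    dimnames = list(paste0("sample", 1:6), paste0("S", 1:3)))

# Pairwise distances between samples; `method` is the "dist method".
d <- dist(exposures, method = "euclidean")

# Agglomerative clustering; `method` is the "hclust method".
hc <- hclust(d, method = "average")

# Draw the dendrogram only when a graphics device makes sense.
if (interactive()) {
  plot(hc, main = "Toy exposures", xlab = "", sub = "")
}
```

Swapping `"euclidean"` for `"manhattan"`, or `"average"` for `"complete"` or `"ward.D2"`, changes the tree in the same way that changing the selectors updates the plot in the app.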
Fuzzy Clustering
By using the Fuzzy clustering section, you can set the number of groups, or let the algorithm estimate it (set groups to 1), and click the Run fuzzy button to start the analysis:
During the execution, a message will be shown on the screen indicating the progress.
Warning: Fuzzy clustering can be a long process and demands substantial computing resources.
The output of Fuzzy clustering is shown as a heatmap plot.
To perform a Covariate analysis in signeRFlow, you must upload clinical data as a tab-delimited file with samples in rows and features in columns. You can upload a file by clicking the Browse… button:
Clinical data is a tab-delimited text file with a matrix of available metadata (clinical and/or survival) for each sample. It must have a first column of sample ids, named “SampleID”, whose entries match the row names of the SNV matrix. The number and titles of the remaining columns are free; however, if survival data is included, it must be organized in a column named time (in months) and another named status (which contains 1 for death events and 0 for censored samples). The table below shows an example of the clinical data matrix structure.
SampleID | gender | ajcc_pathologic_stage | ethnicity | race | status | time |
---|---|---|---|---|---|---|
PD3851a | male | Stage I | not hispanic or latino | white | 0 | 236 |
PD3890a | male | Stage II | not hispanic or latino | black or african american | 1 | 199 |
PD3904a | female | Stage II | NA | NA | 0 | 745 |
PD3905a | female | Stage IV | NA | white | 1 | 299 |
PD3945a | male | Stage IV | not hispanic or latino | asian | 0 | 799 |
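These requirements are easy to check before uploading. The sketch below uses three rows from the example table and a hypothetical helper, `check_clinical`, that verifies the SampleID column, the optional survival columns, and the match with the SNV matrix row names.

```r
# Three rows taken from the example table above.
clinical <- data.frame(
  SampleID = c("PD3851a", "PD3890a", "PD3904a"),
  gender = c("male", "male", "female"),
  status = c(0, 1, 0),
  time = c(236, 199, 745),
  stringsAsFactors = FALSE
)

# Row names of the SNV matrix the clinical data should match.
snv_samples <- c("PD3851a", "PD3904a", "PD3890a")

# Hypothetical helper: basic structural checks on the clinical table.
check_clinical <- function(clinical, snv_samples) {
  stopifnot(names(clinical)[1] == "SampleID")
  has_surv <- c("time", "status") %in% names(clinical)
  if (any(has_surv)) {
    # Survival data needs both columns; status must be coded 0 or 1.
    stopifnot(all(has_surv),
              is.numeric(clinical$time),
              all(clinical$status %in% c(0, 1)))
  }
  list(missing_in_clinical = setdiff(snv_samples, clinical$SampleID),
       missing_in_snv = setdiff(clinical$SampleID, snv_samples))
}

report <- check_clinical(clinical, snv_samples)
```

Both elements of `report` are empty when every sample appears in both inputs.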
Columns:
The first column must contain the sample ID. Other columns may contain sample groupings or other features that you would like to co-analyze with exposure data.
Rows:
Each row contains clinical information for one sample: its ID and all other data of interest.
Example file:
After the upload, a description table summarizes the data, with all the features in rows and the class, counts and number of missing values for each feature. When you select a feature (row) in the table, a small panel is shown next to the table summarizing the values, categorical or continuous, of the selected feature:
According to the class of the feature, a set of analyses is available in the Plots section:
Categorical feature:
Differential Exposure Analysis: highlight signatures that are differentially active among groups of samples.
Sample Classification: classify samples based on their exposures to mutational processes.
Numeric feature:
Correlation Analysis: evaluate feature correlation to exposures to mutational signatures.
Linear Regression: assess the relevance of exposures in the final model of the provided feature.
Survival feature:
Survival analysis: evaluate the effect of exposure on survival.
Cox Regression: evaluate the combined effect on survival of exposure levels to different signatures.
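To give a feel for the first two kinds of analysis, the sketch below applies generic base R tests to toy data: a Kruskal-Wallis test for a categorical feature and a Spearman correlation for a numeric one. These are stand-ins chosen for illustration and are not necessarily the tests signeRFlow runs. The `age` feature and all values are hypothetical.

```r
# Toy data: exposure to one signature for 30 samples (simulated).
set.seed(7)
n <- 30
exposure <- rgamma(n, shape = 2, rate = 0.01)

# Categorical feature: is exposure different among groups of samples?
stage <- factor(rep(c("Stage I", "Stage II", "Stage IV"), each = 10))
kw <- kruskal.test(exposure ~ stage)

# Numeric feature: is exposure correlated with the feature?
age <- round(runif(n, min = 30, max = 80))   # hypothetical feature
ct <- cor.test(exposure, age, method = "spearman", exact = FALSE)

# Collect the p-values of both tests.
p_values <- c(categorical = kw$p.value, numeric = ct$p.value)
```

Survival features would call for dedicated survival models, which are left out of this sketch.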
Some analyses also offer a few parameters that you can adjust.
Instead of uploading a private dataset, signeRFlow allows you to explore exposure data previously estimated for samples from TCGA public datasets. The signeR algorithm was previously applied to genome samples from 33 cancer types, and the estimated mutational signatures and exposures were obtained for each cancer type. Known signatures from the COSMIC database were also fitted to TCGA mutation data, thus estimating the related exposures for each cancer type.
You can select the cancer type of interest and the analysis type on the sidebar. Also, samples can be filtered according to available features in the metadata.
The first time you click the TCGA Explorer button on the sidebar, signeRFlow will download all the necessary files (RData) according to the cancer study and analysis type.
Warning: The files are usually small but, depending on the cancer study, this process can take a while. A message will show the download and rendering progress.
Using the data summary table with all clinical data features downloaded from TCGA, you can select a feature to filter the dataset. According to the feature class, different options to filter will be shown.
It is not mandatory to filter the dataset; you can use all the cases. The aim of this resource is to allow you to explore the dataset and select the cases you want to work with.
Note: If you filter a dataset using the data summary table, it will be used on the downstream analysis, such as clustering and covariate.
As an example, we selected the feature ajcc_pathologic_stage from ACC cancer type and de novo analysis:
and applied the filter on the dataset, selecting only groups Stage I and Stage II:
For each change on feature and filters, the available plots are updated according to the filtered samples.
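Conceptually, the filter keeps the samples whose feature value belongs to the selected groups and then restricts the exposure data to those samples. A base R sketch with toy data follows. The sample ids are reused from the clinical data example rather than taken from TCGA, and the exposure values are random.

```r
# Toy clinical table with the feature used in the example above.
clinical <- data.frame(
  SampleID = c("PD3851a", "PD3890a", "PD3904a", "PD3905a", "PD3945a"),
  ajcc_pathologic_stage = c("Stage I", "Stage II", "Stage II",
                            "Stage IV", "Stage IV"),
  stringsAsFactors = FALSE
)

# Toy exposures: one row per sample, one column per signature.
set.seed(11)
exposures <- matrix(rgamma(5 * 2, shape = 2), nrow = 5,
                    dimnames = list(clinical$SampleID, c("S1", "S2")))

# Keep only the selected groups, then subset the exposures accordingly.
selected <- c("Stage I", "Stage II")
keep <- clinical$ajcc_pathologic_stage %in% selected
filtered_clinical <- clinical[keep, , drop = FALSE]
filtered_exposures <- exposures[filtered_clinical$SampleID, , drop = FALSE]
```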
As in the signeR analysis modules, the downstream analyses Clustering and Covariate are available in the TCGA Explorer module and work the same way, but you do not need to upload clinical data in this module.
As a reminder, at the top of the Covariate tab you will see information about the dataset and the filters used.
You can select a feature in the data summary table and perform a covariate analysis according to feature class.
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.31 R6_2.5.1 lifecycle_1.0.3 jsonlite_1.8.4
#> [5] magrittr_2.0.3 evaluate_0.20 stringi_1.7.12 rlang_1.0.6
#> [9] cachem_1.0.6 cli_3.6.0 jquerylib_0.1.4 bslib_0.4.2
#> [13] vctrs_0.5.1 rmarkdown_2.19 tools_4.2.2 stringr_1.5.0
#> [17] glue_1.6.2 xfun_0.36 yaml_2.3.6 fastmap_1.1.0
#> [21] compiler_4.2.2 htmltools_0.5.4 knitr_1.41 sass_0.4.4