signeR

Motivation: Cancer is an evolutionary process driven by the continuous acquisition of genetic variations in individual cells. The diversity and complexity of somatic mutational processes is a conspicuous feature orchestrated by DNA damage agents and repair processes, including exogenous or endogenous mutagen exposures, defects in DNA mismatch repair and enzymatic modification of DNA. Identifying the underlying mutational processes is central to understanding cancer origin and evolution.

The signeR package focuses on the estimation and further analysis of mutational signatures. The functionalities of this package can be divided into three categories. First, it provides tools to process VCF files and generate matrices of SNV mutation counts and mutational opportunities, both defined according to a 3bp context (mutation site and its neighbouring 3' and 5' bases). Second, these count matrices serve as input for the estimation of the underlying mutational signatures and the number of active mutational processes. Third, the package provides tools to correlate the activities of those signatures with other relevant information such as clinical data, in order to draw conclusions about the analysed genome samples, which can be useful for clinical applications. These tools include the Differential Exposure Score and the a posteriori sample classification.

Although signeR is intended for the estimation of mutational signatures, it actually provides a full Bayesian treatment of the non-negative matrix factorisation (NMF) model. Further details about the method can be found in Rosales & Drummond et al., 2016 (see section 6.1 below).
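As a rough illustration of the NMF model, the sketch below builds the expected count matrix as the product of a signature matrix and an exposure matrix. The names P and E follow section 5.1. The dimensions, the values, the features x samples orientation and the choice of signatures that sum to 1 are all assumptions made for this toy example; nothing here is estimated by signeR.

```r
# Toy NMF factors (made-up values, not estimated by signeR)
P <- matrix(c(0.7, 0.2, 0.1,     # signature S1 over 3 features
              0.1, 0.3, 0.6),    # signature S2 over 3 features
            nrow = 3,
            dimnames = list(paste0("feature", 1:3), c("S1", "S2")))
E <- matrix(c(100, 10,           # exposures of sample A to S1 and S2
               20, 80),          # exposures of sample B to S1 and S2
            nrow = 2,
            dimnames = list(c("S1", "S2"), c("A", "B")))

# Expected mutation counts (features x samples) under the factorisation
M_expected <- P %*% E
M_expected

# Both factors are non-negative, hence so is their product
all(P >= 0, E >= 0, M_expected >= 0)
```

Because each toy signature sums to 1, the column sums of M_expected equal the total exposure of each sample.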

2: Installation

Before installing, please make sure you have the latest version of R and Bioconductor installed.

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("signeR")

OS X users might experience compilation errors due to missing gfortran libraries. Please see section 6.2 below.

3: Preparing the input

signeR takes as input a count matrix of samples x features. Each feature is usually a SNV mutation within a 3bp context (96 features: 6 types of SNV mutations times 4 possibilities for the base on each side of the SNV change). Optionally, an opportunity matrix can also be provided, containing the number of occurrences of each feature in the whole analyzed region for each sample. Although not required, this argument is highly recommended because it allows signeR to normalize the feature frequencies over the analyzed region.
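To make the 96 features concrete, the following sketch enumerates them and tabulates a handful of toy mutation calls into a samples x features count matrix. It assumes the common convention of reporting substitutions from the pyrimidine base (C or T) and uses the "base change:triplet" labels described in section 3.2. The sample names and calls are hypothetical, and the column order is arbitrary.

```r
bases   <- c("A", "C", "G", "T")
changes <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")   # 6 SNV types

# All combinations of change, 5' base and 3' base: 6 * 4 * 4 = 96
grid <- expand.grid(three = bases, five = bases, change = changes,
                    stringsAsFactors = FALSE)
ref      <- substr(grid$change, 1, 1)          # the mutated (reference) base
features <- paste0(grid$change, ":", grid$five, ref, grid$three)
length(features)

# Toy variant calls (hypothetical samples and mutations)
calls <- data.frame(
  sample  = c("s1", "s1", "s1", "s2"),
  feature = c("C>A:TCG", "C>A:TCG", "T>C:ATA", "C>T:ACG")
)

# Tabulate into a samples x features count matrix; unseen features get 0
counts <- unclass(table(calls$sample,
                        factor(calls$feature, levels = features)))
dim(counts)
counts["s1", "C>A:TCG"]
```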

Input matrices can be read either from a VCF file or from a tab-delimited file, as described next.

3.1: Input from VCF

The VCF file format is the most common format for storing genetic variations. The signeR package includes a utility function for generating a count matrix from a VCF file:

library(signeR)
library(VariantAnnotation)

# BSgenome, equivalent to the one used on the variant call
library(BSgenome.Hsapiens.UCSC.hg19)

vcfobj <- readVcf("/path/to/a/file.vcf", "hg19")
mut <- genCountMatrixFromVcf(BSgenome.Hsapiens.UCSC.hg19, vcfobj)

This function will generate a matrix of mutation counts for each sample in the provided VCF. The opportunity matrix can also be generated from the reference genome (hg19 in the following case):

library(rtracklayer)

target_regions <- import(con="/path/to/a/target.bed", format="bed")
opp <- genOpportunityFromGenome(BSgenome.Hsapiens.UCSC.hg19,
    target_regions, nsamples=nrow(mut))

Here, target.bed is a BED file containing the genomic regions analyzed by the variant caller.
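Conceptually, an opportunity count is the number of times a given 3bp context occurs in the analyzed regions. The sketch below counts overlapping triplets in a short made-up sequence and repeats the result once per sample. It is only an illustration: it ignores the reverse strand and the mapping of triplets onto the 96 features, and it is not a description of how genOpportunityFromGenome works internally.

```r
# Toy reference sequence standing in for the target regions (made up)
region <- "ACGTACGTTCGA"
n <- nchar(region)

# Every overlapping 3 bp window in the region
triplets <- substring(region, 1:(n - 2), 3:n)
triplet_counts <- table(triplets)

# One identical row per sample (hypothetical number of samples)
nsamples <- 3
opp_toy <- matrix(rep(as.vector(triplet_counts), each = nsamples),
                  nrow = nsamples,
                  dimnames = list(paste0("sample", 1:nsamples),
                                  names(triplet_counts)))
opp_toy
```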

3.2: Input from tab-delimited file

By convention, the input file should be tab-delimited with sample names as row names and features as column names. Features should be referred to in the format "base change:triplet", e.g. "C>A:TCG". Similarly, the opportunity matrix can be provided in a tab-delimited file with the same structure as the mutation counts file. The files bundled with the package, loaded below, are examples of the required matrix format.

This tutorial uses as input the 21 breast cancer dataset described in Nik-Zainal et al., 2012. For convenience, this dataset is included with the package and can be accessed by using the system.file function:

mut <- read.table(system.file("extdata","21_breast_cancers.mutations.txt",
    package="signeR"), header=TRUE, check.names=FALSE)
opp <- read.table(system.file("extdata","21_breast_cancers.opportunity.txt",
    package="signeR"))
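The check.names=FALSE argument matters because feature names such as "C>A:TCG" are not syntactically valid R names. The following self-contained sketch writes a small made-up count matrix in the tab-delimited layout described above and reads it back with and without that argument. The matrix values and sample names are hypothetical.

```r
# Toy count matrix with two samples and three features (made-up values)
toy <- matrix(c(3L, 0L, 1L, 4L, 2L, 5L), nrow = 2,
              dimnames = list(c("sampleA", "sampleB"),
                              c("C>A:ACA", "C>A:TCG", "T>C:ATA")))

# Tab-delimited file: header of feature names, sample names as row names
path <- tempfile(fileext = ".txt")
write.table(toy, file = path, sep = "\t", quote = FALSE)

# check.names=FALSE keeps ">" and ":" in the feature names
kept    <- read.table(path, header = TRUE, check.names = FALSE)
# the default (check.names=TRUE) rewrites them into valid R names
mangled <- read.table(path, header = TRUE)

colnames(kept)
colnames(mangled)
```

With the default setting the column names no longer match the "base change:triplet" format, so the features could not be recognised downstream.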

4: Estimating the number of mutational processes and their signatures

signeR takes a count matrix as its only required parameter, but the user can provide an opportunity matrix as well. The number of signatures can be determined in one of three ways, as follows.
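A minimal sketch of how the three options might be invoked: searching over all candidate numbers of signatures, searching within a user-supplied interval, or fixing the number in advance. It assumes that signeR() accepts the arguments M, Opport, nlim and nsig, which should be checked against help(signeR). The wrapper estimate_signatures and the values c(2, 11) and 5 are hypothetical and only serve the illustration.

```r
# Sketch only: calling this function requires the signeR package, with `mut`
# and `opp` being the matrices loaded in section 3.2. The argument names
# M, Opport, nlim and nsig are assumptions to verify in help(signeR).
estimate_signatures <- function(mut, opp,
                                how = c("auto", "range", "fixed")) {
  how <- match.arg(how)
  switch(how,
    # let signeR search over all candidate numbers of signatures
    auto  = signeR::signeR(M = mut, Opport = opp),
    # restrict the search to an interval (illustrative bounds)
    range = signeR::signeR(M = mut, Opport = opp, nlim = c(2, 11)),
    # fix the number of signatures in advance (illustrative value)
    fixed = signeR::signeR(M = mut, Opport = opp, nsig = 5)
  )
}

# Example call (the MCMC run can take a while):
# signatures <- estimate_signatures(mut, opp, how = "range")
```

Each branch is the call one would otherwise type directly at the prompt; the wrapper only keeps the three variants side by side.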

The parameters testing_burn and testing_eval control the number of iterations used to estimate the number of signatures (default value is 1000 for both parameters). Other arguments may also be passed on to signeR; see signeR's manual by typing help(signeR).

Whenever signeR is left to decide which number of signatures is optimal, it will search for the rank Nsig that maximizes the median Bayesian Information Criterion (BIC). After the processing is done, this information can be plotted by the following command:

BICboxplot(signatures)

Boxplot of BIC values, showing that the optimal number of signatures for this dataset is 5.
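The selection rule described above (pick the rank with the highest median BIC) can be sketched in a few lines. The BIC values below are invented for the illustration and do not come from the breast cancer dataset.

```r
# Made-up BIC samples for four candidate numbers of signatures
bic_samples <- list(
  "3" = c(-5120, -5105, -5130),
  "4" = c(-5060, -5075, -5050),
  "5" = c(-5010, -5025, -5000),
  "6" = c(-5040, -5030, -5055)
)

# Median BIC per candidate rank, then the rank that maximizes it
medians   <- vapply(bic_samples, median, numeric(1))
best_nsig <- as.integer(names(medians)[which.max(medians)])

medians
best_nsig
```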

5: Results and further analyses

5.1: Plot the MCMC paths for the NMF parameters (P and E matrices)

The following instruction plots the MCMC sampled paths for each entry of the signature matrix P and of the exposure matrix E. Only post-burn-in paths are available for plotting. These plots are useful for checking whether the entries have leveled off, which indicates that the sampler has converged.

Paths(signatures$SignExposures)

Each plot shows the entries and exposures of one signature along sampler iterations.
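What "leveled off" means can be illustrated on a simulated chain. The sketch below, which does not use signeR output, builds a toy path with an initial drift, discards the drift as burn-in and compares the means of the two halves of what remains. The chain, the burn-in length and this informal check are assumptions for illustration and are not a formal convergence diagnostic.

```r
set.seed(42)

# Toy chain: 200 drifting iterations followed by 800 stationary ones
chain <- c(seq(5, 1, length.out = 200), rep(1, 800)) + rnorm(1000, sd = 0.05)

# Keep only the post-burn-in part, as in the plotted paths
post_burnin <- chain[-(1:200)]

# A path that has leveled off has similar means in its two halves
half <- rep(1:2, each = length(post_burnin) / 2)
half_means <- tapply(post_burnin, half, mean)
half_gap   <- abs(diff(half_means))

half_means
half_gap
```

A large gap between the two halves would suggest that more burn-in iterations are needed.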

5.2: Plot the signatures

Once the processing is done, the estimated signatures can be displayed in two charts, described below.

Signature barplot

SignPlot(signatures$SignExposures)

Signatures barplot with error bars reflecting the sample percentiles 0.05, 0.25, 0.75, and 0.95 for each entry.
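The percentiles behind such error bars can be computed directly from posterior samples. The sketch below uses simulated draws for three signature entries rather than real signeR output. The gamma distribution, the number of iterations and the use of the median as a central value are assumptions.

```r
set.seed(1)

# Toy posterior samples: 3 signature entries (rows) x 500 iterations
samples <- matrix(rgamma(3 * 500, shape = 5, rate = 50), nrow = 3,
                  dimnames = list(c("C>A:ACA", "C>A:ACC", "C>A:ACG"), NULL))

# Percentiles used for the error bars, plus the median as a central value
probs     <- c(0.05, 0.25, 0.5, 0.75, 0.95)
summary_q <- t(apply(samples, 1, quantile, probs = probs))

round(summary_q, 3)
```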

Signature heatmap

Estimated signatures can also be visualized in a heatmap, generated by the following command:

SignHeat(signatures$SignExposures)

5.3: Plot the exposures

The relative contribution of each signature to the inspected genomes can be displayed in several ways. signeR currently provides three alternatives.

Exposure boxplot

The levels of exposure to each signature in all genome samples can be plotted as follows:

ExposureBoxplot(signatures$SignExposures)

Exposure barplot

The relative contribution of the signatures to the mutations found on each genome sample can also be visualized by the following command:

ExposureBarplot(signatures$SignExposures)

Barplot showing the contributions of the signatures to the mutation counts of each genome sample.
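A relative contribution is simply each exposure divided by the total exposure of its sample. The sketch below applies this to a made-up exposure matrix, assumed to be oriented as signatures x samples. It is not extracted from a signeR result.

```r
# Toy exposure matrix: signatures (rows) x samples (columns), made-up values
E <- matrix(c(120, 30, 50,
               10, 60, 30),
            nrow = 3,
            dimnames = list(c("S1", "S2", "S3"), c("sampleA", "sampleB")))

# Share of each signature within each sample (every column sums to 1)
rel <- prop.table(E, margin = 2)
round(rel, 2)

# A stacked barplot of these shares could then be drawn with barplot(rel)
```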

Exposure heatmap

ExposureHeat(signatures$SignExposures)

Heatmap showing the exposures for each genome sample. Samples are grouped according to their levels of exposure to the signatures, as can be seen in the dendrogram on the left.
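The grouping shown by a dendrogram comes from hierarchical clustering of the samples. A minimal sketch with base R follows, using a made-up exposure matrix and the default settings of dist and hclust. The distance and linkage that ExposureHeat uses are not stated here, so those defaults are an assumption.

```r
# Toy exposures: 2 signatures (rows) x 4 samples (columns), made-up values
E <- matrix(c(10, 1,  11, 2,  1, 10,  2, 12), nrow = 2,
            dimnames = list(c("S1", "S2"), paste0("sample", 1:4)))

# Cluster the samples on their exposure profiles
hc     <- hclust(dist(t(E)))
groups <- cutree(hc, k = 2)
groups

# plot(hc) would draw the corresponding dendrogram
```

Samples 1 and 2 are dominated by S1 and samples 3 and 4 by S2, so they fall into two separate clusters.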

5.4: Differential Exposure Analysis

signeR can highlight signatures that are differentially active among previously defined groups of samples. To perform this task signeR needs a vector of group labels. In this example the samples were divided according to germline mutations at BRCA genes: groups wt, BRCA1+ and BRCA2+, taken from the description of the 21 breast cancer data set.

# group labels, respective to each row of the mutation count matrix
BRCA_labels <- c("wt","BRCA1+","BRCA2+","BRCA1+","BRCA2+","BRCA1+","BRCA1+",
    "wt","wt","wt","wt","BRCA1+","wt","BRCA2+","BRCA2+","wt","wt","wt",
    "wt","wt","wt")

diff_exposure <- DiffExp(signatures$SignExposures, labels=BRCA_labels)

Top chart: DES plot showing that the BRCA+ samples were significantly more exposed to signatures S3, S4 and S5. Bottom chart: plots showing the significant differences found when groups are compared against each other. These last plots are generated only when there are more than two groups in the analysis and any signature is found to be differentially active. Groups marked by the same letter are not significantly different for the corresponding signature.

The Pvquant vector holds, for each signature, a quantile of the p-values of the test (by default the 0.5 quantile, i.e. the median). The logarithms of these values are the Differential Exposure Scores (DES). Signatures with Pvquant values below the cutoff, 0.05 by default, are considered differentially exposed.

# pvalues
diff_exposure$Pvquant

## [1] 0.7452909713 0.4481633903 0.0017156464 0.0006192429 0.0303415234
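The idea of taking a quantile of p-values over MCMC samples can be sketched with simulated exposures for a single signature. The Kruskal-Wallis test, the group sizes and the simulated values are all assumptions made for the illustration. The test that DiffExp applies internally should be checked in help(DiffExp).

```r
set.seed(7)
labels <- rep(c("wt", "BRCA+"), each = 4)      # hypothetical groups, 8 samples

# Toy MCMC output for one signature: 200 iterations (rows) x 8 samples
group_mean <- ifelse(labels == "wt", 1, 5)
expo <- t(replicate(200, rnorm(8, mean = group_mean, sd = 0.1)))

# One test per MCMC iteration, then a quantile of the resulting p-values
pvals   <- apply(expo, 1,
                 function(e) kruskal.test(e, factor(labels))$p.value)
pvquant <- quantile(pvals, probs = 0.5, names = FALSE)

des <- log(pvquant)      # score as defined in the text
pvquant
pvquant < 0.05           # compared against the default cutoff
```

Applied to the values printed above, the same cutoff flags the third, fourth and fifth signatures.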

The MostExposed vector contains the name of the group where each differentially exposed signature showed the highest levels of activity.

# most exposed group
diff_exposure$MostExposed

## [1] NA       NA       "BRCA2+" "BRCA1+" "BRCA1+"

5.5: Sample Classification

signeR can also classify unlabeled samples based on the given labels. In order to do this, those samples must correspond to NA values in the labels vector and the Classify function can be used to assign them to one of the defined groups. This example uses the sample labels defined in the DES analysis performed previously.

# note that BRCA_labels[15], [20] and [21] are set to NA
BRCA_labels <- c("wt","BRCA+","BRCA+","BRCA+","BRCA+","BRCA+","BRCA+","wt","wt",
    "wt","wt","BRCA+","wt","BRCA+",NA,"wt","wt","wt","wt",NA,NA)

Class <- Classify(signatures$SignExposures, labels=BRCA_labels)

Barplot showing the relative frequencies of assignment of each unlabeled sample to the selected group.

# Final assignments
Class$class

## [1] "BRCA+" "wt"    "wt"

# Relative frequencies of assignment to selected groups
Class$freq

## PD4116a PD4199a PD4248a 
##       1       1       1

# All assignment frequencies
Class$allfreqs

##       PD4116a-BRCA+ PD4199a-wt PD4248a-wt
## BRCA+           100          0          0
## wt                0        100        100
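The relation between the three outputs can be sketched from a table of per-iteration assignments. The votes below are made up, and the sketch starts from assignments that have already been made. It does not reproduce the classifier that Classify uses.

```r
# Toy per-iteration assignments for two unlabeled samples (made-up votes)
votes <- cbind(
  sampleX = c("BRCA+", "BRCA+", "wt", "BRCA+", "BRCA+"),
  sampleY = c("wt", "wt", "wt", "BRCA+", "wt")
)
groups <- c("BRCA+", "wt")

# Percentage of iterations assigning each sample to each group
allfreqs <- sapply(colnames(votes), function(s)
  100 * as.vector(table(factor(votes[, s], levels = groups))) / nrow(votes))
rownames(allfreqs) <- groups

# Final class: the most frequent group; freq: its relative frequency
final_class <- groups[apply(allfreqs, 2, which.max)]
freq        <- apply(allfreqs, 2, max) / 100

allfreqs
final_class
freq
```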

6: Frequently Asked Questions

6.1: Citing signeR

citation("signeR")

## 
##   Rafael A. Rosales, Rodrigo D. Drummond, Renan Valieris, Emmanuel
##   Dias-Neto, and Israel T. da Silva (2016): signeR: An empirical
##   Bayesian approach to mutational signature discovery.
##   Bioinformatics September 1, 2016
##   doi:10.1093/bioinformatics/btw572
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {signeR: An empirical Bayesian approach to mutational signature discovery.},
##     author = {Rafael A. Rosales and Rodrigo D. Drummond and Renan Valieris and Emmanuel Dias-Neto and Israel T. da Silva},
##     year = {2016},
##     journal = {Bioinformatics},
##     doi = {10.1093/bioinformatics/btw572},
##   }

6.2: Compilation errors on OS X

This problem arises when the machine is missing gfortran libraries necessary to compile RcppArmadillo and signeR. To install the missing libraries, execute these lines on a terminal:

6.3: Missing library headers

Some packages that signeR depends on require third-party library headers to be installed. If you see errors like:

It means you need to install these headers with your package manager. For example, on Ubuntu:

Debug Info

sessionInfo()

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] signeR_1.4.0        NMF_0.20.6          bigmemory_4.5.19   
##  [4] bigmemory.sri_0.1.3 Biobase_2.38.0      BiocGenerics_0.24.0
##  [7] cluster_2.0.6       rngtools_1.2.4      pkgmaker_0.22      
## [10] registry_0.3        knitr_1.17         
## 
## loaded via a namespace (and not attached):
##  [1] RMySQL_0.10.13             bit64_0.9-7               
##  [3] foreach_1.4.3              assertthat_0.2.0          
##  [5] highr_0.6                  stats4_3.4.2              
##  [7] blob_1.1.0                 BSgenome_1.46.0           
##  [9] GenomeInfoDbData_0.99.1    Rsamtools_1.30.0          
## [11] progress_1.1.2             RSQLite_2.0               
## [13] lattice_0.20-35            digest_0.6.12             
## [15] GenomicRanges_1.30.0       RColorBrewer_1.1-2        
## [17] XVector_0.18.0             colorspace_1.3-2          
## [19] Matrix_1.2-11              plyr_1.8.4                
## [21] XML_3.98-1.9               biomaRt_2.34.0            
## [23] zlibbioc_1.24.0            xtable_1.8-2              
## [25] scales_0.5.0               BiocParallel_1.12.0       
## [27] tibble_1.3.4               IRanges_2.12.0            
## [29] ggplot2_2.2.1              SummarizedExperiment_1.8.0
## [31] GenomicFeatures_1.30.0     lazyeval_0.2.1            
## [33] magrittr_1.5               mime_0.5                  
## [35] memoise_1.1.0              evaluate_0.10.1           
## [37] doParallel_1.0.11          class_7.3-14              
## [39] PMCMR_4.1                  tools_3.4.2               
## [41] prettyunits_1.0.2          matrixStats_0.52.2        
## [43] gridBase_0.4-7             stringr_1.2.0             
## [45] S4Vectors_0.16.0           munsell_0.4.3             
## [47] DelayedArray_0.4.0         AnnotationDbi_1.40.0      
## [49] Biostrings_2.46.0          compiler_3.4.2            
## [51] GenomeInfoDb_1.14.0        rlang_0.1.2               
## [53] grid_3.4.2                 RCurl_1.95-4.8            
## [55] nloptr_1.0.4               iterators_1.0.8           
## [57] VariantAnnotation_1.24.0   bitops_1.0-6              
## [59] gtable_0.2.0               codetools_0.2-15          
## [61] DBI_0.7                    markdown_0.8              
## [63] reshape2_1.4.2             R6_2.2.2                  
## [65] GenomicAlignments_1.14.0   rtracklayer_1.38.0        
## [67] bit_1.1-12                 stringi_1.1.5             
## [69] Rcpp_0.12.13
