Inferring Somatic Signatures from Single Nucleotide Variant Calls
Table of Contents
- 1. Motivation: The Concept Behind Mutational Signatures
- 2. Methodology: From Mutations to Somatic Signatures
- 3. Workflow and Implementation: Analysis with the SomaticSignatures Package
- 4. Use case: Estimating Somatic Signatures from TCGA WES Studies
- 4.1. Data: Preproccessing of the TCGA WES Studies
- 4.2. Motifs: Extracting the Sequence Context of Somatic Variants
- 4.3. Decomposition: Inferring Somatic Signatures
- 4.4. Visualization: Exploration of Signatures and Samples
- 4.5. Extensions: Normalization of Sequence Motif Frequencies and Batch Effects
- 5. Alternatives: Inferring Somatic Signatures with Different Approaches
- 6. Frequently Asked Questions
- 7. References
- 8. Session Info
1 Motivation: The Concept Behind Mutational Signatures
Recent publications introduced the concept of identifying mutational signatures from cancer sequencing studies and linked them to potential mutation generation processes [6] [14] [11]. Conceptually, this relates somatically occurring single nucleotide variants (SNVs) to the surrounding sequence which will be referred to as mutational or somatic motifs in the following. Based on the frequency of the motifs occurring in multiple samples, these can be decomposed mathematically into so called mutational signatures. In case of the investigation of tumors, the term somatic signatures will be used here to distinguish them from germline mutations and their generating processes.
The SomaticSignatures
package provides an efficient and user-friendly
implementation for the extraction of somatic motifs based on a list of
somatically mutated genomic sites and the estimation of somatic signatures with
a number of statistical approaches.
2 Methodology: From Mutations to Somatic Signatures
The basic idea of somatic signatures is composed of two parts:
Firstly, each somatic mutation is described in relation of the sequence context
in which it occurs. As an example, consider a SNV, resulting in the alteration
from A
in the normal to G
in the tumor sample, that is embedded in the sequence
context TAC
. Thus, the somatic motif can be written as TAC>TGC
or T.C
A>G
. The frequency of these motifs across multiple samples is then represented
as a matrix \(M_{ij}\), where \(i\) counts over the motifs and \(j\) over the samples.
In a second step, the matrix \(M\) is numerically decomposed into two matrices \(W\) and \(H\)
$$M_{ij} = \sum_{k=1}^{R} W_{ik} H_{jk}$$
for a fixed number \(R\) of signatures. While \(W\) describes the composition of each signature in term of somatic motifs, \(H\) shows the contribution of the signature to the alterations present in each sample.
3 Workflow and Implementation: Analysis with the SomaticSignatures Package
The SomaticSignatures
package offers a framework for inferring signatures of
SNVs in a user-friendly and efficient manner for large-scale data sets. A tight
integration with standard data representations of the Bioconductor
project
[1] was a major design goal. Further, it extends
the selection of multivariate statistical methods for the matrix decomposition
and allows a simple visualization of the results.
For a typical workflow, a set of variant calls and the reference sequence are
needed. Ideally, the SNVs are represented as a VRanges
object with the
genomic location as well as reference and alternative allele defined. The
reference sequence can be, for example, a FaFile
object, representing an
indexed FASTA file, a BSgenome
object, or a GmapGenome
object.
Alternatively, we provide functions to extract the relevant information from
other sources of inputs. At the moment, this covers the MuTect
[12] variant caller and the h5vc package
[15] [9].
Generally, the individual steps of the analysis can be summarized as:
- The somatic motifs for each variant are retrieved from the reference sequence
with the
mutationContext
function and converted to a matrix representation with themutationContextMatrix
function. - Somatic signatures are estimated with a method of choice (currently
nmfSignatures
,pcaSignatures
, orkmeansSignatures
). - The somatic signatures and their representation in the samples are assessed with a set of accessor and plotting functions.
To decompose \(M\), the SomaticSignatures
package implements three methods:
- Non-negative matrix factorization (NMF)
- The NMF decomposes \(M\) with the constraint of positive components in \(W\) and \(H\) [4]. The method was used [6] for the identification of mutational signatures, and can be computationally expensive for large data sets.
- Principal component analysis (PCA)
- The PCA employs the eigenvalue decomposition and is therefore suitable for large data sets [2]. While this is related to the NMF, no constraint on the sign of the elements of \(W\) and \(H\) exists.
- K-means clustering
- In contrast to the NMF and PCA, the k-means clustering assigns each sample to a single somatic signature, such that \(H\) can only contain the value 0 and 1. This approach is suited for large data sets.
4 Use case: Estimating Somatic Signatures from TCGA WES Studies
In the following, the concept of somatic signatures and the steps for inferring
these from an actual biological data set are shown. For the example, somatic
variant calls from whole exome sequencing (WES) studies from The Cancer Genome
Atlas (TCGA) project will be used, which are part of the
SomaticCancerAlterations
package.
library(SomaticSignatures)
library(GenomicRanges) library(VariantAnnotation) library(ggplot2) library(stringr)
library(SomaticCancerAlterations) library(BSgenome.Hsapiens.UCSC.hg19)
4.1 Data: Preproccessing of the TCGA WES Studies
The SomaticCancerAlterations
package provides the somatic SNV calls for eight
WES studies, each investigating a different cancer type. The metadata
summarizes the biological and experimental settings of each study.
sca_metadata = scaMetadata() print(sca_metadata)
Cancer_Type Center NCBI_Build Sequence_Source Sequencing_Phase Sequencer Number_Samples gbm_tcga GBM broad.mit.edu 37 WXS Phase_I Illumina GAIIx 291 hnsc_tcga HNSC broad.mit.edu 37 Capture Phase_I Illumina GAIIx 319 kirc_tcga KIRC broad.mit.edu 37 Capture Phase_I Illumina GAIIx 297 luad_tcga LUAD broad.mit.edu 37 WXS Phase_I Illumina GAIIx 538 lusc_tcga LUSC broad.mit.edu 37 WXS Phase_I Illumina GAIIx 178 ov_tcga OV broad.mit.edu 37 WXS Phase_I Illumina GAIIx 142 skcm_tcga SKCM broad.mit.edu 37 Capture Phase_I Illumina GAIIx 266 thca_tcga THCA broad.mit.edu 37 WXS Phase_I Illumina GAIIx 406 Number_Patients Cancer_Name gbm_tcga 291 Glioblastoma multiforme hnsc_tcga 319 Head and Neck squamous cell carcinoma kirc_tcga 293 Kidney Chromophobe luad_tcga 519 Lung adenocarcinoma lusc_tcga 178 Lung squamous cell carcinoma ov_tcga 142 Ovarian serous cystadenocarcinoma skcm_tcga 264 Skin Cutaneous Melanoma thca_tcga 403 Thyroid carcinoma
In this example, all mutational calls of a study will be pooled together, in
order to find signatures related to a specific cancer type. The data of all
studies is loaded and merged into a single GRanges
object, with each entry
describing a somatic variant call. Further on, only SNVs located on the human
autosomes will be considered. For later analyzes, each variant is also
associated with the study it originated from.
sca_all = scaLoadDatasets() sca_merge = unlist(sca_all) short_names = str_split_fixed(rownames(sca_metadata), "_", 2)[, 1] names(sca_merge) = sca_merge$study = factor(rep(short_names, times = elementLengths(sca_all))) sca_merge = sca_merge[sca_merge$Variant_Type %in% "SNP"] sca_merge = keepSeqlevels(sca_merge, hsAutosomes())
To get a first impression of the data, we count the number of somatic variants per study and their predicted effects.
sort(table(sca_merge$study), decreasing = TRUE)
luad skcm hnsc lusc kirc gbm thca ov 208724 200589 67125 61485 24158 19938 6716 5872
sort(table(sca_merge$Variant_Classification), decreasing = TRUE)
Missense_Mutation Silent Nonsense_Mutation Splice_Site RNA 377800 163535 27299 13934 11285 Nonstop_Mutation Translation_Start_Site Intron IGR 3'UTR 441 270 33 5 3 5'Flank 5'UTR Frame_Shift_Del Frame_Shift_Ins In_Frame_Del 1 1 0 0 0 In_Frame_Ins 0
The starting point of the analysis is a VRanges
object which describes the
somatic variants in terms of their genomic locations as well as reference and
alternative alleles. For more details about this class and how to construct it,
please see the documentation of the VariantAnnotation
package
[5]. Since the genomic positions are given
in the NCBI notation and the references used later are in UCSC notation, the
functions ucsc
and ncbi
are used to easily switch between the two notations.
sca_vr = VRanges(seqnames(sca_merge), ranges(sca_merge), ref = sca_merge$Reference_Allele, alt = sca_merge$Tumor_Seq_Allele2, seqinfo = seqinfo(sca_merge)) sca_vr = ucsc(sca_vr) head(sca_vr, 3)
VRanges with 3 ranges and 0 metadata columns: seqnames ranges strand ref alt totalDepth refDepth altDepthgbm chr1 [887446, 887446] + G A gbm chr1 [909247, 909247] + C T gbm chr1 [978952, 978952] + C T sampleNames softFilterMatrix gbm gbm gbm --- seqlengths: chr1 chr2 chr3 chr4 chr5 chr6 ... chr18 chr19 chr20 chr21 chr22 249250621 243199373 198022430 191154276 180915260 171115067 ... 78077248 59128983 63025520 48129895 51304566 hardFilters: NULL
4.2 Motifs: Extracting the Sequence Context of Somatic Variants
In a first step, the sequence motif for each variant is extracted based on the
reference sequence. Here, the BSgenomes
object for the human hg19 reference
is used. However, all objects with a defined getSeq
method can serve as the
reference, e.g. an indexed FASTA file. Additionally, we transform all motifs to
have a pyrimidine base (C
or T
) as a reference base
[14].
sca_motifs = mutationContext(sca_vr, BSgenome.Hsapiens.UCSC.hg19, unify = TRUE)
sca_motifs$study = sca_merge$study head(sca_motifs, 3)
VRanges with 3 ranges and 3 metadata columns: seqnames ranges strand ref alt totalDepth refDepth altDepthgbm chr1 [887446, 887446] + G A gbm chr1 [909247, 909247] + C T gbm chr1 [978952, 978952] + C T sampleNames softFilterMatrix | alteration context study | gbm | CT G.G gbm gbm | CT A.G gbm gbm | CT G.G gbm --- seqlengths: chr1 chr2 chr3 chr4 chr5 chr6 ... chr18 chr19 chr20 chr21 chr22 249250621 243199373 198022430 191154276 180915260 171115067 ... 78077248 59128983 63025520 48129895 51304566 hardFilters: NULL
To continue with the estimation of the somatic signatures, the matrix \(M\) of the
form {motifs × studies} is constructed. The normalize
argument specifies
that frequencies rather than the actual counts are returned.
sca_occurrence = mutationContextMatrix(sca_motifs, group = "study", normalize = TRUE) head(round(sca_occurrence, 4))
gbm hnsc kirc luad lusc ov skcm thca CA A.A 0.0083 0.0098 0.0126 0.0200 0.0165 0.0126 0.0014 0.0077 CA A.C 0.0093 0.0082 0.0121 0.0217 0.0156 0.0192 0.0009 0.0068 CA A.G 0.0026 0.0061 0.0046 0.0144 0.0121 0.0060 0.0004 0.0048 CA A.T 0.0057 0.0051 0.0070 0.0134 0.0100 0.0092 0.0007 0.0067 CA C.A 0.0075 0.0143 0.0215 0.0414 0.0390 0.0128 0.0060 0.0112 CA C.C 0.0075 0.0111 0.0138 0.0415 0.0275 0.0143 0.0018 0.0063
The observed occurrence of the motifs, also termed somatic spectrum, can be visualized across studies, which gives a first impression of the data. The distribution of the motifs clearly varies between the studies.
plotSamplesObserved(sca_motifs)
4.3 Decomposition: Inferring Somatic Signatures
The somatic signatures can be estimated with each of the statistical methods
implemented in the package. Here, we will use the NMF
and PCA
, and compare
the results. Prior to the estimation, the number \(R\) of signatures to obtain has to
be fixed; in this example, the data will be decomposed into 5 signatures.
n_sigs = 5 sigs_nmf = nmfSignatures(sca_occurrence, r = n_sigs) sigs_pca = pcaSignatures(sca_occurrence, r = n_sigs)
The results contains the decomposed matrices stored in a list and can be accessed using standard R accessor functions.
names(sigs_nmf)
[1] "w" "h" "v" "raw"
sapply(sigs_nmf, dim)
$w [1] 96 5 $h [1] 8 5 $v [1] 96 8 $raw [1] 96 8 5
head(sigs_nmf$w, 3)
S1 S2 S3 S4 S5 CA A.A 5.107e-06 2.375e-16 2.047e-13 1.279 0.52570 CA A.C 6.276e-02 2.220e-16 2.220e-16 1.320 0.47930 CA A.G 1.212e-14 1.962e-07 2.792e-03 1.012 0.06564
head(sigs_nmf$h, 3)
S1 S2 S3 S4 S5 gbm 0.037329 0.002054 0.002269 0.002543 0.007775 hnsc 0.008512 0.008109 0.010338 0.006018 0.004097 kirc 0.008999 0.003468 0.001181 0.004720 0.015132
4.4 Visualization: Exploration of Signatures and Samples
To explore the results for the TCGA data set, we will use the plotting
functions. All figures are generated with the ggplot2
package, and thus,
their properties and appearances can also be modified at a later stage.
Focusing on the results of the NMF first, the five somatic signatures (named S1 to S5) can be visualized either as a heatmap or as a barchart.
plotSignatureMap(sigs_nmf) + ggtitle("Somatic Signatures: NMF - Heatmap")
plotSignatures(sigs_nmf) + ggtitle("Somatic Signatures: NMF - Barchart")
Each signature represents different properties of the somatic spectrum observed
in the data. While signature S1 is mainly characterized by selective C>T
alterations,
others as S4 and S5 show a broad distribution across the motifs.
In addition, the contribution of the signatures in each study can be represented with the same sets of plots. Signature S1 and S3 are strongly represented in the GBM and SKCM study, respectively. Other signatures show a weaker association with a single cancer type.
plotSampleMap(sigs_nmf)
plotSamples(sigs_nmf)
In the same way as before, the results of the PCA can be visualized. In contrast to the NMF, the signatures also contain negative values, indicating the depletion of a somatic motif.
plotSignatureMap(sigs_pca) + ggtitle("Somatic Signatures: PCA")