Inferring Somatic Signatures from Single Nucleotide Variant Calls

1. Motivation: The Concept Behind Mutational Signatures
2. Methodology: From Mutations to Somatic Signatures
3. Workflow and Implementation: Analysis with the SomaticSignatures Package
4. Use case: Estimating Somatic Signatures from TCGA WES Studies
5. Alternatives: Inferring Somatic Signatures with Different Approaches
6. Frequently Asked Questions
7. References
8. Session Info

Julian Gehring (EMBL Heidelberg)

1 Motivation: The Concept Behind Mutational Signatures

Recent publications introduced the concept of identifying mutational signatures from cancer sequencing studies and linked them to potential mutation generation processes [6] [14] [11]. Conceptually, this relates somatically occurring single nucleotide variants (SNVs) to the surrounding sequence which will be referred to as mutational or somatic motifs in the following. Based on the frequency of the motifs occurring in multiple samples, these can be decomposed mathematically into so called mutational signatures. In case of the investigation of tumors, the term somatic signatures will be used here to distinguish them from germline mutations and their generating processes.

The SomaticSignatures package provides an efficient and user-friendly implementation for the extraction of somatic motifs based on a list of somatically mutated genomic sites and the estimation of somatic signatures with a number of statistical approaches.

2 Methodology: From Mutations to Somatic Signatures

The basic idea of somatic signatures is composed of two parts:

Firstly, each somatic mutation is described in relation of the sequence context in which it occurs. As an example, consider a SNV, resulting in the alteration from A in the normal to G in the tumor sample, that is embedded in the sequence context TAC. Thus, the somatic motif can be written as TAC>TGC or T.C A>G. The frequency of these motifs across multiple samples is then represented as a matrix $M_{ij}$, where $i$ counts over the motifs and $j$ over the samples.

In a second step, the matrix $M$ is numerically decomposed into two matrices $W$ and $H$

$$M_{ij} = \sum_{k=1}^{R} W_{ik} H_{jk}$$

for a fixed number $R$ of signatures. While $W$ describes the composition of each signature in term of somatic motifs, $H$ shows the contribution of the signature to the alterations present in each sample.

3 Workflow and Implementation: Analysis with the SomaticSignatures Package

The SomaticSignatures package offers a framework for inferring signatures of SNVs in a user-friendly and efficient manner for large-scale data sets. A tight integration with standard data representations of the Bioconductor project [1] was a major design goal. Further, it extends the selection of multivariate statistical methods for the matrix decomposition and allows a simple visualization of the results.

For a typical workflow, a set of variant calls and the reference sequence are needed. Ideally, the SNVs are represented as a VRanges object with the genomic location as well as reference and alternative allele defined. The reference sequence can be, for example, a FaFile object, representing an indexed FASTA file, a BSgenome object, or a GmapGenome object. Alternatively, we provide functions to extract the relevant information from other sources of inputs. At the moment, this covers the MuTect [12] variant caller and the h5vc package [15] [9].

Generally, the individual steps of the analysis can be summarized as:

The somatic motifs for each variant are retrieved from the reference sequence with the mutationContext function and converted to a matrix representation with the mutationContextMatrix function.
Somatic signatures are estimated with a method of choice (currently nmfSignatures, pcaSignatures, or kmeansSignatures).
The somatic signatures and their representation in the samples are assessed with a set of accessor and plotting functions.

To decompose $M$, the SomaticSignatures package implements three methods:

Non-negative matrix factorization (NMF): The NMF decomposes $M$ with the constraint of positive components in $W$ and $H$ [4]. The method was used [6] for the identification of mutational signatures, and can be computationally expensive for large data sets.
Principal component analysis (PCA): The PCA employs the eigenvalue decomposition and is therefore suitable for large data sets [2]. While this is related to the NMF, no constraint on the sign of the elements of $W$ and $H$ exists.
K-means clustering: In contrast to the NMF and PCA, the k-means clustering assigns each sample to a single somatic signature, such that $H$ can only contain the value 0 and 1. This approach is suited for large data sets.

4 Use case: Estimating Somatic Signatures from TCGA WES Studies

In the following, the concept of somatic signatures and the steps for inferring these from an actual biological data set are shown. For the example, somatic variant calls from whole exome sequencing (WES) studies from The Cancer Genome Atlas (TCGA) project will be used, which are part of the SomaticCancerAlterations package.

library(SomaticSignatures)

library(GenomicRanges)
library(VariantAnnotation)
library(ggplot2)
library(stringr)

library(SomaticCancerAlterations)
library(BSgenome.Hsapiens.UCSC.hg19)

4.1 Data: Preproccessing of the TCGA WES Studies

The SomaticCancerAlterations package provides the somatic SNV calls for eight WES studies, each investigating a different cancer type. The metadata summarizes the biological and experimental settings of each study.

sca_metadata = scaMetadata()

print(sca_metadata)

             Cancer_Type        Center NCBI_Build Sequence_Source Sequencing_Phase      Sequencer Number_Samples
   gbm_tcga          GBM broad.mit.edu         37             WXS          Phase_I Illumina GAIIx            291
   hnsc_tcga        HNSC broad.mit.edu         37         Capture          Phase_I Illumina GAIIx            319
   kirc_tcga        KIRC broad.mit.edu         37         Capture          Phase_I Illumina GAIIx            297
   luad_tcga        LUAD broad.mit.edu         37             WXS          Phase_I Illumina GAIIx            538
   lusc_tcga        LUSC broad.mit.edu         37             WXS          Phase_I Illumina GAIIx            178
   ov_tcga            OV broad.mit.edu         37             WXS          Phase_I Illumina GAIIx            142
   skcm_tcga        SKCM broad.mit.edu         37         Capture          Phase_I Illumina GAIIx            266
   thca_tcga        THCA broad.mit.edu         37             WXS          Phase_I Illumina GAIIx            406
             Number_Patients                           Cancer_Name
   gbm_tcga              291               Glioblastoma multiforme
   hnsc_tcga             319 Head and Neck squamous cell carcinoma
   kirc_tcga             293                    Kidney Chromophobe
   luad_tcga             519                   Lung adenocarcinoma
   lusc_tcga             178          Lung squamous cell carcinoma
   ov_tcga               142     Ovarian serous cystadenocarcinoma
   skcm_tcga             264               Skin Cutaneous Melanoma
   thca_tcga             403                    Thyroid carcinoma

In this example, all mutational calls of a study will be pooled together, in order to find signatures related to a specific cancer type. The data of all studies is loaded and merged into a single GRanges object, with each entry describing a somatic variant call. Further on, only SNVs located on the human autosomes will be considered. For later analyzes, each variant is also associated with the study it originated from.

sca_all = scaLoadDatasets()

sca_merge = unlist(sca_all)
short_names = str_split_fixed(rownames(sca_metadata), "_", 2)[, 1]
names(sca_merge) = sca_merge$study = factor(rep(short_names, times = elementLengths(sca_all)))

sca_merge = sca_merge[sca_merge$Variant_Type %in% "SNP"]
sca_merge = keepSeqlevels(sca_merge, hsAutosomes())

To get a first impression of the data, we count the number of somatic variants per study and their predicted effects.

sort(table(sca_merge$study), decreasing = TRUE)

   
     luad   skcm   hnsc   lusc   kirc    gbm   thca     ov 
   208724 200589  67125  61485  24158  19938   6716   5872

sort(table(sca_merge$Variant_Classification), decreasing = TRUE)

   
        Missense_Mutation                 Silent      Nonsense_Mutation            Splice_Site                    RNA 
                   377800                 163535                  27299                  13934                  11285 
         Nonstop_Mutation Translation_Start_Site                 Intron                    IGR                  3'UTR 
                      441                    270                     33                      5                      3 
                  5'Flank                  5'UTR        Frame_Shift_Del        Frame_Shift_Ins           In_Frame_Del 
                        1                      1                      0                      0                      0 
             In_Frame_Ins 
                        0

The starting point of the analysis is a VRanges object which describes the somatic variants in terms of their genomic locations as well as reference and alternative alleles. For more details about this class and how to construct it, please see the documentation of the VariantAnnotation package [5]. Since the genomic positions are given in the NCBI notation and the references used later are in UCSC notation, the functions ucsc and ncbi are used to easily switch between the two notations.

sca_vr = VRanges(seqnames(sca_merge), ranges(sca_merge), ref = sca_merge$Reference_Allele, alt = sca_merge$Tumor_Seq_Allele2,
    seqinfo = seqinfo(sca_merge))
sca_vr = ucsc(sca_vr)

head(sca_vr, 3)

   VRanges with 3 ranges and 0 metadata columns:
         seqnames           ranges strand         ref              alt     totalDepth       refDepth       altDepth
                           
     gbm     chr1 [887446, 887446]      +           G                A                                 
     gbm     chr1 [909247, 909247]      +           C                T                                 
     gbm     chr1 [978952, 978952]      +           C                T                                 
           sampleNames softFilterMatrix
                  
     gbm                           
     gbm                           
     gbm                           
     ---
     seqlengths:
           chr1      chr2      chr3      chr4      chr5      chr6 ...     chr18     chr19     chr20     chr21     chr22
      249250621 243199373 198022430 191154276 180915260 171115067 ...  78077248  59128983  63025520  48129895  51304566
     hardFilters: NULL

4.2 Motifs: Extracting the Sequence Context of Somatic Variants

In a first step, the sequence motif for each variant is extracted based on the reference sequence. Here, the BSgenomes object for the human hg19 reference is used. However, all objects with a defined getSeq method can serve as the reference, e.g. an indexed FASTA file. Additionally, we transform all motifs to have a pyrimidine base (C or T) as a reference base [14].

sca_motifs = mutationContext(sca_vr, BSgenome.Hsapiens.UCSC.hg19, unify = TRUE)

sca_motifs$study = sca_merge$study

head(sca_motifs, 3)

   VRanges with 3 ranges and 3 metadata columns:
         seqnames           ranges strand         ref              alt     totalDepth       refDepth       altDepth
                           
     gbm     chr1 [887446, 887446]      +           G                A                                 
     gbm     chr1 [909247, 909247]      +           C                T                                 
     gbm     chr1 [978952, 978952]      +           C                T                                 
           sampleNames softFilterMatrix |     alteration        context    study
                   |   
     gbm                            |             CT            G.G      gbm
     gbm                            |             CT            A.G      gbm
     gbm                            |             CT            G.G      gbm
     ---
     seqlengths:
           chr1      chr2      chr3      chr4      chr5      chr6 ...     chr18     chr19     chr20     chr21     chr22
      249250621 243199373 198022430 191154276 180915260 171115067 ...  78077248  59128983  63025520  48129895  51304566
     hardFilters: NULL

To continue with the estimation of the somatic signatures, the matrix $M$ of the form {motifs × studies} is constructed. The normalize argument specifies that frequencies rather than the actual counts are returned.

sca_occurrence = mutationContextMatrix(sca_motifs, group = "study", normalize = TRUE)

head(round(sca_occurrence, 4))

             gbm   hnsc   kirc   luad   lusc     ov   skcm   thca
   CA A.A 0.0083 0.0098 0.0126 0.0200 0.0165 0.0126 0.0014 0.0077
   CA A.C 0.0093 0.0082 0.0121 0.0217 0.0156 0.0192 0.0009 0.0068
   CA A.G 0.0026 0.0061 0.0046 0.0144 0.0121 0.0060 0.0004 0.0048
   CA A.T 0.0057 0.0051 0.0070 0.0134 0.0100 0.0092 0.0007 0.0067
   CA C.A 0.0075 0.0143 0.0215 0.0414 0.0390 0.0128 0.0060 0.0112
   CA C.C 0.0075 0.0111 0.0138 0.0415 0.0275 0.0143 0.0018 0.0063

The observed occurrence of the motifs, also termed somatic spectrum, can be visualized across studies, which gives a first impression of the data. The distribution of the motifs clearly varies between the studies.

plotSamplesObserved(sca_motifs)

Observed frequency of somatic motifs, also termed somatic spectrum, across studies.

4.3 Decomposition: Inferring Somatic Signatures

The somatic signatures can be estimated with each of the statistical methods implemented in the package. Here, we will use the NMF and PCA, and compare the results. Prior to the estimation, the number $R$ of signatures to obtain has to be fixed; in this example, the data will be decomposed into 5 signatures.

n_sigs = 5

sigs_nmf = nmfSignatures(sca_occurrence, r = n_sigs)

sigs_pca = pcaSignatures(sca_occurrence, r = n_sigs)

The results contains the decomposed matrices stored in a list and can be accessed using standard R accessor functions.

names(sigs_nmf)

   [1] "w"   "h"   "v"   "raw"

sapply(sigs_nmf, dim)

   $w
   [1] 96  5
   
   $h
   [1] 8 5
   
   $v
   [1] 96  8
   
   $raw
   [1] 96  8  5

head(sigs_nmf$w, 3)

                 S1        S2        S3    S4      S5
   CA A.A 5.107e-06 2.375e-16 2.047e-13 1.279 0.52570
   CA A.C 6.276e-02 2.220e-16 2.220e-16 1.320 0.47930
   CA A.G 1.212e-14 1.962e-07 2.792e-03 1.012 0.06564

head(sigs_nmf$h, 3)

              S1       S2       S3       S4       S5
   gbm  0.037329 0.002054 0.002269 0.002543 0.007775
   hnsc 0.008512 0.008109 0.010338 0.006018 0.004097
   kirc 0.008999 0.003468 0.001181 0.004720 0.015132

4.4 Visualization: Exploration of Signatures and Samples

To explore the results for the TCGA data set, we will use the plotting functions. All figures are generated with the ggplot2 package, and thus, their properties and appearances can also be modified at a later stage.

Focusing on the results of the NMF first, the five somatic signatures (named S1 to S5) can be visualized either as a heatmap or as a barchart.

plotSignatureMap(sigs_nmf) + ggtitle("Somatic Signatures: NMF - Heatmap")

Composition of somatic signatures estimated with the NMF, represented as a heatmap.

plotSignatures(sigs_nmf) + ggtitle("Somatic Signatures: NMF - Barchart")

Composition of somatic signatures estimated with the NMF, represented as a barchart.

Each signature represents different properties of the somatic spectrum observed in the data. While signature S1 is mainly characterized by selective C>T alterations, others as S4 and S5 show a broad distribution across the motifs.

In addition, the contribution of the signatures in each study can be represented with the same sets of plots. Signature S1 and S3 are strongly represented in the GBM and SKCM study, respectively. Other signatures show a weaker association with a single cancer type.

plotSampleMap(sigs_nmf)

Occurrence of signatures estimated with the NMF, represented as a heatmap.

plotSamples(sigs_nmf)

Occurrence of signatures estimated with the NMF, represented as a barchart.

In the same way as before, the results of the PCA can be visualized. In contrast to the NMF, the signatures also contain negative values, indicating the depletion of a somatic motif.

plotSignatureMap(sigs_pca) + ggtitle("Somatic Signatures: PCA")

Composition of somatic signatures estimated with the PCA, represented as a heatmap

plotSignatures(sigs_pca)

Composition of somatic signatures estimated with the PCA, represented as a barchart.

plotSampleMap(sigs_pca)

Occurrence of signatures estimated with the PCA, represented as a heatmap.

plotSamples(sigs_pca)

Occurrence of signatures estimated with the PCA, represented as a barchart.

Comparing the results of the two methods, we can see similar characteristics between the sets of signatures, for example S1 of the NMF and S2 of the PCA.

4.5 Extensions: Normalization of Sequence Motif Frequencies and Batch Effects

When investigating somatic signatures between samples from different studies, corrections for technical confounding factors should be considered. In our use case of the TCGA WES studies, this is of minor influence due to similar sequencing technology and variant calling methods across the studies. Approaches for the identification of so termed batch effects have been proposed [3] [8] and could be adapted to the setting of somatic signatures with existing implementations (the sva and leapp packages). While this correction is not performed here, we exemplify the usage by taking the different sequencing technologies of the studies into account.

library(sva)
library(stringr)

df = as(sca_metadata, "data.frame")  ## sample x covariable
pheno = data.frame(s = unlist(df[, "Sequence_Source"]), c = unlist(df[, "Cancer_Type"]))
rownames(pheno) = str_split_fixed(rownames(pheno), "_", 2)[, 1]
mod = model.matrix(~s + c, data = pheno)
mod0 = model.matrix(~c, data = pheno)

sv = sva(sca_occurrence, mod, mod0, method = "irw")

If comparisons are performed across samples or studies with different capture targets, for example by comparing whole exome with whole genome sequencing, further corrections for the frequency of sequence motifs can be taken into account. The kmerFrequency function provides the basis for calculating the occurrence of k-mers over a set of ranges of a reference sequence.

As an example, we compute the frequency of 3-mers for the human chromosome 1, based on a sample of 100'000 locations. Analogously, the k-mer occurrence across the human exome can be obtained easily.

k = 3
n = 1e+05
chrs = "chr1"

chr1_ranges = as(seqinfo(BSgenome.Hsapiens.UCSC.hg19), "GRanges")
chr1_ranges = keepSeqlevels(chr1_ranges, chrs)

k3_chr1 = kmerFrequency(BSgenome.Hsapiens.UCSC.hg19, n, k, chr1_ranges)

k3_chr1

   seq
       AAA     AAC     AAG     AAT     ACA     ACC     ACG     ACT     AGA     AGC     AGG     AGT     ATA     ATC     ATG 
   0.03375 0.01361 0.01778 0.02239 0.01784 0.01029 0.00199 0.01511 0.02069 0.01273 0.01685 0.01493 0.01699 0.01152 0.01547 
       ATT     CAA     CAC     CAG     CAT     CCA     CCC     CCG     CCT     CGA     CGC     CGG     CGT     CTA     CTC 
   0.02121 0.01713 0.01346 0.01949 0.01601 0.01771 0.01254 0.00266 0.01761 0.00209 0.00209 0.00266 0.00231 0.01126 0.01655 
       CTG     CTT     GAA     GAC     GAG     GAT     GCA     GCC     GCG     GCT     GGA     GGC     GGG     GGT     GTA 
   0.01889 0.01805 0.01716 0.00873 0.01596 0.01170 0.01336 0.01116 0.00229 0.01294 0.01404 0.01165 0.01264 0.01113 0.01008 
       GTC     GTG     GTT     TAA     TAC     TAG     TAT     TCA     TCC     TCG     TCT     TGA     TGC     TGG     TGT 
   0.00860 0.01299 0.01304 0.01793 0.01020 0.01150 0.01772 0.01771 0.01483 0.00207 0.01922 0.01789 0.01367 0.01735 0.01777 
       TTA     TTC     TTG     TTT     NNN 
   0.01798 0.01763 0.01681 0.03180 0.09679

With the normalizeMotifs function, the frequency of motifs can be adjusted. Here, we will transform our results of the TCGA WES studies to have the same motif distribution as of a whole-genome analysis. The kmers dataset contains the estimated frequency of 3-mers across the human genome and exome.

head(sca_occurrence)

               gbm     hnsc     kirc    luad    lusc       ov      skcm     thca
   CA A.A 0.008326 0.009818 0.012584 0.01998 0.01648 0.012602 0.0013660 0.007743
   CA A.C 0.009279 0.008179 0.012128 0.02166 0.01558 0.019244 0.0009223 0.006849
   CA A.G 0.002558 0.006078 0.004636 0.01440 0.01205 0.005960 0.0004387 0.004765
   CA A.T 0.005668 0.005050 0.006954 0.01336 0.01002 0.009196 0.0006880 0.006700
   CA C.A 0.007473 0.014257 0.021525 0.04136 0.03900 0.012772 0.0060472 0.011167
   CA C.C 0.007523 0.011084 0.013784 0.04153 0.02754 0.014305 0.0018296 0.006254

data(kmers)
norms = k3wg/k3we
head(norms)

   seq
      AAA    AAC    AAG    AAT    ACA    ACC 
   1.4519 1.1023 1.0248 1.5076 1.1089 0.8911

sca_norm = normalizeMotifs(sca_occurrence, norms)

head(sca_norm)

               gbm     hnsc     kirc     luad     lusc       ov      skcm     thca
   CA A.A 0.009233 0.010887 0.013954 0.022155 0.018270 0.013975 0.0015148 0.008586
   CA A.C 0.008268 0.007288 0.010807 0.019296 0.013884 0.017147 0.0008218 0.006103
   CA A.G 0.001456 0.003459 0.002638 0.008193 0.006858 0.003392 0.0002497 0.002711
   CA A.T 0.006232 0.005554 0.007647 0.014694 0.011017 0.010113 0.0007565 0.007368
   CA C.A 0.006350 0.012114 0.018289 0.035144 0.033139 0.010853 0.0051382 0.009489
   CA C.C 0.005232 0.007707 0.009585 0.028882 0.019148 0.009948 0.0012723 0.004349

5 Alternatives: Inferring Somatic Signatures with Different Approaches

For the identification of somatic signatures, other methods and implementations exist. The original framework [6] [11] proposed for this is based on the NMF and available for the Matlab programming language [7]. In extension, a probabilistic approach based on Poisson processes has been proposed [13] and implemented [10].

6 Frequently Asked Questions

6.1 Getting help

We welcome emails with questions or suggestions about our software, and want to ensure that we eliminate issues if and when they appear. We have a few requests to optimize the process:

All emails and follow-up questions should take place over the Bioconductor mailing list, which serves as a repository of information.
The subject line should contain SomaticSignatures and a few words describing the problem. First search the Bioconductor mailing list, for past threads which might have answered your question.
If you have a question about the behavior of a function, read the sections of the manual page for this function by typing a question mark and the function name, e.g. ?mutationContext. Additionally, read through the vignette to understand the interplay between different functions of the package. We spend a lot of time documenting individual functions and the exact steps that the software is performing.
Include all of your R code related to the question you are asking.
Include complete warning or error messages, and conclude your message with the full output of sessionInfo().

6.2 Installing the package

Before you want to install the SomaticSignatures package, please ensure that you have the latest version of R and Bioconductor installed. For details on this, please have a look at the help packages for R and Bioconductor. Then you can open R and run the following commands which will install the latest release version of SomaticSignatures:

source("http://bioconductor.org/biocLite.R")
biocLite("SomaticSignatures")

6.3 Working with VRanges

A central object in the workflow of SomaticSignatures is the VRanges class which is part of the VariantAnnotation package. It builds upon the commonly used GRanges class of the GenomicRanges package. Essentially, each row represents a variant in terms of its genomic location as well as its reference and alternative allele.

library(VariantAnnotation)

There are multiple ways of converting its own variant calls into a VRanges object. One can for example import them from a VCF file with the readVcf function or employ the readMutect function for importing variant calls from the MuTect caller directly. Further, one can also construct it from any other format in the form of:

vr = VRanges(seqnames = "chr1", ranges = IRanges(start = 1000, width = 1), ref = "A", alt = "C")

vr

   VRanges with 1 range and 0 metadata columns:
         seqnames       ranges strand         ref              alt     totalDepth       refDepth       altDepth
                       
     [1]     chr1 [1000, 1000]      +           A                C                                 
           sampleNames softFilterMatrix
                  
     [1]                           
     ---
     seqlengths:
      chr1
        NA
     hardFilters: NULL

7 References

[1]	Robert C. Gentleman, Vincent J. Carey, Douglas M. Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, Kurt Hornik, Torsten Hothorn, Wolfgang Huber, Stefano Iacus, Rafael Irizarry, Friedrich Leisch, Cheng Li, Martin Maechler, Anthony J. Rossini, Gunther Sawitzki, Colin Smith, Gordon Smyth, Luke Tierney, Jean YH Yang, and Jianhua Zhang. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5(10):R80, September 2004. PMID: 15461798. [ DOI \| http ]
[2]	Wolfram Stacklies, Henning Redestig, Matthias Scholz, Dirk Walther, and Joachim Selbig. pcaMethodsa bioconductor package providing PCA methods for incomplete data. Bioinformatics, 23(9):1164-1167, May 2007. PMID: 17344241. [ DOI \| http ]
[3]	Jeffrey T Leek and John D Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet, 3(9):e161, September 2007. [ DOI \| http ]
[4]	Renaud Gaujoux and Cathal Seoighe. A flexible r package for nonnegative matrix factorization. BMC Bioinformatics, 11(1):367, July 2010. PMID: 20598126. [ DOI \| http ]
[5]	Valerie Obenchain, Martin Morgan, and Michael Lawrence. VariantAnnotation: annotation of genetic variants, 2011. [ .html ]
[6]	Serena Nik-Zainal, Ludmil B. Alexandrov, David C. Wedge, Peter Van Loo, Christopher D. Greenman, Keiran Raine, David Jones, Jonathan Hinton, John Marshall, Lucy A. Stebbings, Andrew Menzies, Sancha Martin, Kenric Leung, Lina Chen, Catherine Leroy, Manasa Ramakrishna, Richard Rance, King Wai Lau, Laura J. Mudie, Ignacio Varela, David J. McBride, Graham R. Bignell, Susanna L. Cooke, Adam Shlien, John Gamble, Ian Whitmore, Mark Maddison, Patrick S. Tarpey, Helen R. Davies, Elli Papaemmanuil, Philip J. Stephens, Stuart McLaren, Adam P. Butler, Jon W. Teague, Göran Jönsson, Judy E. Garber, Daniel Silver, Penelope Miron, Aquila Fatima, Sandrine Boyault, Anita Langerød, Andrew Tutt, John W.M. Martens, Samuel A.J.R. Aparicio, Ake Borg, Anne Vincent Salomon, Gilles Thomas, Anne-Lise Børresen-Dale, Andrea L. Richardson, Michael S. Neuberger, P. Andrew Futreal, Peter J. Campbell, and Michael R. Stratton. Mutational processes molding the genomes of 21 breast cancers. Cell, 149(5):979-993, May 2012. [ DOI \| http ]
[7]	Ludmil Alexandrov. WTSI mutational signature framework, October 2012. [ http ]
[8]	Yunting Sun, Nancy R. Zhang, and Art B. Owen. Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. The Annals of Applied Statistics, 6(4):1664-1688, December 2012. Zentralblatt MATH identifier: 06141543; Mathematical Reviews number (MathSciNet): MR3058679. [ DOI \| http ]
[9]	Paul Theodor Pyl. h5vc: Managing alignment tallies using a hdf5 backend, 2013. [ .html ]
[10]	Andrej Fischer. EMu: expectation-maximisation inference of mutational signatures, 2013. [ http ]
[11]	Ludmil B. Alexandrov, Serena Nik-Zainal, David C. Wedge, Peter J. Campbell, and Michael R. Stratton. Deciphering signatures of mutational processes operative in human cancer. Cell Reports, 3(1):246-259, January 2013. [ DOI \| http ]
[12]	Kristian Cibulskis, Michael S. Lawrence, Scott L. Carter, Andrey Sivachenko, David Jaffe, Carrie Sougnez, Stacey Gabriel, Matthew Meyerson, Eric S. Lander, and Gad Getz. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology, advance online publication, February 2013. [ DOI \| .html ]
[13]	Andrej Fischer, Christopher Jr Illingworth, Peter J Campbell, and Ville Mustonen. EMu: probabilistic inference of mutational processes and their localization in the cancer genome. Genome biology, 14(4):R39, April 2013. PMID: 23628380. [ DOI ]
[14]	Ludmil B. Alexandrov, Serena Nik-Zainal, David C. Wedge, Samuel A. J. R. Aparicio, Sam Behjati, Andrew V. Biankin, Graham R. Bignell, Niccolò Bolli, Ake Borg, Anne-Lise Børresen-Dale, Sandrine Boyault, Birgit Burkhardt, Adam P. Butler, Carlos Caldas, Helen R. Davies, Christine Desmedt, Roland Eils, Jórunn Erla Eyfjörd, John A. Foekens, Mel Greaves, Fumie Hosoda, Barbara Hutter, Tomislav Ilicic, Sandrine Imbeaud, Marcin Imielinsk, Natalie Jäger, David T. W. Jones, David Jones, Stian Knappskog, Marcel Kool, Sunil R. Lakhani, Carlos López-Otín, Sancha Martin, Nikhil C. Munshi, Hiromi Nakamura, Paul A. Northcott, Marina Pajic, Elli Papaemmanuil, Angelo Paradiso, John V. Pearson, Xose S. Puente, Keiran Raine, Manasa Ramakrishna, Andrea L. Richardson, Julia Richter, Philip Rosenstiel, Matthias Schlesner, Ton N. Schumacher, Paul N. Span, Jon W. Teague, Yasushi Totoki, Andrew N. J. Tutt, Rafael Valdés-Mas, Marit M. van Buuren, Laura van t Veer, Anne Vincent-Salomon, Nicola Waddell, Lucy R. Yates, Australian Pancreatic Cancer Genome Initiative, ICGC Breast Cancer Consortium, ICGC MMML-Seq Consortium, Icgc PedBrain, Jessica Zucman-Rossi, P. Andrew Futreal, Ultan McDermott, Peter Lichter, Matthew Meyerson, Sean M. Grimmond, Reiner Siebert, Elías Campo, Tatsuhiro Shibata, Stefan M. Pfister, Peter J. Campbell, and Michael R. Stratton. Signatures of mutational processes in human cancer. Nature, 500(7463):415-421, August 2013. [ DOI \| .html ]
[15]	Paul Theodor Pyl, Julian Gehring, Bernd Fischer, and Wolfgang Huber. h5vc: Scalable nucleotide tallies with HDF5. Bioinformatics, page btu026, January 2014. PMID: 24451629. [ DOI \| http ]

8 Session Info

   R version 3.1.0 (2014-04-10)
   Platform: x86_64-unknown-linux-gnu (64-bit)
   
   locale:
    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=C              
    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
    [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
   
   attached base packages:
   [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
   
   other attached packages:
    [1] fastICA_1.2-0                      BSgenome.Hsapiens.UCSC.hg19_1.3.99 BSgenome_1.32.0                   
    [4] SomaticCancerAlterations_1.0.0     stringr_0.6.2                      ggplot2_0.9.3.1                   
    [7] VariantAnnotation_1.10.1           Rsamtools_1.16.0                   Biostrings_2.32.0                 
   [10] XVector_0.4.0                      GenomicRanges_1.16.3               GenomeInfoDb_1.0.2                
   [13] IRanges_1.22.6                     SomaticSignatures_1.0.1            bigmemory_4.4.6                   
   [16] BH_1.54.0-2                        bigmemory.sri_0.1.2                Biobase_2.24.0                    
   [19] BiocGenerics_0.10.0                knitr_1.5                         
   
   loaded via a namespace (and not attached):
    [1] AnnotationDbi_1.26.0    BBmisc_1.6              BatchJobs_1.2           BiocParallel_0.6.0     
    [5] DBI_0.2-7               Formula_1.1-1           GenomicAlignments_1.0.1 GenomicFeatures_1.16.0 
    [9] Hmisc_3.14-4            MASS_7.3-33             NMF_0.20.5              RColorBrewer_1.0-5     
   [13] RCurl_1.95-4.1          RSQLite_0.11.4          Rcpp_0.11.1             XML_3.98-1.1           
   [17] biomaRt_2.20.0          biovizBase_1.12.1       bitops_1.0-6            brew_1.0-6             
   [21] cluster_1.15.2          codetools_0.2-8         colorspace_1.2-4        dichromat_2.0-0        
   [25] digest_0.6.4            doParallel_1.0.8        evaluate_0.5.5          exomeCopy_1.10.0       
   [29] fail_1.2                foreach_1.4.2           formatR_0.10            ggbio_1.12.3           
   [33] grid_3.1.0              gridBase_0.4-7          gridExtra_0.9.1         gtable_0.1.2           
   [37] gtools_3.4.0            highr_0.3               iterators_1.0.7         labeling_0.2           
   [41] lattice_0.20-29         latticeExtra_0.6-26     markdown_0.7            mime_0.1.1             
   [45] munsell_0.4.2           pcaMethods_1.54.0       pkgmaker_0.20           plyr_1.8.1             
   [49] proto_0.3-10            registry_0.2            reshape2_1.4            rngtools_1.2.4         
   [53] rtracklayer_1.24.0      scales_0.2.4            sendmailR_1.1-2         splines_3.1.0          
   [57] stats4_3.1.0            survival_2.37-7         tools_3.1.0             xtable_1.7-3           
   [61] zlibbioc_1.10.0