Contents

1 Introduction

Mutational processes leave characteristic footprints in genomic DNA. This package provides a comprehensive set of flexible functions that allows researchers to easily evaluate and visualize a multitude of mutational patterns in base substitution catalogues of e.g. healthy samples, tumour samples, or DNA-repair deficient cells. This is the second major version of the package. Many new functions have been added and functions from the previous version have been enhanced. The package covers a wide range of patterns including: mutational signatures, transcriptional and replicative strand bias, lesion segregation, genomic distribution and association with genomic features, which are collectively meaningful for studying the activity of mutational processes. The package works with single nucleotide variants (SNVs), insertions and deletions (Indels), double base substitutions (DBSs) and larger multi base substitutions (MBSs). The package provides functionalities for both extracting mutational signatures de novo and determining the contribution of previously identified mutational signatures on a single sample level. MutationalPatterns integrates with common R genomic analysis workflows and allows easy association with (publicly available) annotation data.

Background on the biological relevance of the different mutational patterns, a practical illustration of the package functionalities, comparison with similar tools and software packages and an elaborate discussion, are described in the new MutationalPatterns article, which is in currently being written. The old article can be found here.

This vignette shows some common ways in which the functions in this package can be used. It is however not exhaustive and won’t show every argument of every function. You can view the documentation of a function by adding a ? in front of it. Like: ?plot_spectrum. The describes the functions and all their arguments in more detail. It also contains more examples of how the functions in this package can be used.

2 Installation

First you need to install the package from Bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MutationalPatterns")

You also need to load the package. This needs to be repeated every time you restart R.

library(MutationalPatterns)

3 Data

To perform the mutational pattern analyses, you need to load one or multiple VCF files with variant calls and the corresponding reference genome.

3.1 List reference genome

You can list available genomes using BSgenome:

library(BSgenome)
head(available.genomes())
## [1] "BSgenome.Alyrata.JGI.v1"                 "BSgenome.Amellifera.BeeBase.assembly4"  
## [3] "BSgenome.Amellifera.NCBI.AmelHAv3.1"     "BSgenome.Amellifera.UCSC.apiMel2"       
## [5] "BSgenome.Amellifera.UCSC.apiMel2.masked" "BSgenome.Aofficinalis.NCBI.V1"

Make sure to install the reference genome that matches your VCFs. For the example data this is BSgenome.Hsapiens.UCSC.hg19.

Now you can load your reference genome:

ref_genome <- "BSgenome.Hsapiens.UCSC.hg19"
library(ref_genome, character.only = TRUE)

3.2 Load example data SNVs

We provided two example data sets with this package. One consists of a subset of somatic SNV catalogues of 9 normal human adult stem cells from 3 different tissues (Blokzijl et al. 2016), and the other contains somatic indels and DBSs from 3 healthy human hematopoietic stem cells (Osorio et al. 2018). The MBS data you will find in the latter dataset was manually included by us to demonstrate some features of this package.

This is how you can locate the VCF files of the example data from the first set.

These will be used for the SNV examples:

vcf_files <- list.files(system.file("extdata", package = "MutationalPatterns"),
  pattern = "sample.vcf", full.names = TRUE
)

You also need to define corresponding sample names for the VCF files:

sample_names <- c(
  "colon1", "colon2", "colon3",
  "intestine1", "intestine2", "intestine3",
  "liver1", "liver2", "liver3"
)

Now you can load the VCF files into a GRangesList:

grl <- read_vcfs_as_granges(vcf_files, sample_names, ref_genome)
## Warning: replacing previous import 'utils::findMatches' by 'S4Vectors::findMatches' when
## loading 'AnnotationDbi'

Here we define relevant metadata on the samples, such as tissue type. This will be useful later.

tissue <- c(rep("colon", 3), rep("intestine", 3), rep("liver", 3))

3.3 Load example data indels, DBSs and MBSs

We will now locate the VCF files of the example data from the second set. These will be used for the indels, DBS and MBS examples.

blood_vcf_fnames <- list.files(
  system.file("extdata", package = "MutationalPatterns"), 
  pattern = "blood.*vcf", full.names = TRUE)

Set their sample names.

blood_sample_names <- c("blood1", "blood2", "blood3")

Read in the data, without filtering for any mutation type using the type="all" argument. (By default only SNVs are loaded for backwards compatibility.)

blood_grl <- read_vcfs_as_granges(blood_vcf_fnames, blood_sample_names, 
                                  ref_genome, type = "all")

You can now retrieve different types of mutations from the GrangesList.

snv_grl <- get_mut_type(blood_grl, type = "snv")
## Any neighbouring SNVs will be merged into DBS/MBS variants.
## Set the 'predefined_dbs_mbs' to 'TRUE' if you don't want this.
indel_grl <- get_mut_type(blood_grl, type = "indel")
dbs_grl <- get_mut_type(blood_grl, type = "dbs")
## Any neighbouring SNVs will be merged into DBS/MBS variants.
## Set the 'predefined_dbs_mbs' to 'TRUE' if you don't want this.
mbs_grl <- get_mut_type(blood_grl, type = "mbs")
## Any neighbouring SNVs will be merged into DBS/MBS variants.
## Set the 'predefined_dbs_mbs' to 'TRUE' if you don't want this.

It’s also possible to directly select for a specific mutation type when reading in the data. This can be a convenient shortcut, when you are only interested in a single type of mutation.

indel_grl <- read_vcfs_as_granges(blood_vcf_fnames, blood_sample_names, 
                                  ref_genome, type = "indel")

By default the package assumes that DBS and MBS variants are present in your vcfs as separate neighbouring SNVs. MutationalPatterns merges these to get DBS and MBS variants. If DBS and MBS variants have already been defined in your vcf or if you don’t want any variants to be merged, then you can use the predefined_dbs_mbs argument, when using read_vcfs_as_granges or get_mut_type. (In this example the result will be empty, because the DBS variants were not predefined)

predefined_dbs_grl <- read_vcfs_as_granges(blood_vcf_fnames, blood_sample_names, 
                                  ref_genome, type = "dbs",
                                  predefined_dbs_mbs = TRUE)

4 Mutation characteristics

4.1 SNVs

4.1.1 Base substitution types

You can retrieve base substitution types from the VCF GRanges object as “REF>ALT” using mutations_from_vcf:

muts <- mutations_from_vcf(grl[[1]])
head(muts, 12)
##  [1] "A>C" "A>G" "C>T" "A>G" "G>T" "T>A" "T>C" "G>A" "G>A" "C>A" "G>A" "G>T"

You can retrieve the base substitutions from the VCF GRanges object and convert them to the 6 types of base substitution types that are distinguished by convention: C>A, C>G, C>T, T>A, T>C, T>G. For example, when the reference allele is G and the alternative allele is T (G>T), mut_type returns the G:C>T:A mutation as a C>A mutation:

types <- mut_type(grl[[1]])
head(types, 12)
##  [1] "T>G" "T>C" "C>T" "T>C" "C>A" "T>A" "T>C" "C>T" "C>T" "C>A" "C>T" "C>A"

To retrieve the sequence context (one base upstream and one base downstream) of the base substitutions in the VCF object from the reference genome, you can use the mut_context function:

context <- mut_context(grl[[1]], ref_genome)
head(context, 12)
##  chr1  chr1  chr1  chr1  chr1  chr1  chr1  chr1  chr1  chr2  chr2  chr2 
## "CAG" "AAC" "ACA" "AAG" "TGA" "GTT" "ATT" "CGC" "AGC" "ACA" "CGT" "GGA"

Withtype_context, you can retrieve the types and contexts for all positions in the VCF GRanges object. For the base substitutions that are converted to the conventional base substitution types, the reverse complement of the sequence context is returned.

type_context <- type_context(grl[[1]], ref_genome)
lapply(type_context, head, 12)
## $types
##  [1] "T>G" "T>C" "C>T" "T>C" "C>A" "T>A" "T>C" "C>T" "C>T" "C>A" "C>T" "C>A"
## 
## $context
##  chr1  chr1  chr1  chr1  chr1  chr1  chr1  chr1  chr1  chr2  chr2  chr2 
## "CTG" "GTT" "ACA" "CTT" "TCA" "GTT" "ATT" "GCG" "GCT" "ACA" "ACG" "TCC"

With mut_type_occurrences, you can count mutation type occurrences for all VCF objects in the GRangesList. For C>T mutations, a distinction is made between C>T at CpG sites and other sites, as deamination of methylated cytosine at CpG sites is a common mutational process. For this reason, the reference genome is needed for this functionality.

type_occurrences <- mut_type_occurrences(grl, ref_genome)
type_occurrences
##            C>A C>G C>T T>A T>C T>G C>T at CpG C>T other
## colon1      28   5 109  12  30  12         59        50
## colon2      77  29 345  36  90  21        209       136
## colon3      79  19 243  25  61  23        165        78
## intestine1  19   8  74  19  26   4         33        41
## intestine2 118  49 423  57 126  27        258       165
## intestine3  54  27 298  32  67  22        192       106
## liver1      43  22  94  30  77  34         18        76
## liver2     144  93 274 103 209  73         48       226
## liver3      39  28  61  15  32  23          7        54

4.1.2 Mutation spectrum

A mutation spectrum shows the relative contribution of each mutation type in the base substitution catalogs. The plot_spectrum function plots the mean relative contribution of each of the 6 base substitution types over all samples. Error bars indicate the 95% confidence interval over all samples. The total number of mutations is indicated.

p1 <- plot_spectrum(type_occurrences)

You can also plot the mutation spectrum with distinction between C>T at CpG sites and other sites:

p2 <- plot_spectrum(type_occurrences, CT = TRUE)

Other options include plotting the spectrum with the individual samples as points and removing the legend to save space:

p3 <- plot_spectrum(type_occurrences, CT = TRUE, 
                    indv_points = TRUE, legend = FALSE)

Here we use the gridExtra package to combine the created plots, so you can see them next to each other.

library("gridExtra")
grid.arrange(p1, p2, p3, ncol = 3, widths = c(3, 3, 1.75))

It’s also possible to create a facet per sample group, e.g. plot the spectrum for each tissue separately:

p4 <- plot_spectrum(type_occurrences, by = tissue, CT = TRUE, legend = TRUE)

Or you could use the standard deviation instead of a 95% confidence interval:

p5 <- plot_spectrum(type_occurrences, CT = TRUE, 
                    legend = TRUE, error_bars = "stdev")
grid.arrange(p4, p5, ncol = 2, widths = c(4, 2.3))

4.1.3 96 mutational profile

First you should make a 96 trinucleotide mutation count matrix. (In contrast to previous versions this also works for single samples.)

mut_mat <- mut_matrix(vcf_list = grl, ref_genome = ref_genome)
head(mut_mat)
##         colon1 colon2 colon3 intestine1 intestine2 intestine3 liver1 liver2 liver3
## A[C>A]A      3     10     10          5         19          6      8     10      3
## A[C>A]C      0      3      3          1          8          4      1      8      2
## A[C>A]G      2      3      3          1          4          0      1      6      2
## A[C>A]T      0      2      9          0          9          2      2     12      2
## C[C>A]A      1      8      5          0          8          7      2     15      3
## C[C>A]C      2      5      3          1          3          2      1     15      2

Next, you can use this matrix to plot the 96 profile of samples. In this example we do this for 2 samples:

plot_96_profile(mut_mat[, c(1, 7)])

4.1.4 Larger contexts

It’s also possible to look at larger mutational contexts. However, this is only useful if you have a large number of mutations.

mut_mat_ext_context <- mut_matrix(grl, ref_genome, extension = 2)
head(mut_mat_ext_context)
##           colon1 colon2 colon3 intestine1 intestine2 intestine3 liver1 liver2 liver3
## AA[C>A]AA      0      4      1          2          9          4      0      4      0
## AA[C>A]AC      0      0      2          0          1          0      1      0      0
## AA[C>A]AG      0      0      1          0          2          0      1      0      0
## AA[C>A]AT      0      0      1          1          2          0      0      0      0
## AA[C>A]CA      0      0      0          0          0          0      0      2      1
## AA[C>A]CC      0      1      0          0          2          2      0      1      0

The extension argument also works for the mut_context and type_context functions.

You can visualize this matrix with a heatmap.

plot_profile_heatmap(mut_mat_ext_context, by = tissue)