Contents

1 Introduction.

With advances in Cancer Genomics, Mutation Annotation Format (MAF) is being widley accepted and used to store somatic variants detected. The Cancer Genome Atlas Project has seqenced over 30 different cancers with sample size of each cancer type being over 200. Resulting data consisting of somatic variants are stored in the form of Mutation Annotation Format. This package attempts to summarize, analyze, annotate and visualize MAF files in an efficient manner from either TCGA sources or any in-house studies as long as the data is in MAF format.

1.1 Citation

Please cite the below if you find this tool useful for you.

Mayakonda, A. & Koeffler, H.P. Maftools: Efficient analysis, visualization and summarization of MAF files 
from large-scale cohort based cancer studies. bioRxiv (2016). doi: http://dx.doi.org/10.1101/052662

2 MAF field requirements.

MAF files contain many fields ranging from chromosome names to cosmic annotations. However most of the analysis in maftools uses following fields.

Complete specififcation of MAF files can be found on NCI TCGA page.

This vignette demonstrates the usage and application of maftools on an example MAF file from TCGA LAML cohort 1.

3 Reading and summarizing maf files.

3.1 Reading MAF files.

read.maf reads MAF files, summarizes it in various ways and stores it as an MAF object.

suppressPackageStartupMessages(require(maftools))
#read TCGA maf file for LAML
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
laml = read.maf(maf = laml.maf, removeSilent = TRUE, useAll = FALSE)

3.2 MAF object

Summarized MAF file is stored as an MAF object. MAF object contains main maf file, summarized data and an oncomatrix which is useful to plot oncoplots (aka waterfall plots). There are accessor methods to access the useful slots from MAF object. However, all slots can be accessed using @, just like most of S4 objects.

#Typing laml shows basic summary of MAF file.
laml
## An object of class  MAF 
##                    ID          summary       Mean Median
##  1:        NCBI_Build               37         NA     NA
##  2:            Center genome.wustl.edu         NA     NA
##  3:           Samples              192         NA     NA
##  4:            nGenes             1241         NA     NA
##  5:   Frame_Shift_Del               52 0.27083333      0
##  6:   Frame_Shift_Ins               91 0.47395833      0
##  7:      In_Frame_Del               10 0.05208333      0
##  8:      In_Frame_Ins               42 0.21875000      0
##  9: Missense_Mutation             1342 6.98958333      7
## 10: Nonsense_Mutation              103 0.53645833      0
## 11:       Splice_Site               92 0.47916667      0
## 12:             total             1732 9.02083333      9
#Shows sample summry.
getSampleSummary(laml)
##      Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
##   1:         TCGA.AB.3009               0               5            0
##   2:         TCGA.AB.2807               1               0            1
##   3:         TCGA.AB.2959               0               0            0
##   4:         TCGA.AB.3002               0               0            0
##   5:         TCGA.AB.2849               0               1            0
##  ---                                                                  
## 188:         TCGA.AB.2933               0               0            0
## 189:         TCGA.AB.2942               0               0            0
## 190:         TCGA.AB.2946               0               0            0
## 191:         TCGA.AB.2954               0               0            0
## 192:         TCGA.AB.2982               0               0            0
##      In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
##   1:            1                25                 2           1    34
##   2:            0                16                 3           4    25
##   3:            0                22                 0           1    23
##   4:            0                15                 1           5    21
##   5:            0                16                 1           2    20
##  ---                                                                   
## 188:            0                 1                 0           0     1
## 189:            1                 0                 0           0     1
## 190:            0                 1                 0           0     1
## 191:            0                 1                 0           0     1
## 192:            0                 1                 0           0     1
#Shows frequently mutated genes.
getGeneSummary(laml)
##       Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
##    1:        FLT3               0               0            1
##    2:      DNMT3A               4               0            0
##    3:        NPM1               0              33            0
##    4:        IDH2               0               0            0
##    5:        IDH1               0               0            0
##   ---                                                         
## 1237:      ZNF689               0               0            0
## 1238:      ZNF75D               0               0            0
## 1239:      ZNF827               1               0            0
## 1240:       ZNF99               0               0            0
## 1241:        ZPBP               0               0            0
##       In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
##    1:           33                15                 0           3    52
##    2:            0                39                 5           6    54
##    3:            0                 1                 0           0    34
##    4:            0                20                 0           0    20
##    5:            0                18                 0           0    18
##   ---                                                                   
## 1237:            0                 1                 0           0     1
## 1238:            0                 1                 0           0     1
## 1239:            0                 0                 0           0     1
## 1240:            0                 1                 0           0     1
## 1241:            0                 1                 0           0     1
##       MutatedSamples
##    1:             52
##    2:             48
##    3:             33
##    4:             20
##    5:             18
##   ---               
## 1237:              1
## 1238:              1
## 1239:              1
## 1240:              1
## 1241:              1
#Shows all fields in MAF
getFields(laml)
##  [1] "Hugo_Symbol"            "Entrez_Gene_Id"        
##  [3] "Center"                 "NCBI_Build"            
##  [5] "Chromosome"             "Start_Position"        
##  [7] "End_Position"           "Strand"                
##  [9] "Variant_Classification" "Variant_Type"          
## [11] "Reference_Allele"       "Tumor_Seq_Allele1"     
## [13] "Tumor_Seq_Allele2"      "Tumor_Sample_Barcode"  
## [15] "Protein_Change"         "i_TumorVAF_WU"         
## [17] "i_transcript_name"
#Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = 'laml')

4 Visualization.

4.1 Plotting MAF summary.

We can use plotmafSummary to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification. We can add either mean or median line to the stacked barplot to display average/median number of variants across the cohort.

plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

4.2 Oncoplots (aka waterfall plots)

4.2.1 Drawing oncoplots.

Bettter representaion of maf file can be shown as oncoplots, also known as waterfall plots. Oncoplot function uses ComplexHeatmap to draw oncoplots2. To be specific, oncoplot is a wrapper around ComplexHeatmap’s OncoPrint function with little modification and automation which makes plotting easier. Side barplot and top barplots can be controlled by drawRowBar and drawColBar arguments respectivelly.

#We will draw oncoplots for top ten mutated genes. (Removing non-mutated samples from the plot for better visualization)
oncoplot(maf = laml, top = 10, removeNonMutated = TRUE)

NOTE: Variants annotated as Multi_Hit are those genes which are mutated more than once in the same sample.

4.2.2 Including copy number data into oncoplots.

If we have copy number data along with MAF file, we can include them in oncoplot as to show if any genes are amplified or deleted. One most widely used tool for copy number analysis from large scale studies is GISTIC and we can simultaneously read gistic results along with MAF3. GISTIC generates numerous files but we need mainly three files all_lesions.conf_XX.txt, amp_genes.conf_XX.txt and del_genes.conf_XX.txt, where XX is confidence level. These files contain significantly altered genomic regions along with amplified and delted genes respectively.

#read TCGA maf file for LAML
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
laml.plus.gistic = read.maf(maf = laml.maf, removeSilent = TRUE, useAll = FALSE, gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, isTCGA = TRUE)
#We will draw oncoplots for top ten mutated genes. (Removing non-mutated samples from the plot for better visualization)
oncoplot(maf = laml.plus.gistic, top = 10, removeNonMutated = TRUE, sortByMutation = TRUE)

This plot shows frequent deletions in TP53 gene which is located on one of the significanlty deleted locus 17p13.2.

4.2.3 Changing colors and adding annotations to oncoplots.

It is often the case that we include meta data to show sample characterstics such as gender, treatment, etc. We can include such meta data by passing them to annotation argument of oncoplot. We can also change colors for Variant_Classification by providing a named vector of colors.

#Read FAB classification of TCGA LAML barcodes.
laml.fab.anno = system.file('extdata', 'tcga_laml_fab_annotation.txt', package = 'maftools')
laml.fab.anno = read.delim(laml.fab.anno, sep = '\t')
head(laml.fab.anno)
##   Tumor_Sample_Barcode FAB_classification
## 1         TCGA-AB-2802                 M4
## 2         TCGA-AB-2803                 M3
## 3         TCGA-AB-2804                 M3
## 4         TCGA-AB-2805                 M0
## 5         TCGA-AB-2806                 M1
## 6         TCGA-AB-2807                 M1
#Changing colors (You can use any colors, here in this example we will use a color palette from RColorBrewer)
col = RColorBrewer::brewer.pal(n = 8, name = 'Paired')
names(col) = c('Frame_Shift_Del','Missense_Mutation', 'Nonsense_Mutation', 'Multi_Hit', 'Frame_Shift_Ins',
               'In_Frame_Ins', 'Splice_Site', 'In_Frame_Del')
#We will plot same top ten mutated genes with FAB classification as annotation and using above defined colors.
oncoplot(maf = laml, top = 10, annotation = laml.fab.anno, removeNonMutated = TRUE, colors = col)

4.3 Oncostrip

We can visualize any set of genes using oncostrip function, which draws mutations in each sample similar to OncoPrinter tool on cBioPortal. oncostrip can be used to draw any number of genes using top or genes arguments.

oncostrip(maf = laml, genes = c('DNMT3A','NPM1', 'RUNX1'), removeNonMutated = TRUE, showTumorSampleBarcodes = FALSE)

4.4 Transition and Transversions.

titv function classifies SNPs into Transitions and Transversions and returns a list of summarized tables in various ways. Summarized data can also be visulaized as a boxplot showing overall distribution of six different conversions and as a stacked barplot showing fraction of conversions in each sample.

laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = laml.titv)

4.5 Lollipop plots for amino acid changes.

Lollipop plots are simple and most effective way showing mutation spots on protein structure. Many oncogenes have a preferential sites which are mutated more often than any other locus. These spots are considered to be mutational hotspots and lollipop plots can be used to display them along with rest of the mutations. We can draw such plots using the function lollipopPlot. This fuction requires us to have amino acid changes information in the maf file. However MAF files have no clear guidelines on naming the field for amino acid changes, with different studies having different field (or column) names for amino acid changes. By default, lollipopPlot looks for column AAChange, and if its not found in the MAF file, it prints all availble fields with a warning message. For below example, MAF file contains amino acid changes under a field/column name ‘Protein_Change’. We will manually specify this using argumnet AACol. This function also returns the plot as a ggplot object, which user can later modify if needed.

#Lets plot lollipop plot for DNMT3A, which is one of the most frequent mutated gene in Leukemia.
dnmt3a.lpop = lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', showMutationRate = TRUE, domainLabelSize = 3, defaultYaxis = FALSE)
## 3 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
##      HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_175629  NP_783328       912
## 2: DNMT3A NM_022552  NP_072046       912
## 3: DNMT3A NM_153759  NP_715640       723
## Using longer transcript NM_175629 for now.
## Removed 3 mutations for which AA position was not available

Note that lollipopPlot warns user on availability of diferent transcripts for the given gene. If we know the transcript id before hand, we can specify it as refSeqID or proteinID. By default lollipopPlot uses the longer isoform.

4.5.1 Labelling and repelling points.

We can also label points on the lollipopPlot using argument labelPos. If labelPos is set to ‘all’, all of the points are highlighted. Sometimes, many mutations are clustered within a range of few amino acid positons. In that case we can use repel option which tries to repel points for clear representation.

#Lets plot mutations on KIT gene, without repel option.
kit.lpop = lollipopPlot(maf = laml, gene = 'KIT', AACol = 'Protein_Change', labelPos = c(416, 418), refSeqID = 'NM_000222', domainLabelSize = 3)

#Same plot with repel=TRUE
kit.lpop = lollipopPlot(maf = laml, gene = 'KIT', AACol = 'Protein_Change', labelPos = c(416, 418), refSeqID = 'NM_000222', repel = TRUE, domainLabelSize = 3)

4.5.2 cBioPortal style annotations

MutationMapper on cBioPortal collapses Variant Classfications into truncating and others. It also includes somatic mutation rate.

laml.dnmt3a = lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', refSeqID = 'NM_175629', labelPos = 882, collapsePosLabel = TRUE, cBioPortal = TRUE, domainLabelSize = 3, defaultYaxis = FALSE)

4.6 Integrating somatic variants and copy number alterations

Many cancer genomic studies involve copy number data generated either from sequencing or SNP chip arrays. Copy number analysis provides us a lot of information from genome wide copy number aberations to tumor purity. Most of the time copy number data is stored as segments, with each segment represented by a log ratio of copy number changes compared to matched normals. There are many segmentation algorithms, with the most popular being Circular Binary Segmentation implemented in DNACopy bioconductor 4. We can plot such segmented copy number data and map all mutations on to it. It provides a quick way of knowing which variants (in a way which genes) are located on copy number altered genomic regions.

tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg, maf = laml, labelAll = TRUE)
## Warning: Ignoring unknown aesthetics: label
## Warning: Removed 30 rows containing missing values (geom_segment).

Above plot shows two genes NF1 and SUZ12 are located on a region which has a shallow deletion. Later we will also see how these variants on copy number altered regions affect variant allele frequencies and tumor heterogeneity estimation.

4.7 Rainfall plots

Cancer genomes, especially solid tumors are characterized by genomic loci with localized hypermutations 5. Such hyper mutated genomic regions can be visualized by plotting inter variant distance on a linear genomic scale. These plots generally called rainfall plots and we can draw such plots using rainfallPlot. If detectChangePoints is set to TRUE, rainfall plot also highlights regions where potential changes in inter-event distances are located. But please be aware that detected change-points are only loci where the distribution of inter-event distance changes. Segments may have to be manually inferred by adjacent change-points. This will be improved in future updates.

coad <- system.file("extdata", "coad.maf.gz", package = "maftools")
coad = read.maf(maf = coad)
coad.rf = rainfallPlot(maf = coad, detectChangePoints = TRUE, fontSize = 12, pointSize = 0.6)
## Change points detected at:
##    Chromosome  Position
## 1:          2 179108701
## 2:          2 179682996
## 3:          5 140161768
## 4:          5 140998034
## 5:         19   8907491
## 6:         19   9065844
## 7:         20   3630019

4.8 Compare mutation load against TCGA cohorts

TCGA contains over 30 different cancer cohrts and median mutation load across them varies from as low as 7 per exome (Pheochromocytoma and Paraganglioma arising from Adrenal Gland) to as high as 315 per exome (Skin Cutaneoys Melanoma). It is informative to see how mutation load in given maf stands against TCGA cohorts. This can can be achieved with the fuction tcgaComapre which draws distribution of variants compiled from over 10,000 WXS samples across 33 TCGA landmark cohorts. Plot generated is similar to the one described in Alexandrov et al 5.

laml.mutload = tcgaCompare(maf = laml, cohortName = 'Example-LAML')
## Summary..
##           Cohort Cohort_Size Median_Mutations
##  1:         SKCM         468            315.0
##  2:         LUSC         494            187.5
##  3:         LUAD         567            158.0
##  4:         BLCA         412            131.5
##  5:         ESCA         184             85.0
##  6:         HNSC         509             82.0
##  7:         STAD         439             82.0
##  8:         DLBC          48             81.5
##  9:         UCEC         542             78.0
## 10:         COAD         433             76.0
## 11:           OV         443             72.0
## 12:         LIHC         374             67.0
## 13:         CESC         305             66.0
## 14:         READ         158             63.0
## 15:         KIRP         288             53.0
## 16:         KIRC         339             44.0
## 17:          UCS          57             35.0
## 18:         BRCA        1044             34.0
## 19:          GBM         395             32.0
## 20:         SARC         255             31.0
## 21:         CHOL          51             30.0
## 22:         MESO          83             25.0
## 23:         PAAD         178             22.5
## 24:          ACC          92             21.5
## 25:          LGG         511             19.0
## 26:         PRAD         496             19.0
## 27:         KICH          66             13.0
## 28:         TGCT         149             11.0
## 29:         THYM         123             11.0
## 30:         LAML         143             10.0
## 31:          UVM          80             10.0
## 32:         THCA         491              9.0
## 33: Example-LAML         192              9.0
## 34:         PCPG         177              7.0
##           Cohort Cohort_Size Median_Mutations