With advances in Cancer Genomics, Mutation Annotation Format (MAF) is being widley accepted and used to store somatic variants detected. The Cancer Genome Atlas Project has seqenced over 30 different cancers with sample size of each cancer type being over 200. Resulting data consisting of somatic variants are stored in the form of Mutation Annotation Format. This package attempts to summarize, analyze, annotate and visualize MAF files in an efficient manner from either TCGA sources or any in-house studies as long as the data is in MAF format.
Please cite the below if you find this tool useful for you.
Mayakonda, A. & Koeffler, H.P. Maftools: Efficient analysis, visualization and summarization of MAF files
from large-scale cohort based cancer studies. bioRxiv (2016). doi: http://dx.doi.org/10.1101/052662
MAF files contain many fields ranging from chromosome names to cosmic annotations. However most of the analysis in maftools uses following fields.
Mandatoty fields: Hugo_Symbol, Chromosome, Start_Position, End_Position, Variant_Classification, Variant_Type and Tumor_Sample_Barcode.
Recommended optional fields: non MAF specific fields containing vaf and amino acid change information.
Complete specififcation of MAF files can be found on NCI TCGA page.
This vignette demonstrates the usage and application of maftools on an example MAF file from TCGA LAML cohort 1.
read.maf
reads MAF files, summarizes it in various ways and stores it as an MAF object.
suppressPackageStartupMessages(require(maftools))
#read TCGA maf file for LAML
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
laml = read.maf(maf = laml.maf, removeSilent = TRUE, useAll = FALSE)
Summarized MAF file is stored as an MAF object. MAF object contains main maf file, summarized data and an oncomatrix which is useful to plot oncoplots (aka waterfall plots). There are accessor methods to access the useful slots from MAF object. However, all slots can be accessed using @
, just like most of S4 objects.
#Typing laml shows basic summary of MAF file.
laml
## An object of class MAF
## ID summary Mean Median
## 1: NCBI_Build 37 NA NA
## 2: Center genome.wustl.edu NA NA
## 3: Samples 192 NA NA
## 4: nGenes 1241 NA NA
## 5: Frame_Shift_Del 52 0.27083333 0
## 6: Frame_Shift_Ins 91 0.47395833 0
## 7: In_Frame_Del 10 0.05208333 0
## 8: In_Frame_Ins 42 0.21875000 0
## 9: Missense_Mutation 1342 6.98958333 7
## 10: Nonsense_Mutation 103 0.53645833 0
## 11: Splice_Site 92 0.47916667 0
## 12: total 1732 9.02083333 9
#Shows sample summry.
getSampleSummary(laml)
## Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
## 1: TCGA.AB.3009 0 5 0
## 2: TCGA.AB.2807 1 0 1
## 3: TCGA.AB.2959 0 0 0
## 4: TCGA.AB.3002 0 0 0
## 5: TCGA.AB.2849 0 1 0
## ---
## 188: TCGA.AB.2933 0 0 0
## 189: TCGA.AB.2942 0 0 0
## 190: TCGA.AB.2946 0 0 0
## 191: TCGA.AB.2954 0 0 0
## 192: TCGA.AB.2982 0 0 0
## In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
## 1: 1 25 2 1 34
## 2: 0 16 3 4 25
## 3: 0 22 0 1 23
## 4: 0 15 1 5 21
## 5: 0 16 1 2 20
## ---
## 188: 0 1 0 0 1
## 189: 1 0 0 0 1
## 190: 0 1 0 0 1
## 191: 0 1 0 0 1
## 192: 0 1 0 0 1
#Shows frequently mutated genes.
getGeneSummary(laml)
## Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
## 1: FLT3 0 0 1
## 2: DNMT3A 4 0 0
## 3: NPM1 0 33 0
## 4: IDH2 0 0 0
## 5: IDH1 0 0 0
## ---
## 1237: ZNF689 0 0 0
## 1238: ZNF75D 0 0 0
## 1239: ZNF827 1 0 0
## 1240: ZNF99 0 0 0
## 1241: ZPBP 0 0 0
## In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
## 1: 33 15 0 3 52
## 2: 0 39 5 6 54
## 3: 0 1 0 0 34
## 4: 0 20 0 0 20
## 5: 0 18 0 0 18
## ---
## 1237: 0 1 0 0 1
## 1238: 0 1 0 0 1
## 1239: 0 0 0 0 1
## 1240: 0 1 0 0 1
## 1241: 0 1 0 0 1
## MutatedSamples
## 1: 52
## 2: 48
## 3: 33
## 4: 20
## 5: 18
## ---
## 1237: 1
## 1238: 1
## 1239: 1
## 1240: 1
## 1241: 1
#Shows all fields in MAF
getFields(laml)
## [1] "Hugo_Symbol" "Entrez_Gene_Id"
## [3] "Center" "NCBI_Build"
## [5] "Chromosome" "Start_Position"
## [7] "End_Position" "Strand"
## [9] "Variant_Classification" "Variant_Type"
## [11] "Reference_Allele" "Tumor_Seq_Allele1"
## [13] "Tumor_Seq_Allele2" "Tumor_Sample_Barcode"
## [15] "Protein_Change" "i_TumorVAF_WU"
## [17] "i_transcript_name"
#Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = 'laml')
We can use plotmafSummary
to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification. We can add either mean or median line to the stacked barplot to display average/median number of variants across the cohort.
plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
Bettter representaion of maf file can be shown as oncoplots, also known as waterfall plots. Oncoplot function uses ComplexHeatmap to draw oncoplots2. To be specific, oncoplot
is a wrapper around ComplexHeatmap’s OncoPrint
function with little modification and automation which makes plotting easier. Side barplot and top barplots can be controlled by drawRowBar
and drawColBar
arguments respectivelly.
#We will draw oncoplots for top ten mutated genes. (Removing non-mutated samples from the plot for better visualization)
oncoplot(maf = laml, top = 10, removeNonMutated = TRUE)
NOTE: Variants annotated as Multi_Hit
are those genes which are mutated more than once in the same sample.
If we have copy number data along with MAF file, we can include them in oncoplot as to show if any genes are amplified or deleted. One most widely used tool for copy number analysis from large scale studies is GISTIC and we can simultaneously read gistic results along with MAF3. GISTIC generates numerous files but we need mainly three files all_lesions.conf_XX.txt
, amp_genes.conf_XX.txt
and del_genes.conf_XX.txt
, where XX is confidence level. These files contain significantly altered genomic regions along with amplified and delted genes respectively.
#read TCGA maf file for LAML
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
laml.plus.gistic = read.maf(maf = laml.maf, removeSilent = TRUE, useAll = FALSE, gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, isTCGA = TRUE)
#We will draw oncoplots for top ten mutated genes. (Removing non-mutated samples from the plot for better visualization)
oncoplot(maf = laml.plus.gistic, top = 10, removeNonMutated = TRUE, sortByMutation = TRUE)
This plot shows frequent deletions in TP53 gene which is located on one of the significanlty deleted locus 17p13.2.
It is often the case that we include meta data to show sample characterstics such as gender, treatment, etc. We can include such meta data by passing them to annotation
argument of oncoplot
. We can also change colors for Variant_Classification by providing a named vector of colors.
#Read FAB classification of TCGA LAML barcodes.
laml.fab.anno = system.file('extdata', 'tcga_laml_fab_annotation.txt', package = 'maftools')
laml.fab.anno = read.delim(laml.fab.anno, sep = '\t')
head(laml.fab.anno)
## Tumor_Sample_Barcode FAB_classification
## 1 TCGA-AB-2802 M4
## 2 TCGA-AB-2803 M3
## 3 TCGA-AB-2804 M3
## 4 TCGA-AB-2805 M0
## 5 TCGA-AB-2806 M1
## 6 TCGA-AB-2807 M1
#Changing colors (You can use any colors, here in this example we will use a color palette from RColorBrewer)
col = RColorBrewer::brewer.pal(n = 8, name = 'Paired')
names(col) = c('Frame_Shift_Del','Missense_Mutation', 'Nonsense_Mutation', 'Multi_Hit', 'Frame_Shift_Ins',
'In_Frame_Ins', 'Splice_Site', 'In_Frame_Del')
#We will plot same top ten mutated genes with FAB classification as annotation and using above defined colors.
oncoplot(maf = laml, top = 10, annotation = laml.fab.anno, removeNonMutated = TRUE, colors = col)
We can visualize any set of genes using oncostrip
function, which draws mutations in each sample similar to OncoPrinter tool on cBioPortal. oncostrip
can be used to draw any number of genes using top
or genes
arguments.
oncostrip(maf = laml, genes = c('DNMT3A','NPM1', 'RUNX1'), removeNonMutated = TRUE, showTumorSampleBarcodes = FALSE)
titv
function classifies SNPs into Transitions and Transversions and returns a list of summarized tables in various ways. Summarized data can also be visulaized as a boxplot showing overall distribution of six different conversions and as a stacked barplot showing fraction of conversions in each sample.
laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = laml.titv)
Lollipop plots are simple and most effective way showing mutation spots on protein structure. Many oncogenes have a preferential sites which are mutated more often than any other locus. These spots are considered to be mutational hotspots and lollipop plots can be used to display them along with rest of the mutations. We can draw such plots using the function lollipopPlot
. This fuction requires us to have amino acid changes information in the maf file. However MAF files have no clear guidelines on naming the field for amino acid changes, with different studies having different field (or column) names for amino acid changes. By default, lollipopPlot
looks for column AAChange
, and if its not found in the MAF file, it prints all availble fields with a warning message. For below example, MAF file contains amino acid changes under a field/column name ‘Protein_Change’. We will manually specify this using argumnet AACol
. This function also returns the plot as a ggplot object, which user can later modify if needed.
#Lets plot lollipop plot for DNMT3A, which is one of the most frequent mutated gene in Leukemia.
dnmt3a.lpop = lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', showMutationRate = TRUE, domainLabelSize = 3, defaultYaxis = FALSE)
## 3 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
## HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_175629 NP_783328 912
## 2: DNMT3A NM_022552 NP_072046 912
## 3: DNMT3A NM_153759 NP_715640 723
## Using longer transcript NM_175629 for now.
## Removed 3 mutations for which AA position was not available
Note that lollipopPlot
warns user on availability of diferent transcripts for the given gene. If we know the transcript id before hand, we can specify it as refSeqID
or proteinID
. By default lollipopPlot uses the longer isoform.
We can also label points on the lollipopPlot
using argument labelPos
. If labelPos
is set to ‘all’, all of the points are highlighted. Sometimes, many mutations are clustered within a range of few amino acid positons. In that case we can use repel
option which tries to repel points for clear representation.
#Lets plot mutations on KIT gene, without repel option.
kit.lpop = lollipopPlot(maf = laml, gene = 'KIT', AACol = 'Protein_Change', labelPos = c(416, 418), refSeqID = 'NM_000222', domainLabelSize = 3)
#Same plot with repel=TRUE
kit.lpop = lollipopPlot(maf = laml, gene = 'KIT', AACol = 'Protein_Change', labelPos = c(416, 418), refSeqID = 'NM_000222', repel = TRUE, domainLabelSize = 3)
MutationMapper on cBioPortal collapses Variant Classfications into truncating and others. It also includes somatic mutation rate.
laml.dnmt3a = lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', refSeqID = 'NM_175629', labelPos = 882, collapsePosLabel = TRUE, cBioPortal = TRUE, domainLabelSize = 3, defaultYaxis = FALSE)
Many cancer genomic studies involve copy number data generated either from sequencing or SNP chip arrays. Copy number analysis provides us a lot of information from genome wide copy number aberations to tumor purity. Most of the time copy number data is stored as segments, with each segment represented by a log ratio of copy number changes compared to matched normals. There are many segmentation algorithms, with the most popular being Circular Binary Segmentation implemented in DNACopy bioconductor 4. We can plot such segmented copy number data and map all mutations on to it. It provides a quick way of knowing which variants (in a way which genes) are located on copy number altered genomic regions.
tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg, maf = laml, labelAll = TRUE)
## Warning: Ignoring unknown aesthetics: label
## Warning: Removed 30 rows containing missing values (geom_segment).
Above plot shows two genes NF1 and SUZ12 are located on a region which has a shallow deletion. Later we will also see how these variants on copy number altered regions affect variant allele frequencies and tumor heterogeneity estimation.
Cancer genomes, especially solid tumors are characterized by genomic loci with localized hypermutations 5. Such hyper mutated genomic regions can be visualized by plotting inter variant distance on a linear genomic scale. These plots generally called rainfall plots and we can draw such plots using rainfallPlot
. If detectChangePoints
is set to TRUE, rainfall
plot also highlights regions where potential changes in inter-event distances are located. But please be aware that detected change-points are only loci where the distribution of inter-event distance changes. Segments may have to be manually inferred by adjacent change-points. This will be improved in future updates.
coad <- system.file("extdata", "coad.maf.gz", package = "maftools")
coad = read.maf(maf = coad)
coad.rf = rainfallPlot(maf = coad, detectChangePoints = TRUE, fontSize = 12, pointSize = 0.6)
## Change points detected at:
## Chromosome Position
## 1: 2 179108701
## 2: 2 179682996
## 3: 5 140161768
## 4: 5 140998034
## 5: 19 8907491
## 6: 19 9065844
## 7: 20 3630019
TCGA contains over 30 different cancer cohrts and median mutation load across them varies from as low as 7 per exome (Pheochromocytoma and Paraganglioma arising from Adrenal Gland) to as high as 315 per exome (Skin Cutaneoys Melanoma). It is informative to see how mutation load in given maf stands against TCGA cohorts. This can can be achieved with the fuction tcgaComapre
which draws distribution of variants compiled from over 10,000 WXS samples across 33 TCGA landmark cohorts. Plot generated is similar to the one described in Alexandrov et al 5.
laml.mutload = tcgaCompare(maf = laml, cohortName = 'Example-LAML')
## Summary..
## Cohort Cohort_Size Median_Mutations
## 1: SKCM 468 315.0
## 2: LUSC 494 187.5
## 3: LUAD 567 158.0
## 4: BLCA 412 131.5
## 5: ESCA 184 85.0
## 6: HNSC 509 82.0
## 7: STAD 439 82.0
## 8: DLBC 48 81.5
## 9: UCEC 542 78.0
## 10: COAD 433 76.0
## 11: OV 443 72.0
## 12: LIHC 374 67.0
## 13: CESC 305 66.0
## 14: READ 158 63.0
## 15: KIRP 288 53.0
## 16: KIRC 339 44.0
## 17: UCS 57 35.0
## 18: BRCA 1044 34.0
## 19: GBM 395 32.0
## 20: SARC 255 31.0
## 21: CHOL 51 30.0
## 22: MESO 83 25.0
## 23: PAAD 178 22.5
## 24: ACC 92 21.5
## 25: LGG 511 19.0
## 26: PRAD 496 19.0
## 27: KICH 66 13.0
## 28: TGCT 149 11.0
## 29: THYM 123 11.0
## 30: LAML 143 10.0
## 31: UVM 80 10.0
## 32: THCA 491 9.0
## 33: Example-LAML 192 9.0
## 34: PCPG 177 7.0
## Cohort Cohort_Size Median_Mutations