1 Introduction

With advances in Cancer Genomics, Mutation Annotation Format (MAF) is being widely accepted and used to store somatic variants detected. The Cancer Genome Atlas Project has sequenced over 30 different cancers with sample size of each cancer type being over 200. Resulting data consisting of somatic variants are stored in the form of Mutation Annotation Format. This package attempts to summarize, analyze, annotate and visualize MAF files in an efficient manner from either TCGA sources or any in-house studies as long as the data is in MAF format.

1.1 Citation

Please cite the below if you find this tool useful for you.

Mayakonda, A. & Koeffler, H.P. Maftools: Efficient analysis, visualization and summarization of MAF files 
from large-scale cohort based cancer studies. bioRxiv (2016). doi: http://dx.doi.org/10.1101/052662

2 MAF field requirements

MAF files contain many fields ranging from chromosome names to cosmic annotations. However most of the analysis in maftools uses following fields.

Mandatory fields: Hugo_Symbol, Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, Variant_Classification, Variant_Type and Tumor_Sample_Barcode.
Recommended optional fields: non MAF specific fields containing vaf and amino acid change information.

Complete specification of MAF files can be found on NCI TCGA page.

This vignette demonstrates the usage and application of maftools on an example MAF file from TCGA LAML cohort 1.

3 Overview of the package

maftools functions can be categorized into mainly Visualization and Analysis modules. Each of these functions and a short description is summarized as shown below. Usage is simple, just read your MAF file with read.maf (along with copy-number data available) and pass the resulting MAF object to the desired function for plotting or analysis.

4 Reading and summarizing maf files

4.1 Required input files

an MAF file - can be gz compressed. Required.
an optional but recommended clinical data associated with each sample/Tumor_Sample_Barcode in MAF.
an optional copy number data if available. Can be GISTIC output or a custom table containing sample names, gene names and copy-number status (Amp or Del).

4.2 Reading MAF files.

read.maf function reads MAF files, summarizes it in various ways and stores it as an MAF object. Even though MAF file is alone enough, it is highly recommended to provide annotations associated with samples in MAF. One can also integrate copy number data if available.

require(maftools)

laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools') #path to TCGA LAML MAF file
laml.clin = system.file('extdata', 'tcga_laml_annot.tsv', package = 'maftools') # clinical information containing survival information and histology. This is optional

laml = read.maf(maf = laml.maf, clinicalData = laml.clin)

4.3 MAF object

Summarized MAF file is stored as an MAF object. MAF object contains main maf file, summarized data and any associated sample annotations.

There are accessor methods to access the useful slots from MAF object.

#Typing laml shows basic summary of MAF file.
laml

## An object of class  MAF 
##                    ID          summary  Mean Median
##  1:        NCBI_Build               37    NA     NA
##  2:            Center genome.wustl.edu    NA     NA
##  3:           Samples              192    NA     NA
##  4:            nGenes             1241    NA     NA
##  5:   Frame_Shift_Del               52 0.271      0
##  6:   Frame_Shift_Ins               91 0.474      0
##  7:      In_Frame_Del               10 0.052      0
##  8:      In_Frame_Ins               42 0.219      0
##  9: Missense_Mutation             1342 6.990      7
## 10: Nonsense_Mutation              103 0.536      0
## 11:       Splice_Site               92 0.479      0
## 12:             total             1732 9.021      9

#Shows sample summry.
getSampleSummary(laml)

##      Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
##   1:         TCGA-AB-3009               0               5            0
##   2:         TCGA-AB-2807               1               0            1
##   3:         TCGA-AB-2959               0               0            0
##   4:         TCGA-AB-3002               0               0            0
##   5:         TCGA-AB-2849               0               1            0
##  ---                                                                  
## 188:         TCGA-AB-2933               0               0            0
## 189:         TCGA-AB-2942               0               0            0
## 190:         TCGA-AB-2946               0               0            0
## 191:         TCGA-AB-2954               0               0            0
## 192:         TCGA-AB-2982               0               0            0
##      In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
##   1:            1                25                 2           1    34
##   2:            0                16                 3           4    25
##   3:            0                22                 0           1    23
##   4:            0                15                 1           5    21
##   5:            0                16                 1           2    20
##  ---                                                                   
## 188:            0                 1                 0           0     1
## 189:            1                 0                 0           0     1
## 190:            0                 1                 0           0     1
## 191:            0                 1                 0           0     1
## 192:            0                 1                 0           0     1

#Shows gene summary.
getGeneSummary(laml)

##       Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
##    1:        FLT3               0               0            1
##    2:      DNMT3A               4               0            0
##    3:        NPM1               0              33            0
##    4:        IDH2               0               0            0
##    5:        IDH1               0               0            0
##   ---                                                         
## 1237:      ZNF689               0               0            0
## 1238:      ZNF75D               0               0            0
## 1239:      ZNF827               1               0            0
## 1240:       ZNF99               0               0            0
## 1241:        ZPBP               0               0            0
##       In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
##    1:           33                15                 0           3    52
##    2:            0                39                 5           6    54
##    3:            0                 1                 0           0    34
##    4:            0                20                 0           0    20
##    5:            0                18                 0           0    18
##   ---                                                                   
## 1237:            0                 1                 0           0     1
## 1238:            0                 1                 0           0     1
## 1239:            0                 0                 0           0     1
## 1240:            0                 1                 0           0     1
## 1241:            0                 1                 0           0     1
##       MutatedSamples AlteredSamples
##    1:             52             52
##    2:             48             48
##    3:             33             33
##    4:             20             20
##    5:             18             18
##   ---                              
## 1237:              1              1
## 1238:              1              1
## 1239:              1              1
## 1240:              1              1
## 1241:              1              1

#shows clinical data associated with samples
getClinicalData(laml)

##      Tumor_Sample_Barcode FAB_classification days_to_last_followup
##   1:         TCGA-AB-2802                 M4                   365
##   2:         TCGA-AB-2803                 M3                   792
##   3:         TCGA-AB-2804                 M3                  2557
##   4:         TCGA-AB-2805                 M0                   577
##   5:         TCGA-AB-2806                 M1                   945
##  ---                                                              
## 188:         TCGA-AB-3007                 M3                  1581
## 189:         TCGA-AB-3008                 M1                   822
## 190:         TCGA-AB-3009                 M4                   577
## 191:         TCGA-AB-3011                 M1                  1885
## 192:         TCGA-AB-3012                 M3                  1887
##      Overall_Survival_Status
##   1:                       1
##   2:                       1
##   3:                       0
##   4:                       1
##   5:                       1
##  ---                        
## 188:                       0
## 189:                       1
## 190:                       1
## 191:                       0
## 192:                       0

#Shows all fields in MAF
getFields(laml)

##  [1] "Hugo_Symbol"            "Entrez_Gene_Id"        
##  [3] "Center"                 "NCBI_Build"            
##  [5] "Chromosome"             "Start_Position"        
##  [7] "End_Position"           "Strand"                
##  [9] "Variant_Classification" "Variant_Type"          
## [11] "Reference_Allele"       "Tumor_Seq_Allele1"     
## [13] "Tumor_Seq_Allele2"      "Tumor_Sample_Barcode"  
## [15] "Protein_Change"         "i_TumorVAF_WU"         
## [17] "i_transcript_name"

#Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = 'laml')

5 Visualization

5.1 Plotting MAF summary.

We can use plotmafSummary to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification. We can add either mean or median line to the stacked barplot to display average/median number of variants across the cohort.

plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

5.2 Oncoplots

5.2.1 Drawing oncoplots

Better representation of maf file can be shown as oncoplots, also known as waterfall plots. Oncoplot function uses “ComplexHeatmap” to draw oncoplots2. To be specific, oncoplot is a wrapper around ComplexHeatmap’s OncoPrint function with little modification and automation which makes plotting easier. Side barplot and top barplots can be controlled by drawRowBar and drawColBar arguments respectively.

#We will draw oncoplots for top ten mutated genes.
oncoplot(maf = laml, top = 10, fontSize = 12)

NOTE: Variants annotated as Multi_Hit are those genes which are mutated more than once in the same sample.

5.2.2 Including copy number data into oncoplots.

If we have copy number data along with MAF file, we can include them in oncoplot as to show if any genes are amplified or deleted. One most widely used tool for copy number analysis from large scale studies is GISTIC and we can simultaneously read gistic results along with MAF3. GISTIC generates numerous files but we need mainly four files all_lesions.conf_XX.txt, amp_genes.conf_XX.txt, del_genes.conf_XX.txt, scores.gistic where XX is confidence level. These files contain significantly altered genomic regions along with amplified and deleted genes respectively.

#read TCGA maf file for LAML
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
scores.gis <- system.file("extdata", "scores.gistic", package = "maftools")
laml.plus.gistic = read.maf(maf = laml.maf, gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, gisticScoresFile = scores.gis, isTCGA = TRUE)

#We will draw oncoplots for top ten mutated genes. (Removing non-mutated samples from the plot for better visualization)
oncoplot(maf = laml.plus.gistic, top = 10, fontSize = 12)

This plot shows frequent deletions in TP53 gene which is located on one of the significantly deleted locus 17p13.2.

5.2.3 Changing colors, including MutSig Q-Values, adding annotations and sorting.

Oncoplots can be further improved by including annotations (clinical features) associated with samples, changing colors for variant classifications and including q-values of significance (either generated from MutSig or similar program). Below plot includes,

list of significantly mutated genes (SMG) and their q-values as a side barplot,
FAB classification of each sample as bottom annotations.
custom color coding for variant classification.

One can use SMG list generated from any progrm as long as it contains two required columns titled gene and q

#Changing colors for variant classifications (You can use any colors, here in this example we will use a color palette from RColorBrewer)
col = RColorBrewer::brewer.pal(n = 8, name = 'Paired')
names(col) = c('Frame_Shift_Del','Missense_Mutation', 'Nonsense_Mutation', 'Multi_Hit', 'Frame_Shift_Ins',
               'In_Frame_Ins', 'Splice_Site', 'In_Frame_Del')

#Color coding for FAB classification; try getAnnotations(x = laml) to see available annotations.
fabcolors = RColorBrewer::brewer.pal(n = 8,name = 'Spectral')
names(fabcolors) = c("M0", "M1", "M2", "M3", "M4", "M5", "M6", "M7")
fabcolors = list(FAB_classification = fabcolors)

#MutSig reusults
laml.mutsig <- system.file("extdata", "LAML_sig_genes.txt.gz", package = "maftools")

oncoplot(maf = laml, colors = col, mutsig = laml.mutsig, mutsigQval = 0.01, clinicalFeatures = 'FAB_classification', sortByAnnotation = TRUE, annotationColor = fabcolors)

Also note that using sortByAnnotation = TRUE further sorts oncoplot according to annotations provided (here FAB_classification) and helps in clear representation of mutational patters within subtypes. For example above plot shows NPM1 is rarely mutated in M0 type of AML whereas RUNX1 is virtually absent in M5 subtype.

5.3 Oncostrip

We can visualize any set of genes using oncostrip function, which draws mutations in each sample similar to OncoPrinter tool on cBioPortal. oncostrip can be used to draw any number of genes using top or genes arguments.

oncostrip(maf = laml, genes = c('DNMT3A','NPM1', 'RUNX1'))

5.4 Transition and Transversions.

titv function classifies SNPs into Transitions and Transversions and returns a list of summarized tables in various ways. Summarized data can also be visualized as a boxplot showing overall distribution of six different conversions and as a stacked barplot showing fraction of conversions in each sample.

laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = laml.titv)

5.5 Lollipop plots for amino acid changes

Lollipop plots are simple and most effective way showing mutation spots on protein structure. Many oncogenes have a preferential sites which are mutated more often than any other locus. These spots are considered to be mutational hot-spots and lollipop plots can be used to display them along with rest of the mutations. We can draw such plots using the function lollipopPlot. This function requires us to have amino acid changes information in the maf file. However MAF files have no clear guidelines on naming the field for amino acid changes, with different studies having different field (or column) names for amino acid changes. By default, lollipopPlot looks for column AAChange, and if its not found in the MAF file, it prints all available fields with a warning message. For below example, MAF file contains amino acid changes under a field/column name ‘Protein_Change’. We will manually specify this using argument AACol. This function also returns the plot as a ggplot object, which user can later modify if needed.

#Lets plot lollipop plot for DNMT3A, which is one of the most frequent mutated gene in Leukemia.
dnmt3a.lpop = lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', showMutationRate = TRUE, domainLabelSize = 3, defaultYaxis = FALSE)

## 3 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.

##      HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_175629  NP_783328       912
## 2: DNMT3A NM_022552  NP_072046       912
## 3: DNMT3A NM_153759  NP_715640       723

## Using longer transcript NM_175629 for now.

## Removed 3 mutations for which AA position was not available

Note that lollipopPlot warns user on availability of different transcripts for the given gene. If we know the transcript id before hand, we can specify it as refSeqID or proteinID. By default lollipopPlot uses the longer isoform.

5.5.1 Labelling points.

We can also label points on the lollipopPlot using argument labelPos. If labelPos is set to ‘all’, all of the points are highlighted.

#Lets plot mutations on KIT gene, without repel option.
kit.lpop = lollipopPlot(maf = laml, gene = 'KIT', AACol = 'Protein_Change', labelPos = c(416, 418), refSeqID = 'NM_000222', domainLabelSize = 3, labPosSize = 3)

5.5.2 cBioPortal style annotations

MutationMapper on cBioPortal collapses Variant Classifications into truncating and others. It also includes somatic mutation rate.

laml.dnmt3a = lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', refSeqID = 'NM_175629', labelPos = 882, collapsePosLabel = TRUE, cBioPortal = TRUE, domainLabelSize = 3, defaultYaxis = FALSE, labPosSize = 3)

5.6 Integrating somatic variants and copy number alterations

Many cancer genomic studies involve copy number data generated either from sequencing or SNP chip arrays. Copy number analysis provides us a lot of information from genome wide copy number aberrations to tumor purity. Most of the time copy number data is stored as segments, with each segment represented by a log ratio of copy number changes compared to matched normals. There are many segmentation algorithms, with the most popular being Circular Binary Segmentation implemented in “DNACopy” 4. We can plot such segmented copy number data and map all mutations on to it. It provides a quick way of knowing which variants (in a way which genes) are located on copy number altered genomic regions.

tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg, maf = laml, labelAll = TRUE)

## Warning: Ignoring unknown aesthetics: label

## Warning: Removed 30 rows containing missing values (geom_segment).

Above plot shows two genes NF1 and SUZ12 are located on a region which has a shallow deletion. Later we will also see how these variants on copy number altered regions affect variant allele frequencies and tumor heterogeneity estimation.

5.7 Rainfall plots

Cancer genomes, especially solid tumors are characterized by genomic loci with localized hyper-mutations 5. Such hyper mutated genomic regions can be visualized by plotting inter variant distance on a linear genomic scale. These plots generally called rainfall plots and we can draw such plots using rainfallPlot. If detectChangePoints is set to TRUE, rainfall plot also highlights regions where potential changes in inter-event distances are located.

coad <- system.file("extdata", "coad.maf.gz", package = "maftools")
coad = read.maf(maf = coad)

 rainfallPlot(maf = coad, detectChangePoints = TRUE, fontSize = 12, pointSize = 0.6)

## Processing TCGA.AG.A002.01..

## Change points detected at:

##    Chromosome Start_Position End_Position nMuts Avg_intermutation_dist
## 1:          1      150459281    150595840    19                7586.61
## 2:          2      179106225    179359080    85                3010.18
## 3:          5      140161299    140845781    92                7521.78
## 4:         13       22803110     22813071    14                 766.23
## 5:         19        8907340      8950788    28                1609.19
##      Size C>A C>T T>C C>G T>A T>G Filter
## 1: 136559   8   9   2  NA  NA  NA      .
## 2: 252855  31  37   6   1   3   7      .
## 3: 684482  32  46   6   1   1   6      .
## 4:   9961   6   2   1  NA  NA   5   PASS
## 5:  43448  15   8   3  NA  NA   2      .

Filter column in the above results indicate whether the identified segment passes the definition of “Kataegis” which are defined as those genomic segments containing six or more consecutive mutations with an average inter-mutation distance of less than or equal to 1,00 bp 5.

5.8 Compare mutation load against TCGA cohorts

TCGA contains over 30 different cancer cohorts and median mutation load across them varies from as low as 7 per exome (Pheochromocytoma and Paraganglioma arising from Adrenal Gland) to as high as 315 per exome (Skin Cutaneoys Melanoma). It is informative to see how mutation load in given maf stands against TCGA cohorts. This can can be achieved with the function tcgaComapre which draws distribution of variants compiled from over 10,000 WXS samples across 33 TCGA landmark cohorts. Plot generated is similar to the one described in Alexandrov et al 5.

laml.mutload = tcgaCompare(maf = laml, cohortName = 'Example-LAML')

5.9 Plotting VAF

This function plots Variant Allele Frequencies as a boxplot which quickly helps to estimate clonal status of top mutated genes (clonal genes usually have mean allele frequency around ~50% assuming pure sample)

vafPlot = plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU', flip = FALSE)

## Warning: Removed 7 rows containing missing values (geom_point).

5.10 Genecloud

We can plot word cloud plot for mutated genes with the function geneCloud. Size of each gene is proportional to the total number of samples in which it is mutated/altered.

geneCloud(input = laml, minMut = 3)

6 Processing copy-number data

6.1 Reading and summarizing gistic output files.

We can summarize output files generated by GISTIC programme. As mentioned earlier, we need four files that are generated by GISTIC, i.e, all_lesions.conf_XX.txt, amp_genes.conf_XX.txt, del_genes.conf_XX.txt and scores.gistic, where XX is the confidence level. See GISTIC documentation for details.

all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
scores.gis <- system.file("extdata", "scores.gistic", package = "maftools")

laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, gisticScoresFile = scores.gis, isTCGA = TRUE)

## Processing Gistic files..

## Processing amp_genes.conf_99.txt..

## Processing del_genes.conf_99.txt..

## Processing scores.gistic..

## Summarizing samples..

#GISTIC object
laml.gistic

## An object of class  GISTIC 
##           ID summary
## 1:   Samples      66
## 2:    nGenes    2622
## 3: cytoBands      16
## 4:       Amp     388
## 5:       Del   26481
## 6:     total   26869

Similar to MAF objects, there are methods available to access slots of GISTIC object - getSampleSummary, getGeneSummary and getCytoBandSummary. Summarized results can be written to output files using function write.GisticSummary.

6.2 Visualizing gistic results.

There are three types of plots available to visualize gistic results.

6.2.1 genome plot

gcp = gisticChromPlot(gistic = laml.gistic, markBands = "all")

## Warning: Ignoring unknown aesthetics: label

6.2.2 Bubble plot

gbp = gisticBubblePlot(gistic = laml.gistic)

6.2.3 oncoplot

This is similar to oncoplots except for copy number data. One can again sort the matrix according to annotations, if any. Below plot is the gistic results for LAML, sorted according to FAB classification. Plot shows that 7q deletions are virtually absent in M4 subtype where as it is widespread in other subtypes.

gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = 'FAB_classification', sortByAnnotation = TRUE, top = 10)

7 Analysis

7.1 Somatic Interactions

Many disease causing genes in cancer are co-occurring or show strong exclusiveness in their mutation pattern. Such mutually exclusive or co-occurring set of genes can be detected using somaticInteractions function, which performs pair-wise Fisher’s Exact test to detect such significant pair of genes. somaticInteractions function also uses cometExactTest to identify potentially altered gene sets involving >2 two genes 6.

#We will run mutExclusive on top 10 mutated genes. 
somaticInteractions(maf = laml, top = 25, pvalue = c(0.05, 0.1))

##               gene_set       pvalue
## 1:   TP53, FLT3, RUNX1 4.830385e-05
## 2:    TP53, IDH2, FLT3 8.596593e-05
## 3: TP53, RUNX1, DNMT3A 5.109811e-03

## $pairs
##      gene1  gene2       pValue  oddsRatio  00 11 01 10              Event
##  1:  RUNX1  ASXL1 0.0001573764 54.9015813 175  4  1 12       Co_Occurance
##  2:  RUNX1   IDH2 0.0002906170  9.5325828 163  7 13  9       Co_Occurance
##  3:   IDH2  ASXL1 0.0004114350 40.8398240 171  4  1 16       Co_Occurance
##  4:   FLT3   NPM1 0.0010412734  3.7331578 124 17 16 35       Co_Occurance
##  5:   SMC3 DNMT3A 0.0010773516 20.0377249 143  6 42  1       Co_Occurance
##  6:   NPM1 DNMT3A 0.0015025851  3.7040763 127 16 32 17       Co_Occurance
##  7: DNMT3A   IDH1 0.0035284498  4.4297316 136 10  8 38       Co_Occurance
##  8:    TTN  ASXL1 0.0078402697 28.3054161 183  2  3  4       Co_Occurance
##  9:  RUNX1   PHF6 0.0082268971 12.8927337 173  3  3 13       Co_Occurance
## 10:  RUNX1    TTN 0.0082268971 12.8927337 173  3  3 13       Co_Occurance
## 11:   TP53   FLT3 0.0124927250  0.0000000 125 NA 52 15 Mutually_Exclusive
## 12:  STAG2 PTPN11 0.0266587778 12.3227338 179  2  7  4       Co_Occurance
## 13:   IDH2   NPM1 0.0276230694  0.0000000 139 NA 33 20 Mutually_Exclusive
## 14:   IDH2   KRAS 0.0387926319  5.7982788 167  3  5 17       Co_Occurance
## 15:    WT1   PHF6 0.0468111161  8.5748381 176  2  4 10       Co_Occurance
## 16:   NPM1 PTPN11 0.0487632619  4.2029330 154  4  5 29       Co_Occurance
## 17:   IDH2  PLCE1 0.0545842683  9.2259294 170  2  2 18       Co_Occurance
## 18:   NPM1  SMC1A 0.0643753304  5.1344612 156  3  3 30       Co_Occurance
## 19:   NRAS  CEBPA 0.0686935030  4.1246702 167  3 10 12       Co_Occurance
## 20:   FLT3  RUNX1 0.0741280180  0.1643861 125  1 15 51 Mutually_Exclusive
## 21:  FAM5C   EZH2 0.0764974855 21.5855527 185  1  2  4       Co_Occurance
## 22:  RUNX1 DNMT3A 0.0776417572  0.1840255 129  1 47 15 Mutually_Exclusive
## 23:   NPM1   TP53 0.0782787771  0.0000000 144 NA 15 33 Mutually_Exclusive
## 24:  RUNX1   NPM1 0.0788398517  0.0000000 143 NA 33 16 Mutually_Exclusive
## 25:  RUNX1  STAG2 0.0803208954  6.0319705 172  2  4 14       Co_Occurance
## 26:   TET2  STAG2 0.0897149618  5.6062361 171  2  4 15       Co_Occurance
## 27: DNMT3A   FLT3 0.0902829421  1.9339713 110 18 34 30       Co_Occurance
## 28:   NPM1   IDH1 0.0922742345  2.7040207 147  6 12 27       Co_Occurance
##      gene1  gene2       pValue  oddsRatio  00 11 01 10              Event
## 
## $gene_sets
##               gene_set       pvalue
## 1:   TP53, FLT3, RUNX1 4.830385e-05
## 2:    TP53, IDH2, FLT3 8.596593e-05
## 3: TP53, RUNX1, DNMT3A 5.109811e-03

We can visualize the above results using oncostrip. For example, in above analysis, gene set TP53, FLT3 and RUNX1 show a strong exclusiveness with each other a p-value of 4.8e-5.

oncostrip(maf = laml, genes = c('TP53', 'FLT3', 'RUNX1'))

7.2 Detecting cancer driver genes based on positional clustering

maftools has a function oncodrive which identifies cancer genes (driver) from a given MAF. oncodrive is a based on algorithm oncodriveCLUST which was originally implemented in Python. Concept is based on the fact that most of the variants in cancer causing genes are enriched at few specific loci (aka hot-spots). This method takes advantage of such positions to identify cancer genes. If you use this function, please cite OncodriveCLUST article 7.

laml.sig = oncodrive(maf = laml, AACol = 'Protein_Change', minMut = 5, pvalMethod = 'zscore')
head(laml.sig)

We can plot the results using plotOncodrive.

plotOncodrive(res = laml.sig, fdrCutOff = 0.1, useFraction = TRUE)

plotOncodrive plots the results as scatter plot with size of the points proportional to the number of clusters found in the gene. X-axis shows number of mutations (or fraction of mutations) observed in these clusters. In the above example, IDH1 has a single cluster and all of the 18 mutations are accumulated within that cluster, giving it a cluster score of one. For details on oncodrive algorithm, please refer to OncodriveCLUST article 7.

7.3 Adding and summarizing pfam domains

maftools comes with the function pfamDomains, which adds pfam domain information to the amino acid changes. pfamDomain also summarizes amino acid changes according to the domains that are affected. This serves the purpose of knowing what domain in given cancer cohort, is most frequently affected. This function is inspired from Pfam annotation module from MuSic tool 8.

laml.pfam = pfamDomains(maf = laml, AACol = 'Protein_Change', top = 10)

## Removed 50 mutations for which AA position was not available

#Protein summary (Printing first 7 columns for display convenience)
laml.pfam$proteinSummary[,1:7, with = FALSE]

##         HGNC AAPos Variant_Classification  N total  fraction   DomainLabel
##    1: DNMT3A   882      Missense_Mutation 27    54 0.5000000 AdoMet_MTases
##    2:   IDH1   132      Missense_Mutation 18    18 1.0000000      PTZ00435
##    3:   IDH2   140      Missense_Mutation 17    20 0.8500000      PTZ00435
##    4:   FLT3   835      Missense_Mutation 14    52 0.2692308      PKc_like
##    5:   FLT3   599           In_Frame_Ins 10    52 0.1923077      PKc_like
##   ---                                                                     
## 1512: ZNF646   875      Missense_Mutation  1     1 1.0000000            NA
## 1513: ZNF687   554      Missense_Mutation  1     2 0.5000000            NA
## 1514: ZNF687   363      Missense_Mutation  1     2 0.5000000            NA
## 1515: ZNF75D     5      Missense_Mutation  1     1 1.0000000            NA
## 1516: ZNF827   427        Frame_Shift_Del  1     1 1.0000000            NA

#Domain summary (Printing first 3 columns for display convenience)
laml.pfam$domainSummary[,1:3, with = FALSE]

##        DomainLabel nMuts nGenes
##   1:      PKc_like    55      5
##   2:      PTZ00435    38      2
##   3: AdoMet_MTases    33      1
##   4:         7tm_1    24     24
##   5:       COG5048    17     17
##  ---                           
## 499:    ribokinase     1      1
## 500:   rim_protein     1      1
## 501: sigpep_I_bact     1      1
## 502:           trp     1      1
## 503:        zf-BED     1      1

Above plot and results shows AdoMet_MTases domain is frequently mutated, but number genes with this domain is just one (DNMT3A) compared to other domains such as 7tm_1 domain, which is mutated across 24 different genes. This shows the importance of mutations in methyl transfer domains in Leukemia.

7.4 Pan-Cancer comparison

Lawrence et al performed MutSigCV analysis on 21 cancer cohorts and identified over 200 genes to be significantly mutated which consists of previously undescribed novel genes 9. Their results show only few genes are mutated in multiple cohort while many of them are tissue/cohort specific. We can compare mutSig results against this pan-can list of significantly mutated genes to see genes specifically mutated in given cohort. This function requires MutSigCV results (usually named sig_genes.txt) as an input.

#MutsigCV results for TCGA-AML
laml.mutsig <- system.file("extdata", "LAML_sig_genes.txt.gz", package = "maftools")
laml.pancan = pancanComparison(mutsigResults = laml.mutsig, qval = 0.1, cohortName = 'LAML', inputSampleSize = 200, label = 1, normSampleSize = TRUE)

## Significantly mutated genes in LAML (q < 0.1): 23

## Significantly mutated genes in PanCan cohort (q <0.1): 114

## Significantly mutated genes exclusive to LAML (q < 0.1):

##       gene pancan            q nMut
##  1:  CEBPA  1.000 3.500301e-12   13
##  2:   EZH2  1.000 7.463546e-05    3
##  3: GIGYF2  1.000 6.378338e-03    2
##  4:    KIT  0.509 1.137517e-05    8
##  5:   PHF6  0.783 6.457555e-09    6
##  6: PTPN11  0.286 7.664584e-03    9
##  7:  RAD21  0.929 1.137517e-05    5
##  8:  SMC1A  0.801 2.961696e-03    6
##  9:   TET2  0.907 2.281625e-13   17
## 10:    WT1  1.000 2.281625e-13   12

Above results show genes such as CEBPa, TET2 and WT1 are specifically mutated in Leukemia.

7.5 Survival analysis

Survival analysis is an essential part of cohort based sequencing projects. Function mafSurvive performs survival analysis and draws kaplan meier curve by grouping samples based on mutation status of user defined gene(s) or manually provided samples those make up a group. This function requires input data to contain Tumor_Sample_Barcode (make sure they match to those in MAF file), binary event (1/0) and time to event.

Our annotation data already contains survival information and in case you have survival data stored in a separate table provide them via argument clinicalData

#Survival analysis based on grouping of DNMT3A mutation status
laml.dnmt3a.surv = mafSurvival(maf = laml, genes = 'DNMT3A', time = 'days_to_last_followup', Status = 'Overall_Survival_Status', isTCGA = TRUE)

## Looking for clinical data in annoatation slot of MAF..

## Number of mutated samples for given genes:

## DNMT3A 
##     48

## Median survival..

##     Group medianTime   N
## 1: Mutant        243  48
## 2:     WT        366 144

7.6 Comparing two cohorts (MAFs)

Cancers differ from each other in terms of their mutation pattern. We can compare two different cohorts to detect such differentially mutated genes. For example, recent article by Madan et. al 9, have shown that patients with relapsed APL (Acute Promyelocytic Leukemia) tends to have mutations in PML and RARA genes, which were absent during primary stage of the disease. This difference between two cohorts (in this case primary and relapse APL) can be detected using function mafComapre, which performs fisher test on all genes between two cohorts to detect differentially mutated genes.

#Primary APL MAF
primary.apl = system.file("extdata", "APL_primary.maf.gz", package = "maftools")
primary.apl = read.maf(maf = primary.apl)
#Relapse APL MAF
relapse.apl = system.file("extdata", "APL_relapse.maf.gz", package = "maftools")
relapse.apl = read.maf(maf = relapse.apl)

#We will consider only genes which are mutated in at-least in 5 samples in one of the cohort, to avoid bias due to single mutated genes.
pt.vs.rt <- mafCompare(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', minMut = 5)
print(pt.vs.rt)

## $results
##    Hugo_Symbol PrimaryAPL RelapseAPL         pval         or        ci.up
## 1:         PML          1         11 3.276258e-05 0.03868764 0.0008765166
## 2:        RARA          0          7 4.332703e-04 0.00000000 0.0000000000
## 3:        FLT3         26          4 8.883112e-03 4.26676459 1.3552412834
## 4:       RUNX1          1          5 1.866537e-02 0.09738200 0.0020113363
## 5:      ARID1B          5          8 6.207511e-02 0.29480926 0.0714860893
## 6:        NRAS         15          4 3.004921e-01 2.13939926 0.6316876315
## 7:         WT1         20         14 4.118662e-01 0.68823250 0.2924505351
## 8:        KRAS          6          1 4.224665e-01 3.27572809 0.3808369479
## 9:      ARID1A          7          4 1.000000e+00 0.91445360 0.2192017782
##         ci.low      adjPval
## 1:   0.2818500 0.0002948632
## 2:   0.3346038 0.0019497163
## 3:  17.9013510 0.0266493374
## 4:   0.9055227 0.0419970785
## 5:   1.0915108 0.1117351977
## 6:   9.3745934 0.4507381202
## 7:   1.6486725 0.4752748561
## 8: 154.5190905 0.4752748561
## 9:   4.4805164 1.0000000000
## 
## $SampleSummary
##        Cohort SampleSize
## 1: PrimaryAPL         97
## 2: RelapseAPL         51

Above results show two genes PML and RARA which are highly mutated in Relapse APL compared to Primary APL. We can visualize these results as a forestplot.

apl.pt.vs.rt.fp = forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.05, show = 'stat', color = c('royalblue', 'maroon'))

Another alternative way of displaying above results is by plotting two oncoplots side by side. coOncoplot function takes two maf objects and plots them side by side for better comparison.

genes = c("PML", "RARA", "RUNX1", "ARID1B", "FLT3")
coOncoplot(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', genes = genes, removeNonMutated = TRUE)

7.7 Tumor heterogeneity and MATH scores

7.7.1 Heterogeneity in tumor samples

Tumors are generally heterogeneous i.e, consist of multiple clones. This heterogeneity can be inferred by clustering variant allele frequencies. inferHeterogeneity function uses vaf information to cluster variants (using mclust), to infer clonality. By default, inferHeterogeneity function looks for column t_vaf containing vaf information. However, if the field name is different from t_vaf, we can manually specify it using argument vafCol. For example, in this case study vaf is stored under the field name i_TumorVAF_WU.

#We will run this for sample TCGA.AB.2972
tcga.ab.2972.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-2972', vafCol = 'i_TumorVAF_WU')

## Processing TCGA-AB-2972..

print(tcga.ab.2972.het$clusterMeans)

##    Tumor_Sample_Barcode cluster   meanVaf
## 1:         TCGA-AB-2972       2 0.4496571
## 2:         TCGA-AB-2972       1 0.2454750
## 3:         TCGA-AB-2972 outlier 0.3695000

#Visualizing results
plotClusters(clusters = tcga.ab.2972.het)

Above figure shows clear separation of two clones clustered at mean variant allele frequencies of ~45% (major clone) and another minor clone at variant allele frequency of ~25%.

Although clustering of variant allele frequencies gives us a fair idea on heterogeneity, it is also possible to measure the extent of heterogeneity in terms of a numerical value. MATH score (mentioned as a subtitle in above plot) is a simple quantitative measure of intra-tumor heterogeneity, which calculates the width of the vaf distribution. Higher MATH scores are found to be associated with poor outcome. MATH score can also be used a proxy variable for survival analysis 10.

7.7.2 Ignoring variants in copy number altered regions

We can use copy number information to ignore variants located on copynumber altered regions. Copy number alterations results in abnormally high/low variant allele frequencies, which tends to affect clustering. Removing such variants improves clustering and density estimation while retaining biologically meaningful results. Copy number information can be provided as a segmented file generated from segmentation programmes, such as Circular Binary Segmentation from “DNACopy” Bioconductor package 6.

seg = system.file('extdata', 'TCGA.AB.3009.hg19.seg.txt', package = 'maftools')
tcga.ab.3009.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-3009', segFile = seg, vafCol = 'i_TumorVAF_WU')

## Processing TCGA-AB-3009..

## Removed 1 variants with no copy number data.

##    Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
## 1:        PHF6         23      133551224    133551224         TCGA-AB-3009
##        t_vaf Segment_Start Segment_End Segment_Mean CN
## 1: 0.9349112            NA          NA           NA NA

## Copy number altered variants:

##    Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
## 1:         NF1         17       29562981     29562981         TCGA-AB-3009
## 2:       SUZ12         17       30293198     30293198         TCGA-AB-3009
##        t_vaf Segment_Start Segment_End Segment_Mean       CN    cluster
## 1: 0.8419000      29054355    30363868      -0.9157 1.060173 CN_altered
## 2: 0.8958333      29054355    30363868      -0.9157 1.060173 CN_altered

#Visualizing results. Highlighting those variants on copynumber altered variants.
plotClusters(clusters = tcga.ab.3009.het, genes = 'CN_altered', showCNvars = TRUE)

Above figure shows two genes NF1 and SUZ12 with high VAF’s, which is due to copy number alterations (deletion). Those two genes are ignored from analysis.

7.8 Mutational Signatures

Every cancer, as it progresses leaves a signature characterized by specific pattern of nucleotide substitutions. Alexandrov et.al have shown such mutational signatures, derived from over 7000 cancer samples 5. Such signatures can be extracted by decomposing matrix of nucleotide substitutions, classified into 96 substitution classes based on immediate bases surrounding the mutated base. Extracted signatures can also be compared to those validated signatures.

First step in signature analysis is to obtain the adjacent bases surrounding the mutated base and form a mutation matrix. NOTE: Even-though reading fasta and extracting bases is fairly fast, it is a memory consuming process as it occupies ~3gb of memory while running.

#First we extract adjacent bases to the mutated locus and clssify them into 96 substitution classes. This also estimates APOBEC enrichment per sample.
laml.tnm = trinucleotideMatrix(maf = laml, ref_genome = '/path/to/hg19.fa', 
                               prefix = 'chr', add = TRUE, ignoreChr = 'chr23', useSyn = TRUE)
# reading /path/to/hg19.fa (this might take few minutes)..
# Extracting 5' and 3' adjacent bases..
# Extracting +/- 20bp around mutated bases for background estimation..
# Estimating APOBEC enrichment scores.. 
# Performing one-way Fisher's test for APOBEC enrichment..
# APOBEC related mutations are enriched in 2.674% of samples (APOBEC enrichment score > 2 ; 5 of 187 samples)
# Creating mutation matrix..
# matrix of dimension 193x96

Above function performs two steps:

Estimates APOBEC enrichment scores
Prepares a mutational matrix for signature analysis.

7.8.1 APOBEC Enrichment estimation.

APOBEC induced mutations are more frequent in solid tumors and are mainly associated with C>T transition events occurring in TCW motif. APOBEC enrichment scores in the above command are estimated using the method described by Roberts et al 12. Briefly, enrichment of C>T mutations occurring within TCW motif over all of the C>T mutations in a given sample is compared to background Cytosines and TCWs occurring within 20bp of mutated bases.

\[\frac{n_{tCw} * background_C}{n_C * background_{TCW}}\]

One-sided fishers exact test is also performed to statistically evaluate the enrichment score, as described in original study by Roberts et al.

7.8.2 Differences between APOBEC enriched and non-enriched samples

We can also analyze the differences in mutational patterns between APOBEC enriched and non-APOBEC enriched samples. plotApobecDiff is a function which takes APOBEC enrichment scores estimated by trinucleotideMatrix and classifies samples into APOBEC enriched and non-APOBEC enriched. Once stratified, it compares these two groups to identify differentially altered genes.

Note that, LAML with no APOBEC enrichments, is not an ideal cohort for this sort of analysis and hence below plot is only for demonstration purpose.

plotApobecDiff(tnm = laml.tnm, maf = laml)

## $results
##      Hugo_Symbol Enriched nonEnriched       pval       or     ci.up
##   1:        TP53        2          13 0.05851471 7.905806 0.6100479
##   2:        FLT3        2          43 0.60324507 1.975770 0.1600477
##   3:       CEBPA        0          12 1.00000000 0.000000 0.0000000
##   4:        IDH1        0          18 1.00000000 0.000000 0.0000000
##   5:        IDH2        0          18 1.00000000 0.000000 0.0000000
##  ---                                                               
## 132:         WAC        0           2 1.00000000 0.000000 0.0000000
## 133:      ZBTB33        0           2 1.00000000 0.000000 0.0000000
## 134:      ZC3H18        0           2 1.00000000 0.000000 0.0000000
## 135:      ZNF687        0           2 1.00000000 0.000000 0.0000000
## 136:       RUNX1        0          15 1.00000000 0.000000 0.0000000
##         ci.low adjPval
##   1:  75.73718       1
##   2:  17.84629       1
##   3:  16.05102       1
##   4:   9.97986       1
##   5:   9.97986       1
##  ---                  
## 132: 199.22196       1
## 133: 199.22196       1
## 134: 199.22196       1
## 135: 199.22196       1
## 136:  12.36252       1
## 
## $SampleSummary
##         Cohort SampleSize  Mean Median
## 1:    Enriched          5 7.200      6
## 2: nonEnriched        171 9.772      9

7.8.3 Signature analysis

extractSignatures uses non-negative matrix factorization to decompose nx96 dimension matrix into r signatures, where n is the number of samples from input MAF 11. By default function runs nmf on 6 ranks and chooses the best possible value based on maximum cophenetic-correlation coefficient. It is also possible to manually specify r. Once decomposed, signatures are compared against known signatures derived from Alexandrov et.al, and cosine similarity is calculated to identify best match.

#Run main function with maximum 6 signatures. 
require('NMF')
laml.sign = extractSignatures(mat = laml.tnm, nTry = 6, plotBestFitRes = FALSE)
# Warning : Found zero mutations for conversions A[T>G]C
#Comparing against experimentally validated 30 signatures.. (See http://cancer.sanger.ac.uk/cosmic/signatures for details.)
#Found Signature_1 most similar to validated Signature_1. Aetiology: spontaneous deamination of 5-methylcytosine [cosine-similarity: 0.767] 
#Found Signature_2 most similar to validated Signature_1. Aetiology: spontaneous deamination of 5-methylcytosine [cosine-similarity: 0.757]

plotSignatures(laml.sign)

extractSignatures gives a warning that no mutations are found for class A[T>G]C conversions. This is possible when the number of samples are low or in tumors with low mutation rate, such as in this case of Leukemia. In this scenario, a small positive value is added to avoid computational difficulties. It also prints other statistics for range of values that was tried, and chooses the rank with highest cophenetic metric (for above example r=2). Above stats should give an estimate of range of best possible r values and in case the chosen r is overfitted/underfitted, it is also possible to be re-run extractSignatures by manually specifying r.

Once decomposed, signatures are compared against known and validated signatures from Sanger 11. See here for list of validated signatures. In the above example, 2 signatures are derived, which are similar to validated Signature-1. Signature_1 is a result of elevated rate of spontaneous deamination of 5-methyl-cytosine, resulting in C>T transitions and predominantly occurs at NpCpG trinucleotide, which is a most common process in AML 12.

Full table of cosine similarities against validated signatures are also returned, which can be further analysed. Below plot shows comparison of similarities of detected signatures against validated signatures.

require('pheatmap')

## Loading required package: pheatmap

pheatmap::pheatmap(mat = laml.sign$coSineSimMat, cluster_rows = FALSE, main = "cosine similarity against validated signatures")

NOTE: Should you receive an error while running extractSignatures complaining none of the packages are loaded, please manually load the “NMF” library and re-run.

7.8.4 Plotting Signature contributions and sample stratification

plotSignatures also plots signature contribution as a barplot.

plotSignatures(nmfRes = laml.sign, contributions = TRUE)

Alternatively, you can perform PCA to stratify samples based on signature contribution. Once again note that, AML is not an ideal cohort for signature analysis, and this should be interpreted carefully.

nsig = ncol(laml.sign$signatures) #number of signatures identified
contrib = t(laml.sign$contributions) #contribution of signatures in every sample

require('ggfortify')

## Loading required package: ggfortify

## Loading required package: ggplot2

contrib.gg = autoplot(kmeans(contrib, nsig), data = contrib, frame = TRUE, frame.type = 'norm', loadings = TRUE, loadings.label = TRUE, loadings.label.size = 4, loadings.colour = 'blue')+cowplot::theme_cowplot(line_size = 1)+cowplot::background_grid(major = 'xy')

print(contrib.gg)

8 Variant Annotations

8.1 Annotating variants using Oncotator

We can also annotate variants using oncotator API 13. oncotate function quires oncotator web api to annotate given set of variants and converts them into MAF format. Input should be a five column file with chr, start, end, ref_allele, alt_allele. However, it can contain other information such as sample names (Tumor_Sample_Barcode), read counts, vaf information and so on, but only first five columns will be used, rest of the columns will be attached at the end of the table.

var.file = system.file('extdata', 'variants.tsv', package = 'maftools')
#This is what input looks like
var = read.delim(var.file, sep = '\t')
head(var)

##   chromsome    start      end     ref alt Tumor_Sample_Barcode
## 1      chr4 55589774 55589774       A   G               fake_1
## 2      chr4 55599321 55599321       A   T               fake_2
## 3      chr4 55599332 55599332       G   T               fake_3
## 4      chr4 55599320 55599320       G   T               fake_4
## 5     chr15 41961117 41961123 TGGCTAA   -               fake_4
## 6      chr4 55599320 55599320       G   T               fake_5

#Annotate 
var.maf = oncotate(maflite = var.file, header = TRUE)

#Results from oncotate. First 20 columns.
var.maf[1:10, 1:20, with = FALSE]

NOTE: This is quite time consuming if input is big.

8.2 Converting annovar output to MAF

Annovar is one of the most widely used Variant Annotation tool in Genomics 14. Annovar output is generally in a tabular format with various annotation columns. This function converts such annovar output files into MAF. This function requires that annovar was run with gene based annotation as a first operation, before including any filter or region based annotations.

e.g, table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol (refGene),cytoBand,dbnsfp30a -operation (g),r,f -nastring NA

annovarToMaf mainly uses gene based annotations for processing, rest of the annotation columns from input file will be attached to the end of the resulting MAF.

As an example we will annotate the same file which was used above to run oncotate function. We will annotate it using annovar with the following command. For simplicity, here we are including only gene based annotations but one can include as many annotations as they wish. But make sure the fist operation is always gene based annotation.

$perl table_annovar.pl variants.tsv ~/path/to/humandb/ -buildver hg19 
-out variants --otherinfo -remove -protocol ensGene -operation g -nastring NA

Output generated is stored as a part of this package. We can convert this annovar output into MAF using annovarToMaf.

var.annovar = system.file("extdata", "variants.hg19_multianno.txt", package = "maftools")
var.annovar.maf = annovarToMaf(annovar = var.annovar, Center = 'CSI-NUS', refBuild = 'hg19', 
                               tsbCol = 'Tumor_Sample_Barcode', table = 'ensGene')

## Converting Ensemble Gene IDs into HGNC gene symbols.

## Done! Original ensemble gene IDs are preserved under field name ens_id

##     Hugo_Symbol Entrez_Gene_Id  Center NCBI_Build Chromosome
##  1:         KIT             NA CSI-NUS       hg19       chr4
##  2:         KIT             NA CSI-NUS       hg19       chr4
##  3:         KIT             NA CSI-NUS       hg19       chr4
##  4:         KIT             NA CSI-NUS       hg19       chr4
##  5:         KIT             NA CSI-NUS       hg19       chr4
##  6:         KIT             NA CSI-NUS       hg19       chr4
##  7:         KIT             NA CSI-NUS       hg19       chr4
##  8:         MGA             NA CSI-NUS       hg19      chr15
##  9:         MGA             NA CSI-NUS       hg19      chr15
## 10:         MGA             NA CSI-NUS       hg19      chr15
##     Start_Position End_Position Strand Variant_Classification Variant_Type
##  1:       55589774     55589774      +      Missense_Mutation          SNP
##  2:       55599321     55599321      +      Missense_Mutation          SNP
##  3:       55599332     55599332      +      Missense_Mutation          SNP
##  4:       55599320     55599320      +      Missense_Mutation          SNP
##  5:       55599320     55599320      +      Missense_Mutation          SNP
##  6:       55599321     55599321      +      Missense_Mutation          SNP
##  7:       55599320     55599320      +      Missense_Mutation          SNP
##  8:       41989106     41989106      +        Frame_Shift_Ins          INS
##  9:       41961117     41961123      +        Frame_Shift_Del          DEL
## 10:       41989106     41989106      +        Frame_Shift_Ins          INS
##     Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS
##  1:                A                 A                 G       NA
##  2:                A                 A                 T       NA
##  3:                G                 G                 T       NA
##  4:                G                 G                 T       NA
##  5:                G                 G                 T       NA
##  6:                A                 A                 T       NA
##  7:                G                 G                 C       NA
##  8:                -                 -          TAAAGGGC       NA
##  9:          TGGCTAA           TGGCTAA                 -       NA
## 10:                -                 -          TAAAGGGC       NA
##     Tumor_Sample_Barcode Mutation_Status AAChange   Transcript_Id
##  1:               fake_1         Somatic  p.D419G ENST00000412167
##  2:               fake_2         Somatic  p.D812V ENST00000412167
##  3:               fake_3         Somatic  p.D816Y ENST00000412167
##  4:               fake_4         Somatic  p.D812Y ENST00000412167
##  5:               fake_5         Somatic  p.D812Y ENST00000412167
##  6:               fake_6         Somatic  p.D812V ENST00000412167
##  7:               fake_7         Somatic  p.D812H ENST00000412167
##  8:               fake_7         Somatic p.G633fs ENST00000566718
##  9:               fake_4         Somatic   p.L9fs ENST00000566718
## 10:               fake_5         Somatic p.G633fs ENST00000566718
##                   TxChange GeneDetail.refGene hgnc_symbol Entrez
##  1:               c.A1256G                 NA         KIT     NA
##  2:               c.A2435T                 NA         KIT     NA
##  3:               c.G2446T                 NA         KIT     NA
##  4:               c.G2434T                 NA         KIT     NA
##  5:               c.G2434T                 NA         KIT     NA
##  6:               c.A2435T                 NA         KIT     NA
##  7:               c.G2434C                 NA         KIT     NA
##  8: c.1898_1899insTAAAGGGC                 NA         MGA     NA
##  9:             c.25_31del                 NA         MGA     NA
## 10: c.1898_1899insTAAAGGGC                 NA         MGA     NA
##              ens_id
##  1: ENSG00000157404
##  2: ENSG00000157404
##  3: ENSG00000157404
##  4: ENSG00000157404
##  5: ENSG00000157404
##  6: ENSG00000157404
##  7: ENSG00000157404
##  8: ENSG00000174197
##  9: ENSG00000174197
## 10: ENSG00000174197

Annovar, when used with Ensemble as a gene annotation source, uses ensemble gene IDs as Gene names. In that case, use annovarToMaf with argument table set to ensGene which converts ensemble gene IDs into HGNC symbols.

8.3 Converting ICGC Simple Somatic Mutation Format to MAF

Just like TCGA, International Cancer Genome Consortium ICGC also makes its data publicly available. But the data are stored in Simpale Somatic Mutation Format, which is similar to MAF format in its structure. However field names and classification of variants is different from that of MAF. icgcSimpleMutationToMAF is a function which reads ICGC data and converts them to MAF.

#Read sample ICGC data for ESCA
esca.icgc <- system.file("extdata", "simple_somatic_mutation.open.ESCA-CN.sample.tsv.gz", package = "maftools")
esca.maf <- icgcSimpleMutationToMAF(icgc = esca.icgc, addHugoSymbol = TRUE)

## Converting Ensemble Gene IDs into HGNC gene symbols.

## Done! Original ensemble gene IDs are preserved under field name ens_id

## NOTE: Removed 427 duplicated variants

#Printing first 16 columns for display convenience.
print(esca.maf[1:5,1:16, with = FALSE])

##    Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
## 1:  AC005237.4             NA     NA     GRCh37          2      241987787
## 2:  AC061992.1            786     NA     GRCh37         17       76425382
## 3:  AC097467.2             NA     NA     GRCh37          4      156294567
## 4:    ADAMTS12             NA     NA     GRCh37          5       33684019
## 5:  AL589642.1          54801     NA     GRCh37          9       32630154
##    End_Position Strand Variant_Classification Variant_Type
## 1:    241987787      +                 Intron          SNP
## 2:     76425382      +                3'Flank          SNP
## 3:    156294567      +                 Intron          SNP
## 4:     33684019      +      Missense_Mutation          SNP
## 5:     32630154      +                    RNA          SNP
##    Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS
## 1:                C                 C                 T       NA
## 2:                C                 C                 T       NA
## 3:                A                 A                 G       NA
## 4:                A                 A                 C       NA
## 5:                T                 T                 C       NA
##    dbSNP_Val_Status Tumor_Sample_Barcode
## 1:               NA             SA514619
## 2:               NA             SA514619
## 3:               NA             SA514619
## 4:               NA             SA514619
## 5:               NA             SA514619

Note that by default Simple Somatic Mutation format contains all affected transcripts of a variant resulting in multiple entries of the same variant in same sample. It is hard to choose a single affected transcript based on annotations alone and by default this program removes repeated variants as duplicated entries. If you wish to keep all of them, set removeDuplicatedVariants to FALSE. Another option is to convert input file to MAF by removing duplicates and then use scripts like vcf2maf to re-annotate and prioritize affected transcripts.

8.4 Prepare MAF file for MutSigCV analysis

MutSig/MutSigCV is most widely used program for detecting driver genes 16. However, we have observed that covariates files (gene.covariates.txt and exome_full192.coverage.txt) which are bundled with MutSig have non-standard gene names (non Hugo_Symbols). This discrepancy between Hugo_Symbols in MAF and non-Hugo_symbols in covariates file causes MutSig program to ignore such genes. For example, KMT2D - a well known driver gene in Esophageal Carcinoma is represented as MLL2 in MutSig covariates. This causes KMT2D to be ignored from analysis and is represented as an insignificant gene in MutSig results. This function attempts to correct such gene symbols with a manually curated list of gene names compatible with MutSig covariates list.

laml.mutsig.corrected = prepareMutSig(maf = laml)
# Converting gene names for 1 variants from 1 genes
#    Hugo_Symbol MutSig_Synonym N
# 1:    ARHGAP35          GRLF1 1
# Original symbols are preserved under column OG_Hugo_Symbol.

9 Set operations

9.1 Subsetting MAF

We can also subset MAF using function subsetMaf

##Extract data for samples 'TCGA.AB.3009' and 'TCGA.AB.2933'  (Printing just 5 rows for display convenience)
subsetMaf(maf = laml, tsb = c('TCGA-AB-3009', 'TCGA-AB-2933'))[1:5]

##    Hugo_Symbol Entrez_Gene_Id           Center NCBI_Build Chromosome
## 1:      ABCB11           8647 genome.wustl.edu         37          2
## 2:       ACSS3          79611 genome.wustl.edu         37         12
## 3:        ANK3            288 genome.wustl.edu         37         10
## 4:       AP1G2           8906 genome.wustl.edu         37         14
## 5:         ARC          23237 genome.wustl.edu         37          8
##    Start_Position End_Position Strand Variant_Classification Variant_Type
## 1:      169780250    169780250      +      Missense_Mutation          SNP
## 2:       81536902     81536902      +      Missense_Mutation          SNP
## 3:       61926581     61926581      +            Splice_Site          SNP
## 4:       24033309     24033309      +      Missense_Mutation          SNP
## 5:      143694874    143694874      +      Missense_Mutation          SNP
##    Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2
## 1:                G                 G                 A
## 2:                C                 C                 T
## 3:                C                 C                 A
## 4:                C                 C                 T
## 5:                C                 C                 G
##    Tumor_Sample_Barcode Protein_Change i_TumorVAF_WU i_transcript_name
## 1:         TCGA-AB-3009       p.A1283V      46.97218       NM_003742.2
## 2:         TCGA-AB-3009        p.A266V      47.32510       NM_024560.2
## 3:         TCGA-AB-3009                     43.99478       NM_020987.2
## 4:         TCGA-AB-3009        p.R346Q      47.08000       NM_003917.2
## 5:         TCGA-AB-3009        p.W253C      42.96435       NM_015193.3

##Same as above but return output as an MAF object
subsetMaf(maf = laml, tsb = c('TCGA-AB-3009', 'TCGA-AB-2933'), mafObj = TRUE)

## An object of class  MAF 
##                    ID          summary Mean Median
##  1:        NCBI_Build               37   NA     NA
##  2:            Center genome.wustl.edu   NA     NA
##  3:           Samples                2   NA     NA
##  4:            nGenes               34   NA     NA
##  5:   Frame_Shift_Ins                5  2.5    2.5
##  6:      In_Frame_Ins                1  0.5    0.5
##  7: Missense_Mutation               26 13.0   13.0
##  8: Nonsense_Mutation                2  1.0    1.0
##  9:       Splice_Site                1  0.5    0.5
## 10:             total               35 17.5   17.5

9.1.1 Specifying queries and controlling output fields.

##Select all Splice_Site mutations from DNMT3A and NPM1
subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), query = "Variant_Classification == 'Splice_Site'")

##    Hugo_Symbol Entrez_Gene_Id           Center NCBI_Build Chromosome
## 1:      DNMT3A           1788 genome.wustl.edu         37          2
## 2:      DNMT3A           1788 genome.wustl.edu         37          2
## 3:      DNMT3A           1788 genome.wustl.edu         37          2
## 4:      DNMT3A           1788 genome.wustl.edu         37          2
## 5:      DNMT3A           1788 genome.wustl.edu         37          2
## 6:      DNMT3A           1788 genome.wustl.edu         37          2
##    Start_Position End_Position Strand Variant_Classification Variant_Type
## 1:       25459874     25459874      +            Splice_Site          SNP
## 2:       25467208     25467208      +            Splice_Site          SNP
## 3:       25467022     25467022      +            Splice_Site          SNP
## 4:       25459803     25459803      +            Splice_Site          SNP
## 5:       25464576     25464576      +            Splice_Site          SNP
## 6:       25469029     25469029      +            Splice_Site          SNP
##    Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2
## 1:                C                 C                 G
## 2:                C                 C                 T
## 3:                A                 A                 G
## 4:                A                 A                 C
## 5:                C                 C                 A
## 6:                C                 C                 A
##    Tumor_Sample_Barcode Protein_Change i_TumorVAF_WU i_transcript_name
## 1:         TCGA-AB-2818        p.R803S         36.84       NM_022552.3
## 2:         TCGA-AB-2822                        62.96       NM_022552.3
## 3:         TCGA-AB-2891                        34.78       NM_022552.3
## 4:         TCGA-AB-2898                        46.43       NM_022552.3
## 5:         TCGA-AB-2934        p.G646V         37.50       NM_022552.3
## 6:         TCGA-AB-2968        p.E477*         40.00       NM_022552.3

##Same as above but include only 'i_transcript_name' column in the output
subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), query = "Variant_Classification == 'Splice_Site'", fields = 'i_transcript_name')

##    Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele
## 1:      DNMT3A          2       25459874     25459874                C
## 2:      DNMT3A          2       25467208     25467208                C
## 3:      DNMT3A          2       25467022     25467022                A
## 4:      DNMT3A          2       25459803     25459803                A
## 5:      DNMT3A          2       25464576     25464576                C
## 6:      DNMT3A          2       25469029     25469029                C
##    Tumor_Seq_Allele2 Variant_Classification Variant_Type
## 1:                 G            Splice_Site          SNP
## 2:                 T            Splice_Site          SNP
## 3:                 G            Splice_Site          SNP
## 4:                 C            Splice_Site          SNP
## 5:                 A            Splice_Site          SNP
## 6:                 A            Splice_Site          SNP
##    Tumor_Sample_Barcode i_transcript_name
## 1:         TCGA-AB-2818       NM_022552.3
## 2:         TCGA-AB-2822       NM_022552.3
## 3:         TCGA-AB-2891       NM_022552.3
## 4:         TCGA-AB-2898       NM_022552.3
## 5:         TCGA-AB-2934       NM_022552.3
## 6:         TCGA-AB-2968       NM_022552.3

10 Pre-compiled TCGA MAF objects

There is also an R data package containing pre-compiled TCGA MAF objects, particularly helpful for those working with TCGA mutation data. Every dataset is stored as an MAF object containing somatic mutations along with clinical information. Due to Bioconductor package size limits and other difficulties, this was not submitted to Bioconductor. However, you can still download TCGAmutations package from GitHub.

devtools::install_github(repo = "PoisonAlien/TCGAmutations")

11 References

Cancer Genome Atlas Research, N. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med 368, 2059-74 (2013).
Gu, Z., Eils, R. & Schlesner, M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics (2016).
Mermel, C.H. et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol 12, R41 (2011).
Olshen, A.B., Venkatraman, E.S., Lucito, R. & Wigler, M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557-72 (2004).
Alexandrov, L.B. et al. Signatures of mutational processes in human cancer. Nature 500, 415-21 (2013).
Leiserson, M.D., Wu, H.T., Vandin, F. & Raphael, B.J. CoMEt: a statistical approach to identify combinations of mutually exclusive alterations in cancer. Genome Biol 16, 160 (2015).
Tamborero, D., Gonzalez-Perez, A. & Lopez-Bigas, N. OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes. Bioinformatics 29, 2238-44 (2013).
Dees, N.D. et al. MuSiC: identifying mutational significance in cancer genomes. Genome Res 22, 1589-98 (2012).
Lawrence MS, Stojanov P, Mermel CH, Robinson JT, Garraway LA, Golub TR, Meyerson M, Gabriel SB, Lander ES, Getz G: Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 2014, 505:495-501.
Madan, V. et al. Comprehensive mutational analysis of primary and relapse acute promyelocytic leukemia. Leukemia 30, 1672-81 (2016).
Mroz, E.A. & Rocco, J.W. MATH, a novel measure of intratumor genetic heterogeneity, is high in poor-outcome classes of head and neck squamous cell carcinoma. Oral Oncol 49, 211-5 (2013).
Roberts SA, Lawrence MS, Klimczak LJ, et al. An APOBEC Cytidine Deaminase Mutagenesis Pattern is Widespread in Human Cancers. Nature genetics. 2013;45(9):970-976.
Gaujoux, R. & Seoighe, C. A flexible R package for nonnegative matrix factorization. BMC Bioinformatics 11, 367 (2010).
Welch, J.S. et al. The origin and evolution of mutations in acute myeloid leukemia. Cell 150, 264-78 (2012).
Ramos, A.H. et al. Oncotator: cancer variant annotation tool. Hum Mutat 36, E2423-9 (2015).
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38, e164 (2010).
Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al: Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013, 499:214-218.

12 Session Info

sessionInfo()

## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ggfortify_0.4.3     ggplot2_2.2.1       pheatmap_1.0.8     
## [4] maftools_1.4.28     bigmemory_4.5.33    Biobase_2.38.0     
## [7] BiocGenerics_0.24.0
## 
## loaded via a namespace (and not attached):
##   [1] bitops_1.0-6               matrixStats_0.53.1        
##   [3] bit64_0.9-7                doParallel_1.0.11         
##   [5] RColorBrewer_1.1-2         progress_1.1.2            
##   [7] httr_1.3.1                 rprojroot_1.3-2           
##   [9] GenomeInfoDb_1.14.0        tools_3.4.4               
##  [11] backports_1.1.2            R6_2.2.2                  
##  [13] DBI_0.8                    lazyeval_0.2.1            
##  [15] colorspace_1.3-2           GetoptLong_0.1.6          
##  [17] gridExtra_2.3              prettyunits_1.0.2         
##  [19] RMySQL_0.10.14             bit_1.1-12                
##  [21] compiler_3.4.4             DelayedArray_0.4.1        
##  [23] pkgmaker_0.22              labeling_0.3              
##  [25] slam_0.1-42                rtracklayer_1.38.3        
##  [27] scales_0.5.0               NMF_0.21.0                
##  [29] stringr_1.3.0              digest_0.6.15             
##  [31] Rsamtools_1.30.0           rmarkdown_1.9             
##  [33] cometExactTest_0.1.5       XVector_0.18.0            
##  [35] pkgconfig_2.0.1            htmltools_0.3.6           
##  [37] changepoint_2.2.2          BSgenome_1.46.0           
##  [39] rlang_0.2.0                GlobalOptions_0.0.13      
##  [41] RSQLite_2.1.0              bindr_0.1.1               
##  [43] shape_1.4.4                zoo_1.8-1                 
##  [45] mclust_5.4                 BiocParallel_1.12.0       
##  [47] dplyr_0.7.4                VariantAnnotation_1.24.5  
##  [49] RCurl_1.95-4.10            magrittr_1.5              
##  [51] GenomeInfoDbData_1.0.0     wordcloud_2.5             
##  [53] Matrix_1.2-14              Rcpp_0.12.16              
##  [55] munsell_0.4.3              S4Vectors_0.16.0          
##  [57] stringi_1.1.7              yaml_2.1.18               
##  [59] SummarizedExperiment_1.8.1 zlibbioc_1.24.0           
##  [61] plyr_1.8.4                 grid_3.4.4                
##  [63] blob_1.1.1                 ggrepel_0.7.0             
##  [65] bigmemory.sri_0.1.3        lattice_0.20-35           
##  [67] Biostrings_2.46.0          cowplot_0.9.2             
##  [69] splines_3.4.4              GenomicFeatures_1.30.3    
##  [71] circlize_0.4.3             knitr_1.20                
##  [73] ComplexHeatmap_1.17.1      pillar_1.2.1              
##  [75] GenomicRanges_1.30.3       rjson_0.2.15              
##  [77] rngtools_1.2.4             reshape2_1.4.3            
##  [79] codetools_0.2-15           biomaRt_2.34.2            
##  [81] stats4_3.4.4               glue_1.2.0                
##  [83] XML_3.98-1.10              evaluate_0.10.1           
##  [85] data.table_1.10.4-3        foreach_1.4.4             
##  [87] purrr_0.2.4                tidyr_0.8.0               
##  [89] gtable_0.2.0               assertthat_0.2.0          
##  [91] gridBase_0.4-7             xtable_1.8-2              
##  [93] survival_2.41-3            tibble_1.4.2              
##  [95] iterators_1.0.9            GenomicAlignments_1.14.2  
##  [97] AnnotationDbi_1.40.0       registry_0.5              
##  [99] memoise_1.1.0              IRanges_2.12.0            
## [101] bindrcpp_0.2.2             cluster_2.0.7-1

13 Support and acknowledgements

13.1 Github

If you have any issues, bug reports or feature requests please feel free to raise an issue on GitHub page.

13.2 Acknowledgements

Thanks to Ryan Morin for crucial bug fixes
Thanks to Peter Ellis for beautiful markdown template
somaticInteractions plot is inspired from the article Combining gene mutation with gene expression data improves outcome prediction in myelodysplastic syndromes. Thanks to authors for making source code available.

maftools : Summarize, Analyze and Visualize MAF Files

Anand Mayakonda

2018-04-10