1 Overview

Small variants within the genome (single nucleotide variants, insertions, and deletions) are a critical component in the basis for genetic diseases. Identifying and summarizing these variants is often a first step in developing hypotheses regarding the role of these events in disease genesis and progression. The waterfall function is designed to efficiently summarize “small variant” (SNV/indel) information at a cohort level. It is useful for obtaining a broad sense of the types of variants observed in a cohort. Further, waterfall will give a sense of the mutation burden, recurrently mutated genes, the mutual exclusivity or co-occurrence between genes, and the relation of variants to clinical data.

The purpose of this vignette is to display the many features of the waterfall function in order to give an in-depth view of its parameters and functionality. For these examples the data frame brcaMAF, originating from a truncated .maf file from TCGA and available within GenVisR, will be used unless otherwise stated. Further, for reproducibility the seed for all examples has been set to 426.

1.1 Functionality

1.1.1 Loading primary input

Parameters covered: fileType, variant_class_order

For basic use a user only needs to read a file of the proper type into R as a data frame and then supply this data frame to the waterfall function as the argument x. By default the data frame supplied is expected to correspond to a file in .maf (version 2.4) format (see below for additional supported formats). This data frame should have, at a minimum, the column names “Tumor_Sample_Barcode”, “Hugo_Symbol”, and “Variant_Classification”, and contain rows corresponding to mutation events. Any value is permissible in the “Tumor_Sample_Barcode” and “Hugo_Symbol” columns, which correspond to a sample name and gene name respectively, but specific values are expected in the “Variant_Classification” column (see table below). This is because waterfall is only capable of displaying a single variant type in the main plot for a cell (i.e. gene/sample). To achieve this, waterfall will plot the most deleterious variant based on a hierarchy predefined for a .maf file. This hierarchy follows the order, from top to bottom, of the legend output with the plot.

# Load the GenVisR package
library(GenVisR)

# Plot with the MAF file type specified (default). The mainRecurCutoff
# parameter is described in the next section
waterfall(brcaMAF, fileType = "MAF", mainRecurCutoff = 0.05)
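
Before plotting, it can help to confirm that a data frame carries the three columns described above. The following is a minimal base R sketch, not part of GenVisR: the names `toyMAF`, `required_cols`, and `missing_cols` are hypothetical, and the toy values are invented for illustration rather than taken from brcaMAF.

```r
# Hypothetical toy data frame with the three columns expected for
# fileType = "MAF"; the values are invented for illustration
toyMAF <- data.frame(
    Tumor_Sample_Barcode = c("sample_1", "sample_1", "sample_2"),
    Hugo_Symbol = c("TP53", "PIK3CA", "TP53"),
    Variant_Classification = c("Missense_Mutation", "Silent", "Nonsense_Mutation"),
    stringsAsFactors = FALSE
)

# Column names required by the default MAF file type
required_cols <- c("Tumor_Sample_Barcode", "Hugo_Symbol", "Variant_Classification")

# Stop with an informative message if any required column is absent
missing_cols <- setdiff(required_cols, colnames(toyMAF))
if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}
```

A data frame that passes this check has the minimum structure described above and could then be supplied to waterfall as x.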

The user can supply additional file types to waterfall if desired, via the fileType parameter. For example, to plot an annotation file from the Genome Modeling System the user would simply set fileType to “MGI” and supply the corresponding file as the argument x. As with the .maf file, a predefined hierarchy is used to plot the most deleterious mutation in cases where there are multiple mutations in the same gene/sample (see table below).

# read in a file from the genome modeling system
file <- read.delim("file.anno.tsv")

# Plot the variant information via waterfall
waterfall(file, fileType="MGI")

waterfall is also capable of plotting small variant information from a non-standard or unsupported file type. To do this the user should set the fileType parameter to “Custom” and supply as the argument x a data frame with the columns “sample”, “gene”, and “variant_class”, corresponding to the sample, gene, and variant type respectively. Further, the user is required to define which variants are considered most deleterious via the parameter variant_class_order, for cases where there are multiple mutations in the same gene/sample. This should take the form of a character vector with values corresponding to the unique values in the column “variant_class”, ordered from most to least deleterious. As with the previous two examples, the most deleterious mutation will be plotted. The variant_class_order parameter can be used to change the mutational hierarchy for the previous file types as well.

# make sure seed is set to 426 to reproduce!
set.seed(426)

# Create a data frame of random elements to plot
inputData <- data.frame(sample = sample(letters[1:5], 20, replace = TRUE), gene = sample(letters[1:5], 
    20, replace = TRUE), variant_class = sample(c("x", "y", "z"), 20, replace = TRUE))

# choose the most deleterious to plot with y being defined as the most
# deleterious
most_deleterious <- c("y", "z", "x")

# plot the data with waterfall using the 'Custom' parameter
waterfall(inputData, fileType = "Custom", variant_class_order = most_deleterious, 
    mainXlabel = TRUE)

# change the most deleterious order
waterfall(inputData, fileType = "Custom", variant_class_order = rev(most_deleterious), 
    mainXlabel = TRUE)

Figure 1: In cell e/e (second row/first column) two variants are present, ‘z’ and ‘x’. In the first plot variant ‘z’ is considered more deleterious (top panel); in the second plot variant ‘x’ is considered more deleterious (bottom panel).
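
The effect of variant_class_order on a cell with several variants can be illustrated in base R. The sketch below only mirrors the behaviour described above and is not GenVisR's internal implementation; the function `pick_most_deleterious` and the data frame `toy` are hypothetical names introduced for illustration.

```r
# Hypothetical input: cell e/e holds two variants, 'z' and 'x'
toy <- data.frame(
    sample = c("e", "e", "a"),
    gene = c("e", "e", "b"),
    variant_class = c("z", "x", "y"),
    stringsAsFactors = FALSE
)

# Keep one row per sample/gene cell, choosing the variant class that
# appears earliest in 'hierarchy' (i.e. the most deleterious)
pick_most_deleterious <- function(data, hierarchy) {
    # Position of each variant class within the hierarchy
    rank <- match(data$variant_class, hierarchy)
    # Sort so the most deleterious variant in each cell comes first
    data <- data[order(rank), ]
    # Drop every later (less deleterious) row for the same cell
    data[!duplicated(data[, c("sample", "gene")]), ]
}

most_deleterious <- c("y", "z", "x")
first <- pick_most_deleterious(toy, most_deleterious)
second <- pick_most_deleterious(toy, rev(most_deleterious))
```

With the original order, cell e/e retains ‘z’; with the reversed order it retains ‘x’, matching the two panels of Figure 1.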

MAF                      MGI
Nonsense_Mutation        nonsense
Frame_Shift_Ins          frame_shift_del
Frame_Shift_Del          frame_shift_ins
Translation_Start_Site   splice_site_del
Splice_Site              splice_site_ins
Nonstop_Mutation         splice_site
In_Frame_Ins             nonstop
In_Frame_Del             in_frame_del
Missense_Mutation        in_frame_ins
5'Flank                  missense
3'Flank                  splice_region_del
5'UTR                    splice_region_ins
3'UTR                    splice_region
RNA                      5_prime_flanking_region
Intron                   3_prime_flanking_region
IGR                      3_prime_untranslated_region
Silent                   5_prime_untranslated_region
Targeted_Region          rna

1.1.2 Filtering options

Parameters covered: mainRecurCutoff, plotGenes, plotSamples, maxGenes, rmvSilent

Often the input supplied to the waterfall function will contain thousands of genes and hundreds of samples. While waterfall can handle such scenarios, the graphics device would need to be enlarged to such a degree that the visualization may become unwieldy (see tips). To alleviate such issues waterfall provides a suite of filtering parameters to visualize the data of most interest to the user. The first of these, mainRecurCutoff, accepts a numeric value between 0 and 1 and will only plot genes with mutations in at least that proportion of samples.

# Plot the genes with mutations in >= 20% of samples
waterfall(brcaMAF, fileType = "MAF", mainRecurCutoff = 0.2)
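
To make the recurrence cutoff concrete, the following base R sketch computes the proportion of samples in which each gene is mutated and keeps the genes at or above a cutoff. It is an illustration of the idea only, not GenVisR's implementation: the function `recurrent_genes` and the data frame `toy` are hypothetical, and it assumes the proportion is taken over all samples present in the input.

```r
# Hypothetical toy input: 5 samples; gene A is mutated in 3 of them,
# gene B in 1 (twice in the same sample), and gene C in 2
toy <- data.frame(
    sample = c("s1", "s2", "s3", "s3", "s3", "s4", "s5"),
    gene = c("A", "A", "A", "B", "B", "C", "C"),
    stringsAsFactors = FALSE
)

# Return the genes mutated in at least 'cutoff' proportion of samples
recurrent_genes <- function(data, cutoff) {
    # Count each sample/gene pair once, even with multiple mutations
    cells <- unique(data[, c("sample", "gene")])
    n_samples <- length(unique(data$sample))
    counts <- table(cells$gene)
    # Proportion of samples carrying a mutation in each gene
    prop <- setNames(as.numeric(counts) / n_samples, names(counts))
    names(prop)[prop >= cutoff]
}

# Keep only genes mutated in at least half of the samples
keep <- recurrent_genes(toy, cutoff = 0.5)
filtered <- toy[toy$gene %in% keep, ]
```

Here only gene A (3 of 5 samples) passes a cutoff of 0.5, so `filtered` retains just the rows for that gene.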