aDepartment of Biology and Microbiology, South Dakota State University, Brookings, SD, USA; b,cBioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA; dPopulation Health Group at Sanford Research, Sioux Falls, SD, USA; eBioSNTR, Brookings, SD, USA.
The latest version of ViDGER can be installed via GitHub using the devtools package. If you do not have devtools installed in your R environment, you will need to obtain it through the install.packages() function (see commented code):
# Install from GitHub
# if (!require("devtools")) install.packages("devtools")
# devtools::install_github("btmonier/vidger")
# Load vidger
require(vidger)
Once installed, you will have access to the following functions:
vsBoxPlot()
vsScatterPlot()
vsScatterMatrix()
vsDEGMatrix()
vsMAPlot()
vsMAMatrix()
vsVolcano()
vsVolcanoMatrix()
vsFourWay()
Further explanation of how these functions work is given later in the documentation. For the following examples, three toy data sets will be used: df.cuff, df.deseq, and df.edger. Each of these data sets corresponds to one of the three RNA-seq analyses this package covers. They can be loaded into the R workspace with the following command:
data(<data_set>)
Where <data_set> is one of the previously mentioned data sets. Two arguments recur across these functions: type and d.factor. The type argument tells the function how to process the data for each analytical type (i.e. "cuffdiff", "deseq", or "edger"). The d.factor argument is used specifically for DESeq2 objects, which we will discuss in the DESeq2 section. All other arguments are described in the help file for each function (e.g. ?vsScatterPlot).
As mentioned earlier, three toy data sets are included with this package. In addition to these data sets, five “real-world” data sets were also used. All real-world data are currently unpublished and come from ongoing collaborations. Summaries of these data can be found in the following tables:
Table 1: An overview of the toy data sets included in this package. In this table, each data set is summarized in terms of what analytical software was used, organism ID, experimental layout (replicates and treatments), number of transcripts (IDs), and size of the data object in terms of megabytes (MB).
Data | Software | Organism | Reps | Treat. | IDs | Size (MB) |
---|---|---|---|---|---|---|
df.cuff | CuffDiff | H. sapiens | 2 | 3 | 1200 | 0.2 |
df.deseq | DESeq2 | D. melanogaster | 2 | 3 | 29391 | 2.3 |
df.edger | edgeR | A. thaliana | 2 | 3 | 724 | 0.1 |
Table 2: “Real-world” (RW) data set statistics. To test the reliability of our package, real data was used from human collections and several plant samples. Each data set is summarized in terms of organism ID, number of experimental samples (n), experimental conditions, and number of transcripts (IDs).
Data | Organism | n | Exp. Conditions | IDs |
---|---|---|---|---|
RW-1 | H. sapiens | 10 | Two treatment dosages taken at two time points and one control sample taken at one time point | 198002 |
RW-2 | M. domestica | 24 | Two phenotypes taken at four time points (three replicates each) | 63517 |
RW-3 | V. riparia: bud | 6 | Two conditions (three replicates each) | 59262 |
RW-4 | V. riparia: shoot-tip (7 days) | 6 | Two conditions (three replicates each) | 17962 |
RW-5 | V. riparia: shoot-tip (21 days) | 6 | Two conditions (three replicates each) | 19064 |
Box plots are a useful way to examine the distribution of data. Here, the distribution of FPKM or CPM values can be examined with the vsBoxPlot() function. This function extracts the necessary results-based data from analytical objects and creates a box plot comparing \(\log_{10}\) (FPKM or CPM) distributions across experimental treatments.
vsBoxPlot(
data = df.cuff, d.factor = NULL, type = 'cuffdiff', title = TRUE,
legend = TRUE, grid = TRUE
)
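To make the transformation behind this plot concrete, the following base-R sketch builds a small, invented FPKM table and summarizes the log-transformed values per treatment. The treatment labels and values are made up for illustration and are not taken from df.cuff. The pseudocount of 1 is an assumption used here to avoid taking the logarithm of zero; it is not necessarily what vsBoxPlot() does internally.

```r
# Hypothetical FPKM values for three treatments (not taken from df.cuff)
set.seed(1)
fpkm <- data.frame(
  treatment = rep(c("A", "B", "C"), each = 50),
  value     = c(rlnorm(50, 2), rlnorm(50, 3), rlnorm(50, 4))
)

# Assumed transformation: log10 with a pseudocount of 1 to avoid log10(0)
fpkm$log10_value <- log10(fpkm$value + 1)

# Box plot statistics per treatment; plot = FALSE returns the numbers
# without opening a graphics device
bp <- boxplot(log10_value ~ treatment, data = fpkm, plot = FALSE)
medians <- setNames(bp$stats[3, ], bp$names)
print(round(medians, 2))

# Draw the box plot only in an interactive session
if (interactive()) {
  boxplot(log10_value ~ treatment, data = fpkm,
          ylab = "log10(FPKM + 1)")
}
```

The third row of bp$stats holds the median of each group, which is the thick line drawn inside each box.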
This example will look at a basic scatter plot function, vsScatterPlot(). This function allows you to visualize comparisons of \(\log_{10}\) values of either FPKM or CPM measurements of two treatments, depending on analytical type.
vsScatterPlot(
x = 'hESC', y = 'iPS', data = df.cuff, type = 'cuffdiff',
d.factor = NULL, title = TRUE, grid = TRUE
)
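As a rough illustration of what such a comparison involves, the sketch below log-transforms invented expression values for two treatments and computes their correlation. The column names reuse the treatment labels from the call above, but the values are simulated, and the pseudocount of 1 is an assumption rather than the package's documented behavior.

```r
# Hypothetical mean FPKM per gene for two treatments (values are simulated)
set.seed(42)
n_genes <- 200
base_expr <- rlnorm(n_genes, meanlog = 3)
expr <- data.frame(
  id   = paste0("gene", seq_len(n_genes)),
  hESC = base_expr * rlnorm(n_genes, sdlog = 0.3),
  iPS  = base_expr * rlnorm(n_genes, sdlog = 0.3)
)

# Assumed transformation: log10 with a pseudocount of 1
log_x <- log10(expr$hESC + 1)
log_y <- log10(expr$iPS + 1)

# Points near the diagonal y = x have similar expression in both treatments
pearson_r <- cor(log_x, log_y)
print(pearson_r)

# Draw the scatter plot only in an interactive session
if (interactive()) {
  plot(log_x, log_y, xlab = "log10(hESC + 1)", ylab = "log10(iPS + 1)")
  abline(0, 1, lty = 2)
}
```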
This example will look at an extension of the vsScatterPlot() function, vsScatterMatrix(). This function will create a matrix of all possible comparisons of treatments within an experiment, with additional information.
vsScatterMatrix(
data = df.cuff, d.factor = NULL, type = 'cuffdiff',
comp = NULL, title = TRUE, grid = TRUE, man.title = NULL
)
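To see how many panels "all possible comparisons" implies, the pairs can be enumerated in base R. This is only a sketch: it assumes the matrix is built from unordered pairs of treatments and uses the three treatment names that appear in the df.cuff examples of this document.

```r
# Treatment names taken from the examples in this document
treatments <- c("hESC", "iPS", "Fibroblasts")

# All unordered pairs of treatments: choose(3, 2) = 3 comparisons
pairs <- t(combn(treatments, 2))
colnames(pairs) <- c("x", "y")
print(pairs)
```

With k treatments this yields choose(k, 2) comparisons, so the number of panels grows quadratically with the number of treatments.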
The vsDEGMatrix() function allows the user to visualize the number of differentially expressed genes (DEGs) at a given adjusted p-value threshold (the padj argument) for each experimental treatment level. Higher color intensity corresponds to a higher number of DEGs.
vsDEGMatrix(
data = df.cuff, padj = 0.05, d.factor = NULL, type = 'cuffdiff',
title = TRUE, legend = TRUE, grid = TRUE
)
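The counting step behind such a matrix can be sketched in base R. The adjusted p-values below are random numbers generated purely for illustration, and the use of "less than or equal to" at the cutoff is an assumption; consult ?vsDEGMatrix for the package's actual behavior.

```r
# Hypothetical adjusted p-values for each pairwise comparison
set.seed(7)
treatments <- c("hESC", "iPS", "Fibroblasts")
pairs <- combn(treatments, 2, simplify = FALSE)
padj_by_pair <- lapply(pairs, function(p) runif(500))  # invented values

padj_cutoff <- 0.05

# Symmetric matrix holding the number of DEGs for each comparison;
# the diagonal stays NA because a treatment is not compared with itself
deg <- matrix(NA_integer_, nrow = 3, ncol = 3,
              dimnames = list(treatments, treatments))
for (i in seq_along(pairs)) {
  n_deg <- sum(padj_by_pair[[i]] <= padj_cutoff)
  deg[pairs[[i]][1], pairs[[i]][2]] <- n_deg
  deg[pairs[[i]][2], pairs[[i]][1]] <- n_deg
}
print(deg)
```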
vsMAPlot() visualizes the variance between two samples in terms of gene expression values, where logarithmic fold changes of count data are plotted against mean counts. For more information on how each aesthetic is plotted, please refer to the figure captions and Method S1.
vsMAPlot(
x = 'iPS', y = 'hESC', data = df.cuff, d.factor = NULL,
type = 'cuffdiff', padj = 0.05, y.lim = NULL, lfc = NULL,
title = TRUE, legend = TRUE, grid = TRUE
)
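The two axes of an MA plot can be computed by hand, as in the following sketch. The expression values are simulated, and the pseudocount and the log10 scaling of the mean are assumptions made for this illustration; they may differ from what vsMAPlot() computes.

```r
# Hypothetical mean expression values for two treatments
set.seed(3)
n_genes <- 300
x <- rlnorm(n_genes, meanlog = 4)      # stands in for 'iPS'
y <- x * 2^rnorm(n_genes, sd = 0.8)    # stands in for 'hESC'

# M: log2 fold change of y relative to x
# A: mean expression of the two samples on a log10 scale
# (pseudocount of 1 and axis scaling are assumptions of this sketch)
M <- log2((y + 1) / (x + 1))
A <- log10((x + y) / 2 + 1)

ma <- data.frame(A = A, M = M)
print(head(ma))

# Draw the MA plot only in an interactive session
if (interactive()) {
  plot(ma$A, ma$M, xlab = "A (mean expression)",
       ylab = "M (log2 fold change)")
  abline(h = 0, lty = 2)
}
```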
Similar to a scatter plot matrix, vsMAMatrix() will produce visualizations for all comparisons within your data set. For more information on how the aesthetics are plotted in these visualizations, please refer to the figure caption and Method S1.
vsMAMatrix(
data = df.cuff, d.factor = NULL, type = 'cuffdiff',
padj = 0.05, y.lim = NULL, lfc = 1, title = TRUE,
grid = TRUE, counts = TRUE, data.return = FALSE
)
The next few visualizations will focus on ways to display differential gene expression between two or more treatments. Volcano plots visualize the variance between two samples in terms of gene expression values, where the \(-\log_{10}\) of the calculated p-values (y-axis) is plotted against the \(\log_2\) fold changes (x-axis). These plots can be created with the vsVolcano() function. For more information on how each aesthetic is plotted, please refer to the figure captions and Method S1.
vsVolcano(
x = 'iPS', y = 'hESC', data = df.cuff, d.factor = NULL,
type = 'cuffdiff', padj = 0.05, x.lim = NULL, lfc = NULL,
title = TRUE, legend = TRUE, grid = TRUE, data.return = FALSE
)
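The quantities plotted in a volcano plot, and the way the padj and lfc thresholds partition the genes, can be sketched in base R. All values below are simulated. The Benjamini-Hochberg adjustment, the fold-change cutoff of 1, and the status labels are assumptions chosen for illustration, not a description of the package internals.

```r
# Hypothetical differential expression results (values are simulated)
set.seed(11)
pval <- c(runif(300), runif(100, 0, 1e-4))
res <- data.frame(
  id   = paste0("gene", seq_along(pval)),
  lfc  = rnorm(length(pval), sd = 1.5),       # log2 fold change
  pval = pval,
  padj = p.adjust(pval, method = "BH")        # assumed adjustment method
)

padj_cutoff <- 0.05
lfc_cutoff  <- 1   # assumed threshold on |log2 fold change|

# y-axis of the volcano plot
res$neg_log10_p <- -log10(res$pval)

# Label each gene by significance and direction of change
res$status <- "not significant"
sig <- res$padj <= padj_cutoff & abs(res$lfc) >= lfc_cutoff
res$status[sig & res$lfc > 0] <- "up"
res$status[sig & res$lfc < 0] <- "down"
print(table(res$status))
```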
Similar to the prior matrix functions, vsVolcanoMatrix() will produce visualizations for all comparisons within your data set. For more information on how the aesthetics are plotted in these visualizations, please refer to the figure caption and Method S1.
vsVolcanoMatrix(
data = df.cuff, d.factor = NULL, type = 'cuffdiff',
padj = 0.05, x.lim = NULL, lfc = NULL, title = TRUE,
legend = TRUE, grid = TRUE, counts = TRUE
)
Four-way plots are created with the vsFourWay() function. This plot compares the \(\log_2\) fold changes between two samples and a ‘control’. For more information on how each aesthetic is plotted, please refer to the figure captions and Method S1.
vsFourWay(
x = 'iPS', y = 'hESC', control = 'Fibroblasts', data = df.cuff,
d.factor = NULL, type = 'cuffdiff', padj = 0.05, x.lim = NULL,
y.lim = NULL, lfc = NULL, legend = TRUE, title = TRUE, grid = TRUE
)
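The idea of comparing two samples against a shared control can be sketched as follows. The expression values are simulated, and the quadrant labels are hypothetical names introduced only for this example.

```r
# Hypothetical mean expression for two samples and a control
set.seed(5)
n_genes <- 250
control <- rlnorm(n_genes, meanlog = 4)   # stands in for 'Fibroblasts'
x <- control * 2^rnorm(n_genes)           # stands in for 'iPS'
y <- control * 2^rnorm(n_genes)           # stands in for 'hESC'

# log2 fold change of each sample relative to the shared control
lfc_x <- log2(x / control)
lfc_y <- log2(y / control)

# Quadrant of each gene in the four-way plot
quadrant <- ifelse(lfc_x >= 0 & lfc_y >= 0, "up in both",
            ifelse(lfc_x <  0 & lfc_y <  0, "down in both",
            ifelse(lfc_x >= 0, "up in x only", "up in y only")))
print(table(quadrant))
```

Genes in the "up in both" and "down in both" quadrants respond in the same direction in both samples relative to the control, while the other two quadrants hold genes whose response differs between the samples.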
The shape and size of each data point will also change depending on several conditions. To maximize the viewing area while retaining high resolution, some data points will fall outside the viewing area. Points that exceed the viewing area change shape from a circle to a triangle.
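A minimal sketch of this behavior, assuming a viewing window of [-4, 4] on the fold-change axis and assuming that out-of-view points are drawn on the border of the window: both the window and the placement rule are illustrative choices, not the package's documented implementation.

```r
# Hypothetical log2 fold changes and an assumed viewing window of [-4, 4]
lfc   <- c(-6.2, -1.5, 0.3, 2.8, 5.1, 9.7)
y_lim <- c(-4, 4)

# Points outside the window are moved onto its border ...
clipped <- pmin(pmax(lfc, y_lim[1]), y_lim[2])

# ... and flagged so they can be drawn with a different shape
out_of_view <- lfc < y_lim[1] | lfc > y_lim[2]
shape <- ifelse(out_of_view, "triangle", "circle")

print(data.frame(lfc, clipped, shape))
```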
The extent (i.e. fold change) to which these points exceed the viewing area is based on the following criteria:
To further clarify these conditions, please refer to the following figure:
Function efficiencies were determined by measuring system times with the microbenchmark R package. Each function was run 100 times with the code shown earlier in the documentation. All benchmarks were performed on a machine running a 64-bit Windows 10 operating system, with 8 GB of RAM and an Intel Core i5-6400 processor running at 2.7 GHz.
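The repeated-timing procedure can be approximated without additional packages. The sketch below swaps in base R's system.time() for microbenchmark, so its timings are coarser than those reported here, and the function f is a hypothetical stand-in for one of the plotting calls.

```r
# Stand-in for a plotting call; replace with the function to be timed
f <- function() sum(sqrt(seq_len(1e4)))

n_runs <- 100
elapsed <- numeric(n_runs)
for (i in seq_len(n_runs)) {
  # Wall-clock time of a single evaluation, in seconds
  elapsed[i] <- system.time(f())[["elapsed"]]
}

# Distribution of elapsed time across the runs
print(summary(elapsed))
```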