DEBrowser:

Interactive Differential Expression Analysis Tool

Introduction

Differential expression (DE) analysis has become an increasingly popular tool in determining and viewing up and/or down expressed genes between two sets of samples. The goal of differential gene expression analysis is to find genes or transcripts whose difference in expression, when accounting for the variance within condition, is higher than expected by chance. DESeq2 is an R package available via Bioconductor and is designed to normalize count data from high-throughput sequencing assays such as RNA-Seq and test for differential expression (Love et al. 2014). With multiple parameters such as padjust values, log fold changes, plot styles, and so on, altering plots created with your DE data can be a hassle as well as time consuming. The Differential Expression Browser uses DESeq2 (Love et al., 2014), EdgeR (Robinson et al., 2010), and Limma (Ritchie et al., 2015) coupled with shiny (Chang, W. et al., 2016) to produce real-time changes within your plot queries and allows for interactive browsing of your DE results. In addition to DE analysis, DEBrowser also offers a variety of other plots and analysis tools to help visualize your data even further.

DEBrowser

DEBrowser utilizes Shiny, a R based application development tool that creates a wonderful interactive user interface (UI) combined with all of the computing prowess of R. After the user has selected the data to analyze and has used the shiny UI to run DE analysis, the results are then input to DEBrowser. DEBrowser manipulates your results in a way that allows for interactive plotting by which changing padj or fold change limits also changes the displayed graph(s). For more details about these plots and tables, please visit our quick start guide for some helpful tutorials.

For comparisons against other popular data visualization tools, see the comparison table below (Figure 40).

Quick start

Before you start; you will have to install R and/or RStudio.

# Installation instructions:
# 1. Install DEBrowser and its dependencies by running the lines below
# in R or RStudio.

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("debrowser")

# 2. Load the library

library(debrowser)

# 3. Start DEBrowser

startDEBrowser()

Please check Operating System Dependencies section, in case your operating system requires packages to be installed.

Browsing your Data

Data via TSV file

Once you’ve made your way to the website, or you have a local instance of DEBrowser running, you will be greeted with data loading section:

Figure 1. Data loading.

To begin the analysis, you need to upload your count data file (comma or semicolon separated (CSV), and tab separated (TSV) format) to be analyzed and choose appropriate separator for the file (comma, semicolon or tab).

Gene quantifications table can be obtained running standard software like HTSeq (Anders,S. et al, 2014) or RSEM (Li and Dewey, 2011). The file values must contain the gene, transcript(s), and the sample raw count values you wish to enter into DEBrowser.

If you do not have a dataset to upload, you can use the built in demo data file by clicking on the ‘Load Demo (Vernia et al.)!’ button. To view the entire demo data file, you can download this demo set: https://umms.dolphinnext.com/pub/debrowser/simple_demo.tsv . For another example, try our full dataset (Vernia et. al): https://umms.dolphinnext.com/pub/debrowser/advanced_demo.tsv .

The structure of the count data files are shown below:

gene transcript exper_rep1 exper_rep2 control_rep1 control_rep2
DQ714826 uc007tfl.1 0.00 0.00 0.00 0.00
DQ551521 uc008bml.1 0.00 0.00 0.00 0.00
AK028549 uc011wpi.1 2.00 1.29 0.00 0.00

Please also note that, DEBrowser reads the gene names from the first column and skips other non numerical columns and starts reading the quantification values from the 3rd column in this case.

In addition to the count data file; you need to upload metadata file to correct for batch effects or any other normalizing conditions you might want to address that might be within your results. To handle for these conditions, simply create a metadata file by using the example table at below or download sample file from following link: https://umms.dolphinnext.com/pub/debrowser/simple_demo_meta.txt

sample batch condition
exper_rep1 1 A
exper_rep2 2 A
exper_rep3 1 A
control_rep1 2 B
control_rep2 1 B
control_rep3 2 B

Metadata file can be formatted with comma, semicolon or tab separators similar to count data files. These files used to establish different batch effects for multiple conditions. You can have as many conditions as you may require, as long as all of the samples are present.

  • The example above would result in the first set of conditions as exper_rep1, exper_rep2, exper_rep3 from A and second set of conditions as control_rep1, control_rep2, control_rep3 from B as they correspond to those conditions in the condition column.

  • In the same way, ‘batch’ would have the first set as exper_rep1, exper_rep3, control_rep2 from 1 and second set as exper_rep2, control_rep1, control_rep3 from 2 as they correspond to those conditions in the batch column.

Once the count data and metadata files have been loaded in DEBrowser, you can click upload button to visualize your data as shown at below:

Figure 2. Upload Summary

After loading the gene quantification file, and if specified the metadata file containing your batch correction fields, you then have the option to filter low counts and conduct batch effect correction prior to your analysis. Alternatively, you may skip these steps and directly continue with differential expression analysis or view quality control (QC) information of your dataset.

Low Count Filtering

In this section, you can simultaneously visualize the changes of your dataset while filtering out the low count genes. Choose your filtration criteria from Filtering Methods box which is located just center of the screen. Three methods are available to be used:

After selection of filtering methods and entering threshold value, you can proceed by clicking Filter button which is located just bottom part of the Filtering Methods box. On the right part of the screen, your filtered dataset will be visualized for comparison as shown at figure below.

Figure 3. Filtering

You can easily compare following features, before and after filtering:

Figure 4. Show Data

Afterwards, you may continue your analysis with Batch Effect Correction or directly jump to differential expression analysis or view quality control (QC) information of your dataset.

Batch Effect Corrections

If specified metadata file containing your batch correction fields, then you have the option to conduct batch effect correction prior to your analysis. By adjusting parameters of Options box, you can investigate your character of your dataset. These parameters of the options box are explained as following:

Upon clicking submit button, comparison tables and plots will be created on the right part of the screen as shown below.

Figure 5. Batch Effect Correction - PCA

Figure 6. Batch Effect Correction - IQR

Figure 7. Batch Effect Correction - Density

You can investigate the changes on the data by comparing following features:

Since we have completed batch effect correction and normalization step, we can continue with one of the following options: ‘Go to DE Analysis’ and, ‘Go to QC plots!’. First option takes you to page where differential expression analyses are conducted with DESeq2, EdgeR or Limma. The second option, ‘Go to QC plots!’, takes you to a page where you can view quality control metrics of your data by PCA, All2All, Heatmap, Density, and IQR plots.

DE Analysis:

The first option, ‘Go to DE Analysis’, takes you to the next step where differential expression analyses are conducted.

Figure 8. DE Selection

If you need to remove samples from a condition, simply select the sample you wish to remove and hit the delete/backspace key. In case, you need to add a sample to a condition you can click on one of the condition text boxes to bring up a list of samples and then click on the sample you wish to add from the list and it will be added to the textbox for that comparison.

After clicking on the ‘Submit!’ button, DESeq2 will analyze your comparisons and store the results into separate data tables. It is important to note that the resulting data produced from DESeq is normalized. Upon finishing the DESeq analysis, a result table will appear which allows you to download the data by clicking “Download” button. To visualize the data with interactive plots please click on “Go to Main Plots!” button.

The Main Plots of DE Analysis:

Upon finishing the DESeq analysis, please click on Go to Main Plots! button which will open Main Plots tab where you will be able to view the interactive plots.

Figure 9. Info Tabs

The page will load with Scatter Plot. You can switch to Volcano Plot and MA Plot by using Plot Type section at the left side of the menu. Since these plots are interactive, you can click to zoom button on the top of the graph and select the area you would like to zoom in by drawing a rectangle. Please see the plots at below:

Figure 10. Main Plots

A. Scatter plot, B. Volcano plot, C. MA plot

  • Please keep in mind that to increase the performance of the generating graph, by default 10% of non-significant(NS) genes are used to generate plots. You might show all NS genes by please click Main Options button and change Background Data(%) to 100% on the left sidebar.

Figure 11. Background data

You can hover over the scatterplot points to display more information about the point selected. A few bargraphs will be generated for the user to view as soon as a scatterplot point is hovered over.

Figure 12. main plot hover

A. Hover on Fabp3 gene, B. Read Counts vs Samples, C. Read Counts vs Conditions

You also have a wide array of options when it comes to fold change cut-off levels, padj cut-off values, which comparison set to use, and dataset of genes to analyze.

Figure 13. main plot filters

  • It is important to note that when conducting multiple comparisons, the comparisons are labeled based on the order that they are input. If you don’t remember which samples are in your current comparison you can always view the samples in each condition at the top of the main plots.

Figure 14. Selected conditions

After DE analysis, you can always download the results in CSV format by clicking the Download Data button located under the Data Options. You can also download the plot or graphs by clicking on the download button at top of each plot or graph.

Differential Expression Calculations

DESeq2

For the details please check the user guide at this location: https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf

DESeq2 performs multiple steps in order to analyze the data you’ve provided for it. The first step is to indicate the condition that each column (experiment) in the table represent. You can group multiple samples into one condition column. DESeq2 will compute the probability that a gene is differentially expressed (DE) for ALL genes in the table. It outputs both a nominal and a multiple hypothesis corrected p-adjusted (padj) value using a negative binomial distribution.

Un-normalized counts

DESeq2 requires count data as input obtained from RNA-Seq or another high-throughput sequencing experiment in the form of matrix values. Here we convert un-integer values to integer to be able to run DESeq2. The matrix values should be un-normalized, since DESeq2 model internally corrects for library size. So, transformed or normalized values such as counts scaled by library size should not be used as input. Please use edgeR or limma for normalized counts.

Used parameters for DESeq2

  • fitType: Either “parametric”, “local”, or “mean” for the type of fitting of dispersions to the mean intensity. See estimate Dispersions for description.

  • betaPrior: Whether or not to put a zero-mean normal prior on the non-intercept coefficients See nbinomWaldTest for description of the calculation of the beta prior. By default, the beta prior is used only for the Wald test, but can also be specified for the likelihood ratio test.

  • testType: Either “Wald” or “LRT”, which will then use either Wald significance tests (defined by nbinomWaldTest), or the likelihood ratio test on the difference in deviance between a full and reduced model formula (defined by nbinomLRT)

  • rowsum.filter: Regions/Genes/Isoforms with total count (across all samples) below this value will be filtered out

EdgeR

For the details please check the user guide at this location: https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

Used parameters for EdgeR

  • Normalization: Calculate normalization factors to scale the raw library sizes. Values can be “TMM”, “RLE”, ”upperquartile”, or ”none”.

  • Dispersion: Either a numeric vector of dispersions or a character string indicating that dispersions should be taken from the data object.

  • testType: exactTest or glmLRT. exactTest: Computes p-values for differential abundance for each gene between two samples, conditioning on the total count for each gene. The counts in each group are assumed to follow a binomial distribution. glmLRT: Fits a negative binomial generalized log-linear model to the read counts for each gene and conducts genewise statistical tests.

  • rowsum.filter: Regions/Genes/Isoforms with total count (across all samples) below this value will be filtered out

Limma

For the details please check the user guide at this location: https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf

Limma is a package to analyze of microarray or RNA-Seq data. If data is normalized with spike-in or any other scaling, transformation or normalization method, Limma can be ideal. In that case, prefer limma rather than DESeq2 or EdgeR.

Used parameters for Limma

  • Normalization: Calculate normalization factors to scale the raw library sizes. Values can be “TMM”,”RLE”,”upperquartile”, or ”none”.

  • Fit Type: Fitting method: “ls” for least squares or “robust” for robust regression

  • Norm. Bet. Arrays: Normalization Between Arrays; Normalizes expression intensities so that the intensities or log-ratios have similar distributions across a set of arrays.

  • rowsum.filter: Regions/Genes/Isoforms with total count (across all samples) below this value will be filtered out

The Heatmap of DE Analysis

Once you’ve selected a specific region on Main Plots (Scatter, Volcano or MA plot), a new heatmap of the selected area will appear just next to your plot. If you want to hide some groups (such as Up, Down or NS based on DE analysis), just click on the group label on the top right part of the figure. In this way, you can select a specific part of the genes by lasso select or box select tools that includes only Up or Down Regulated genes. As soon as you completed your selection, heatmap will be created simultaneously.

Figure 15. main plot selection

A. Box Selection, B. Lasso Selection, C. Created heatmap based on selection

  • We strongly recommend normalization before plotting heatmaps. To normalize, please change the parameters that are located under: Data options -> Normalization Methods and select the method from the dropdown box.

Used clustering and linkage methods in heatmap

  • complete: Complete-linkage clustering is one of the linkage method used in hierarchical clustering. In each step of clustering, closest cluster pairs are always merged up to a specified distance threshold. Distance between clusters for complete link clustering is the maximum of the distances between the members of the clusters.

  • ward D2: Ward method aims to find compact and spherical clusters. The distance between two clusters is calculated by the sum of squared deviations from points to centroids. “ward.D2” method uses criterion (Murtagh and Legendre 2014) to minimize ward clustering method. The only difference ward.D2 and ward is the dissimilarities from ward method squared before cluster updating. This method tends to be sensitive to the outliers.

  • single: Distance between clusters for single linkage is the minimum of the distances between the members of the clusters.

  • average: Distance between clusters for average linkage is the average of the distances between the members of the clusters.

  • mcquitty: mcquitty linkage is when two clusters are joined, the distance of the new cluster to any other cluster is calculated by the average of the distances of the soon to be joined clusters to that other cluster.

  • median: This is a different averaging method that uses the median instead of the mean. It is used to reduce the effect of outliers.

  • centroid: The distance between cluster pairs is defined as the Euclidean distance between their centroids or means.

Used distance methods in heatmap

  • cor: 1 - cor(x) are used to define the dissimilarity between samples. It is less sensitive to the outliers and scaling.

  • euclidean: It is the most common use of distance. It is sensitive to the outliers and scaling. It is defined as the square root of the sum of the square differences between gene counts.

  • maximum: The maximum distance between two samples is the sum of the maximum expression value of the corresponding genes.

  • manhattan: The Manhattan distance between two samples is the sum of the differences of their corresponding genes.

  • canberra: Canberra distance is similar to the Manhattan distance and it is a special form of the Minkowski distance. The difference is that the absolute difference between the gene counts of the two genes is divided by the sum of the absolute counts prior to summing.

  • minkowsky: It is generalized form of euclidean distance.

Interactive Heatmap

You can also select to view an interactive version of the heatmap by clicking on the ‘Interactive’ checkbox on the left panel under the height and width options. Selecting this feature changes the heatmap into an interactive version with two colors, allowing you to select specific genes to be compared within the GO term plots.

Just like in the Main Plots, you can click and drag to create a selection. To select a specific portion of the heatmap, make sure to highlight the middle of the heatmap gene box in order to fully select a specific gene. This selection can be used later within the GO Term plots for specific queries on your selection.

Figure 16. interactive heatmap

A. Before Selection B. Selection of area with zoom tool C. Zoomed heatmap region which allows better viewing resolution.

Interactive Feature: In order to increase the performance of the generating heatmaps, interactive option is disabled by default. After deciding plotting/clustering parameters of the heatmap, you might activate this feature to investigate each block in detail.

The Scale Option of Heatmap

By using Scale Option field on the left sidebar menu, it is possible to adjust scaling parameters of DEBrowser. There are four main options:

  1. Center: If it is checked then centering is done by subtracting the column means of data from their corresponding columns. Otherwise no centering is done.(Default value:Checked)
  2. Scale: The value of scale determines how column scaling is performed (after centering). If scale is checked then scaling is done by dividing the (centered) columns of the data by their standard deviations if center is checked, and the root mean square if center is unchecked. If scale is unchecked, no scaling is done.(Default value:Checked)
  3. Log: The value of log determines the log2 operation of data matrix (Default value:Checked)
  4. Pseudo-Count: This value added to each element to prevent getting undefined (logarithm of zero) before calculation of log2(Default value:0.1)

heatmap scale

GO Term Plots

The next tab, ‘GO Term’, takes you to the ontology comparison portion of DEBrowser. From here you can select the standard dataset options such as p-adjust value, fold change cut off value, which comparison set to use, and which dataset to use on the left menu. In addition to these parameters, you also can choose from the 4 different ontology plot options: ‘enrichGO’, ‘enrichKEGG’, ‘Disease’, and ‘compareCluster’. Selecting one of these plot options queries their specific databases with your current DESeq results.

Figure 17. GO plots opts

Your GO plots include:

The types of plots you will be able to generate include:

Summary plot:

Figure 18. go summary

GOdotplot:

Figure 19. go dot plot

Changing the type of ontology to use will also produce custom parameters for that specific ontology at the bottom of the left option panel.

Once you have adjusted all of your parameters, you may hit the submit button in the top right and then wait for the results to show on screen!

Data Tables

The last tab at the top of the screen displays various different data tables. These datatables include: