Interactive Differential Expression Analysis Tool
Differential expression (DE) analysis has become an increasingly popular tool in determining and viewing up and/or down expressed genes between two sets of samples. The goal of differential gene expression analysis is to find genes or transcripts whose difference in expression, when accounting for the variance within condition, is higher than expected by chance. DESeq2 is an R package available via Bioconductor and is designed to normalize count data from high-throughput sequencing assays such as RNA-Seq and test for differential expression (Love et al. 2014). With multiple parameters such as padjust values, log fold changes, plot styles, and so on, altering plots created with your DE data can be a hassle as well as time consuming. The Differential Expression Browser uses DESeq2 (Love et al., 2014), EdgeR (Robinson et al., 2010), and Limma (Ritchie et al., 2015) coupled with shiny (Chang, W. et al., 2016) to produce real-time changes within your plot queries and allows for interactive browsing of your DE results. In addition to DE analysis, DEBrowser also offers a variety of other plots and analysis tools to help visualize your data even further.
DEBrowser utilizes Shiny, a R based application development tool that creates a wonderful interactive user interface (UI) combined with all of the computing prowess of R. After the user has selected the data to analyze and has used the shiny UI to run DE analysis, the results are then input to DEBrowser. DEBrowser manipulates your results in a way that allows for interactive plotting by which changing padj or fold change limits also changes the displayed graph(s). For more details about these plots and tables, please visit our quick start guide for some helpful tutorials.
For comparisons against other popular data visualization tools, see the comparison table below (Figure 40).
Before you start; you will have to install R and/or RStudio.
# Installation instructions:
# 1. Install DEBrowser and its dependencies by running the lines below
# in R or RStudio.
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("debrowser")
# 2. Load the library
library(debrowser)
# 3. Start DEBrowser
startDEBrowser()
Please check Operating System Dependencies section, in case your operating system requires packages to be installed.
Once you’ve made your way to the website, or you have a local instance of DEBrowser running, you will be greeted with data loading section:
Figure 1. Data loading.
To begin the analysis, you need to upload your count data file (comma or semicolon separated (CSV), and tab separated (TSV) format) to be analyzed and choose appropriate separator for the file (comma, semicolon or tab).
Gene quantifications table can be obtained running standard software like HTSeq (Anders,S. et al, 2014) or RSEM (Li and Dewey, 2011). The file values must contain the gene, transcript(s), and the sample raw count values you wish to enter into DEBrowser.
If you do not have a dataset to upload, you can use the built in demo data file by clicking on the ‘Load Demo (Vernia et al.)!’ button. To view the entire demo data file, you can download this demo set: https://umms.dolphinnext.com/pub/debrowser/simple_demo.tsv . For another example, try our full dataset (Vernia et. al): https://umms.dolphinnext.com/pub/debrowser/advanced_demo.tsv .
The structure of the count data files are shown below:
| gene | transcript | exper_rep1 | exper_rep2 | control_rep1 | control_rep2 |
|---|---|---|---|---|---|
| DQ714826 | uc007tfl.1 | 0.00 | 0.00 | 0.00 | 0.00 |
| DQ551521 | uc008bml.1 | 0.00 | 0.00 | 0.00 | 0.00 |
| AK028549 | uc011wpi.1 | 2.00 | 1.29 | 0.00 | 0.00 |
Please also note that, DEBrowser reads the gene names from the first column and skips other non numerical columns and starts reading the quantification values from the 3rd column in this case.
In addition to the count data file; you need to upload metadata file to correct for batch effects or any other normalizing conditions you might want to address that might be within your results. To handle for these conditions, simply create a metadata file by using the example table at below or download sample file from following link: https://umms.dolphinnext.com/pub/debrowser/simple_demo_meta.txt
| sample | batch | condition |
|---|---|---|
| exper_rep1 | 1 | A |
| exper_rep2 | 2 | A |
| exper_rep3 | 1 | A |
| control_rep1 | 2 | B |
| control_rep2 | 1 | B |
| control_rep3 | 2 | B |
Metadata file can be formatted with comma, semicolon or tab separators similar to count data files. These files used to establish different batch effects for multiple conditions. You can have as many conditions as you may require, as long as all of the samples are present.
The example above would result in the first set of conditions as exper_rep1, exper_rep2, exper_rep3 from A and second set of conditions as control_rep1, control_rep2, control_rep3 from B as they correspond to those conditions in the condition column.
In the same way, ‘batch’ would have the first set as exper_rep1, exper_rep3, control_rep2 from 1 and second set as exper_rep2, control_rep1, control_rep3 from 2 as they correspond to those conditions in the batch column.
Once the count data and metadata files have been loaded in DEBrowser, you can click upload button to visualize your data as shown at below:
Figure 2. Upload Summary
After loading the gene quantification file, and if specified the metadata file containing your batch correction fields, you then have the option to filter low counts and conduct batch effect correction prior to your analysis. Alternatively, you may skip these steps and directly continue with differential expression analysis or view quality control (QC) information of your dataset.
In this section, you can simultaneously visualize the changes of your dataset while filtering out the low count genes. Choose your filtration criteria from Filtering Methods box which is located just center of the screen. Three methods are available to be used:
After selection of filtering methods and entering threshold value, you can proceed by clicking Filter button which is located just bottom part of the Filtering Methods box. On the right part of the screen, your filtered dataset will be visualized for comparison as shown at figure below.
Figure 3. Filtering
You can easily compare following features, before and after filtering:
Number of genes/regions.
Read counts for each sample.
Overall histogram of the dataset.
gene/region vs samples data
To investigate the gene/region vs samples data in detail as shown at below, you may click the Show Data button, located bottom part of the data tables. Alternatively, you may download all filtered data by clicking Download button which located next to Show Data button.
Figure 4. Show Data
Afterwards, you may continue your analysis with Batch Effect Correction or directly jump to differential expression analysis or view quality control (QC) information of your dataset.
If specified metadata file containing your batch correction fields, then you have the option to conduct batch effect correction prior to your analysis. By adjusting parameters of Options box, you can investigate your character of your dataset. These parameters of the options box are explained as following:
Upon clicking submit button, comparison tables and plots will be created on the right part of the screen as shown below.
Figure 5. Batch Effect Correction - PCA
Figure 6. Batch Effect Correction - IQR
Figure 7. Batch Effect Correction - Density
You can investigate the changes on the data by comparing following features:
Read counts for each sample.
PCA, IQR and Density plot of the dataset.
Gene/region vs samples data
You can investigate the gene/region vs samples data in detail by clicking the Show Data button, or download all corrected data by clicking Download button.
Since we have completed batch effect correction and normalization step, we can continue with one of the following options: ‘Go to DE Analysis’ and, ‘Go to QC plots!’. First option takes you to page where differential expression analyses are conducted with DESeq2, EdgeR or Limma. The second option, ‘Go to QC plots!’, takes you to a page where you can view quality control metrics of your data by PCA, All2All, Heatmap, Density, and IQR plots.
The first option, ‘Go to DE Analysis’, takes you to the next step where differential expression analyses are conducted.
Condition 1 and Condition 2 based on the treatment column of the metadata as shown below.Figure 8. DE Selection
If you need to remove samples from a condition, simply select the sample you wish to remove and hit the delete/backspace key. In case, you need to add a sample to a condition you can click on one of the condition text boxes to bring up a list of samples and then click on the sample you wish to add from the list and it will be added to the textbox for that comparison.
You can add multiple conditions to compare by clicking on “Add New Comparison” button, and view the results separately after DE analysis.
Method Selection: Three DE methods are available for DEBrowser: DESeq2, EdgeR, and Limma. DESeq2 and EdgeR are designed to normalize count data from high-throughput sequencing assays such as RNA-Seq. On the other hand, Limma is a package to analyse of normalized or transformed data from microarray or RNA-Seq assays. We have selected DESeq2 for our test sample and showed the related results at below.
After clicking on the ‘Submit!’ button, DESeq2 will analyze your comparisons and store the results into separate data tables. It is important to note that the resulting data produced from DESeq is normalized. Upon finishing the DESeq analysis, a result table will appear which allows you to download the data by clicking “Download” button. To visualize the data with interactive plots please click on “Go to Main Plots!” button.
Upon finishing the DESeq analysis, please click on Go to Main Plots! button which will open Main Plots tab where you will be able to view the interactive plots.
Figure 9. Info Tabs
The page will load with Scatter Plot. You can switch to Volcano Plot and MA Plot by using Plot Type section at the left side of the menu. Since these plots are interactive, you can click to zoom button on the top of the graph and select the area you would like to zoom in by drawing a rectangle. Please see the plots at below:
Figure 10. Main Plots
A. Scatter plot, B. Volcano plot, C. MA plot
Figure 11. Background data
You can hover over the scatterplot points to display more information about the point selected. A few bargraphs will be generated for the user to view as soon as a scatterplot point is hovered over.
Figure 12. main plot hover
A. Hover on Fabp3 gene, B. Read Counts vs Samples, C. Read Counts vs Conditions
You also have a wide array of options when it comes to fold change cut-off levels, padj cut-off values, which comparison set to use, and dataset of genes to analyze.
Figure 13. main plot filters
Figure 14. Selected conditions
After DE analysis, you can always download the results in CSV format by clicking the Download Data button located under the Data Options. You can also download the plot or graphs by clicking on the download button at top of each plot or graph.
For the details please check the user guide at this location: https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
DESeq2 performs multiple steps in order to analyze the data you’ve provided for it. The first step is to indicate the condition that each column (experiment) in the table represent. You can group multiple samples into one condition column. DESeq2 will compute the probability that a gene is differentially expressed (DE) for ALL genes in the table. It outputs both a nominal and a multiple hypothesis corrected p-adjusted (padj) value using a negative binomial distribution.
DESeq2 requires count data as input obtained from RNA-Seq or another high-throughput sequencing experiment in the form of matrix values. Here we convert un-integer values to integer to be able to run DESeq2. The matrix values should be un-normalized, since DESeq2 model internally corrects for library size. So, transformed or normalized values such as counts scaled by library size should not be used as input. Please use edgeR or limma for normalized counts.
fitType: Either “parametric”, “local”, or “mean” for the type of fitting of dispersions to the mean intensity. See estimate Dispersions for description.
betaPrior: Whether or not to put a zero-mean normal prior on the non-intercept coefficients See nbinomWaldTest for description of the calculation of the beta prior. By default, the beta prior is used only for the Wald test, but can also be specified for the likelihood ratio test.
testType: Either “Wald” or “LRT”, which will then use either Wald significance tests (defined by nbinomWaldTest), or the likelihood ratio test on the difference in deviance between a full and reduced model formula (defined by nbinomLRT)
rowsum.filter: Regions/Genes/Isoforms with total count (across all samples) below this value will be filtered out
For the details please check the user guide at this location: https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
Normalization: Calculate normalization factors to scale the raw library sizes. Values can be “TMM”, “RLE”, ”upperquartile”, or ”none”.
Dispersion: Either a numeric vector of dispersions or a character string indicating that dispersions should be taken from the data object.
testType: exactTest or glmLRT. exactTest: Computes p-values for differential abundance for each gene between two samples, conditioning on the total count for each gene. The counts in each group are assumed to follow a binomial distribution. glmLRT: Fits a negative binomial generalized log-linear model to the read counts for each gene and conducts genewise statistical tests.
rowsum.filter: Regions/Genes/Isoforms with total count (across all samples) below this value will be filtered out
For the details please check the user guide at this location: https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf
Limma is a package to analyze of microarray or RNA-Seq data. If data is normalized with spike-in or any other scaling, transformation or normalization method, Limma can be ideal. In that case, prefer limma rather than DESeq2 or EdgeR.
Normalization: Calculate normalization factors to scale the raw library sizes. Values can be “TMM”,”RLE”,”upperquartile”, or ”none”.
Fit Type: Fitting method: “ls” for least squares or “robust” for robust regression
Norm. Bet. Arrays: Normalization Between Arrays; Normalizes expression intensities so that the intensities or log-ratios have similar distributions across a set of arrays.
rowsum.filter: Regions/Genes/Isoforms with total count (across all samples) below this value will be filtered out
Once you’ve selected a specific region on Main Plots (Scatter, Volcano or MA plot), a new heatmap of the selected area will appear just next to your plot. If you want to hide some groups (such as Up, Down or NS based on DE analysis), just click on the group label on the top right part of the figure. In this way, you can select a specific part of the genes by lasso select or box select tools that includes only Up or Down Regulated genes. As soon as you completed your selection, heatmap will be created simultaneously.
Figure 15. main plot selection
A. Box Selection, B. Lasso Selection, C. Created heatmap based on selection
complete: Complete-linkage clustering is one of the linkage method used in hierarchical clustering. In each step of clustering, closest cluster pairs are always merged up to a specified distance threshold. Distance between clusters for complete link clustering is the maximum of the distances between the members of the clusters.
ward D2: Ward method aims to find compact and spherical clusters. The distance between two clusters is calculated by the sum of squared deviations from points to centroids. “ward.D2” method uses criterion (Murtagh and Legendre 2014) to minimize ward clustering method. The only difference ward.D2 and ward is the dissimilarities from ward method squared before cluster updating. This method tends to be sensitive to the outliers.
single: Distance between clusters for single linkage is the minimum of the distances between the members of the clusters.
average: Distance between clusters for average linkage is the average of the distances between the members of the clusters.
mcquitty: mcquitty linkage is when two clusters are joined, the distance of the new cluster to any other cluster is calculated by the average of the distances of the soon to be joined clusters to that other cluster.
median: This is a different averaging method that uses the median instead of the mean. It is used to reduce the effect of outliers.
centroid: The distance between cluster pairs is defined as the Euclidean distance between their centroids or means.
cor: 1 - cor(x) are used to define the dissimilarity between samples. It is less sensitive to the outliers and scaling.
euclidean: It is the most common use of distance. It is sensitive to the outliers and scaling. It is defined as the square root of the sum of the square differences between gene counts.
maximum: The maximum distance between two samples is the sum of the maximum expression value of the corresponding genes.
manhattan: The Manhattan distance between two samples is the sum of the differences of their corresponding genes.
canberra: Canberra distance is similar to the Manhattan distance and it is a special form of the Minkowski distance. The difference is that the absolute difference between the gene counts of the two genes is divided by the sum of the absolute counts prior to summing.
minkowsky: It is generalized form of euclidean distance.
You can also select to view an interactive version of the heatmap by clicking on the ‘Interactive’ checkbox on the left panel under the height and width options. Selecting this feature changes the heatmap into an interactive version with two colors, allowing you to select specific genes to be compared within the GO term plots.
Just like in the Main Plots, you can click and drag to create a selection. To select a specific portion of the heatmap, make sure to highlight the middle of the heatmap gene box in order to fully select a specific gene. This selection can be used later within the GO Term plots for specific queries on your selection.
Figure 16. interactive heatmap
A. Before Selection B. Selection of area with zoom tool C. Zoomed heatmap region which allows better viewing resolution.
Interactive Feature: In order to increase the performance of the generating heatmaps, interactive option is disabled by default. After deciding plotting/clustering parameters of the heatmap, you might activate this feature to investigate each block in detail.
By using Scale Option field on the left sidebar menu, it is possible to adjust scaling parameters of DEBrowser. There are four main options:
heatmap scale
The next tab, ‘GO Term’, takes you to the ontology comparison portion of DEBrowser. From here you can select the standard dataset options such as p-adjust value, fold change cut off value, which comparison set to use, and which dataset to use on the left menu. In addition to these parameters, you also can choose from the 4 different ontology plot options: ‘enrichGO’, ‘enrichKEGG’, ‘Disease’, and ‘compareCluster’. Selecting one of these plot options queries their specific databases with your current DESeq results.
Figure 17. GO plots opts
Your GO plots include:
The types of plots you will be able to generate include:
Summary plot:
Figure 18. go summary
GOdotplot:
Figure 19. go dot plot
Changing the type of ontology to use will also produce custom parameters for that specific ontology at the bottom of the left option panel.
Once you have adjusted all of your parameters, you may hit the submit button in the top right and then wait for the results to show on screen!
The last tab at the top of the screen displays various different data tables. These datatables include: