0.1 Example six-step outline

This section provides a recommended pipeline for researchers who are using the bigPint package to visualize RNA-seq data. Researchers may chose to tailor this suggested pipeline accordingly. For instance, if a user is investigating an RNA-seq dataset with a very large number of samples, the scatterplot matrix may be difficult to use on all samples at once. As a result, the user can simply not use the scatterplot matrix or may choose to perform the scatterplot matrix on random subsets of samples.

The first two steps should be performed on the normalized count table (i.e. Data object) before DEG designation:

1. Create a side-by-side boxplot on full data (before DEG designation)

  • Check that five number summary is consistent across samples. Deviation may indicate that the data require a different normalization technique.

2. Create scatterplot matrix on full data (before DEG designation)

  • Check that most genes fall along the x=y line. Deviation may indicate that the data require a different normalization technique.
  • Check that treatment scatterplots have larger variability than replicate scatterplots. If the reverse is seen in some scatterplots, then samples may have been mislabeled.
  • Check for strange geometric features. Streaks of outliers in several scatterplots may require specific normalization techniques. Streaks of outliers in replicate scatterplots may capture genes that were inadvertently differentially expressed due to unintended differences in replicates.

Once the normalized count table (i.e. Data object) passes the first two steps and any inadequate normalization and/or questionable patterns have been accounted for, then the user should apply a model through packages like edgeR (Robinson, McCarthy, and Smyth 2010), DESeq2 (Love, Huber, and Anders 2014), or limma (Ritchie et al. 2015)) to obtain statistical (e.g. p-value) and quantitiatve (e.g. log fold change) values for each gene in the dataset. After that, the user should continue with the last four steps. For these steps, you will need the normalized count table (i.e. Data object) and the DEG designation table (i.e. Data metrics object).

3. Perform hierarchical clustering and plot the standardized parallel coordinate lines for each cluster of significant genes.

  • Check that parallel coordinate lines appear as DEGs should. Whole clusters may show questionable DEG calls or normalization issues.

4. Plot raw and standardized scatterplot matrices for each cluster of significant genes.

  • Check that DEG points overlaid on scatterplot matrix appear as DEGs should. Whole clusters may show questionable DEG calls or normalization issues. The standardized version may be helpful to magnify subtle patterns.

5. Plot raw and standardized litre plots for each cluster of significant genes.

  • Flip through litre plots for DEG points overlaid on litre plots and check that they appear as DEGs should. Whole clusters may show questionable DEG calls or normalization issues. The standardized version may be helpful to magnify subtle patterns.

6. Plot DEGs onto volcano plot.

  • Determine whether DEGs are not only statistically significant, but also have large magnitude changes.

0.2 Step 1: Side-by-side boxplot

First, we will read in our example data, soybean_cn_sub. This is public RNA-seq data derived from soybean cotyledon at different time points. We will consider two treatment groups (S1 and S3) that each have three replicates (Brown and Hudson 2015). We will refer to this count table as data. Check that your data object is in the format bigPint expects by refering to the article (i.e. Data object). An example of the data structure is shown below.

library(bigPint)
library(dplyr)
library(ggplot2)
library(plotly)
data("soybean_cn_sub")
data = soybean_cn_sub %>% select(ID, starts_with("S1"), starts_with("S3"))
str(data, strict.width = "wrap")
#> 'data.frame':    7332 obs. of  7 variables:
#> $ ID : chr "Glyma06g12670.1" "Glyma08g12390.2" "Glyma12g02076.11"
#>    "Glyma18g53450.1" ...
#> $ S1.1: num 0.802 4.769 3.19 0.802 3.19 ...
#> $ S1.2: num 2.709 5.236 2.902 0.802 2.48 ...
#> $ S1.3: num 1.763 5.167 2.907 0.802 3.262 ...
#> $ S3.1: num 8.557 3.669 3.364 0.802 4.4 ...
#> $ S3.2: num 8.368 4.031 2.731 0.802 3.42 ...
#> $ S3.3: num 8.389 4.269 3.256 0.802 4.144 ...

We will also create a standardized version of this count table, which we will refer to as data_st. In the standardized case, each gene will have a mean of zero and a standard deviation of one across its samples (Chandrasekhar, Thangavel, and Elayaraja 2012).

data_st <- as.data.frame(t(apply(as.matrix(data[,-1]), 1, scale)))
data_st$ID <- as.character(data$ID)
data_st <- data_st[,c(length(data_st), 1:length(data_st)-1)]
colnames(data_st) <- colnames(data)
nID <- which(is.nan(data_st[,2]))
data_st[nID,2:length(data_st)] <- 0
str(data_st, strict.width = "wrap")
#> 'data.frame':    7332 obs. of  7 variables:
#> $ ID : chr "Glyma06g12670.1" "Glyma08g12390.2" "Glyma12g02076.11"
#>    "Glyma18g53450.1" ...
#> $ S1.1: num -1.158 0.386 0.533 0 -0.42 ...
#> $ S1.2: num -0.644 1.121 -0.633 0 -1.44 ...
#> $ S1.3: num -0.899 1.012 -0.615 0 -0.317 ...
#> $ S3.1: num 0.933 -1.345 1.241 0 1.318 ...
#> $ S3.2: num 0.8816 -0.7748 -1.3259 0 -0.0904 ...
#> $ S3.3: num 0.887 -0.4 0.8 0 0.95 ...

Next, we will generate a side-by-side boxplot of this data. We can check that the distribution of read counts (their five number summaries) is consistent across the six samples. If a user does not find that the side-by-side boxplots show consistent read count distributions across the samples, then they may wish to renormalize and/or remove outliers, using packages like edgeR (Robinson, McCarthy, and Smyth 2010), DESeq2 (Love, Huber, and Anders 2014), or limma (Ritchie et al. 2015)).

ret <- plotPCP(data=data, saveFile = FALSE)
ret[["S1_S3"]]


0.3 Step 2: Scatterplot matrix

Examining the normalized count table as a scatterplot matrix can help us check for normalization problems, strange geometric features that need to be considered, and problems in variability between replicates and treatments. Unlike the side-by-side boxplot, it allows us to investigate each individual gene in the data. We confirm in our plot below that:

  • The nine treatment scatterplots show more variability around the x=y line than the six replicate scatterplots
  • The data appears normalized (the genes center on the x=y line in each scatterplot)
  • There are no unexpected geometric features or outlier streaks.
ret <- plotSM(data=data, saveFile = FALSE)
ret[["S1_S3"]]

Our normalized data appeared as expected in the scatterplot matrix above, but we briefly show three example cases users can look out for when using the scatterplot matrix to detect issues in their data. The first case is when the variability around the x=y line is not as expected for the treatment versus replicate scatterplots. Notice how the scatterplot below has nine scatterplots with large variation around their x=y line and six scatterplots with little variation around their x=y line? These are the numbers we expect, but we expect the nine scatterplots with large variation to all be located in the bottom-left corner of the matrix as they should belong to the treatment scatterplots. In an example like this, the user may wish to double-check that samples were not switched to cause such unexpected trends in sample variability. In fact, for didactic purposes, we deliberately switched samples S1.3 and S3.1.

dataSwitch <- data[,c(1:3, 5, 4, 6:7)]
colnames(dataSwitch) <- colnames(data)
ret <- plotSM(data=dataSwitch, saveFile = FALSE)
ret[["S1_S3"]]

The second case occurs when the scatterplot matrix reveals normalization issues. The plot below shows one such example from a public dataset of Saccharomyces cerevisiae (yeast) grown in YPGlucose (YPD) (Risso et al. 2011). The deviation of genes from the x=y line instantly reveals that the RNA-seq dataset was not thoroughly normalized using within-lane normalization (subplot A). However, within-lane normalization followed by between-lane normalization sufficiently normalized the data (subplot B). This example shows that the user can work iteratively between graphics and models: They can update model parameters and normalization techniques until their visualizations show that the model now makes sense.