1 Introduction

In comparative RNA sequencing (RNA-seq) experiments, selecting the appropriate sample size is an important optimization step [1]. Empirical RNA-seq Sample Size Analysis (ERSSA) is a R software package designed to test whether an existing RNA-seq dataset has sufficient biological replicates to detect a majority of the differentially expressed genes (DEGs) between two conditions. In contrast to existing RNA-seq sample size analysis algorithms, ERSSA does not rely on any a priori assumptions about the data [2]. Rather, ERSSA takes a user-supplied RNA-seq sample set and evaluates the incremental improvement in identification of DEGs with each increase in sample size up to the total samples provided, enabling the user to determine whether sufficient biological replicates have been included.

Based on the number of replicates available (N for each of the two conditions), the algorithm subsamples at each step-wise replicate levels (n= 2, 3, 4, …, N-1) and uses existing differential expression (DE) analysis software (e.g., edgeR [8] and DESeq2 [9]) to measure the number of DEGs. As N increases, the set of all distinct subsamples for a particular n can be very large and experience with ERSSA shows that it is not necessary to evaluate the entire set. Instead, 30-50 subsamples at each replicate level typically provide sufficient evidence to evaluate the marginal return for each increase in sample size. If the number of DEGs identified is similar for n = N-2, N-1 and N, there may be little to be gained by analyzing further replicates. Conversely, if the number of DEGs identified is still increasing strongly as n approaches N, the user can expect to identify significantly more DEGs if they acquire additional samples.

When applied to a diverse set of RNA-seq experimental settings (human tissue, human population study and in vitro cell culture), ERSSA demonstrated proficiency in determining whether sufficient biological replicates have been included. Overall, ERSSA can be used as a flexible and easy-to-use tool that offers an alternative approach to identify the appropriate sample size in comparative RNA-seq studies.

2 Installation

Install the latest stable version of ERSSA by entering the following commands in R console:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ERSSA")

3 Usage

3.1 Utility

In this vignette, we demonstrate ERSSA’s analytical approach using an RNA-seq dataset containing 10 human heart samples and 10 skeletal muscle samples from GTEx [3] and ask whether 10 replicates are sufficient to identify a majority of DE genes (adjusted p-value < 0.05 and |log2(fold-change)| > 1). At the end of the ERSSA run, four plots are generated to summarize the results. For now, let’s briefly focus on the most important of these, the number of DEGs identified as a function of the number of replicates included in the analysis. In the present example, the average number of DEGs discovered increases approximately 3% from n=6 to n=7 and little to no improvement as n increases to 8 and beyond. This suggests that our example dataset with N=10 replicates is sufficient to identify the vast majority of DEGs. To verify this conclusion, an additional 15 human heart and 15 skeletal muscle samples from GTEx were added and the analysis was repeated with N=25. The results for n<10 obtained with N=25 gave similar mean and distribution of the number of DEGs identified as those obtained with N=10, validating the utility of the statistical subsampling approach. The rest of this vignette will further explore ERSSA’s functionalities using the 10-replicate GTEx heart vs. muscle dataset. We will also briefly go through two additional examples that help to illustrate the variety of experimental settings where ERSSA can be applied.

3.2 Load example data

First, let’s load the N=10 GTEx heart and muscle dataset into the R workspace. The data can be loaded into R directly from the ERSSA package using:

library(ERSSA)

# GTEx dataset with 10 heart and 10 muscle samples
# "full"" dataset contains all ensembl genes
data(condition_table.full)
data(count_table.full)

# For test purposes and faster run time, we will use a smaller "partial" dataset
# 4 heart and 4 muscle samples
# partial dataset contains 1000 genes
data(condition_table.partial)
data(count_table.partial)

# NOTE: the figures are generated from the "full" dataset

The original dataset was obtained from the recount2 project [6], which is a systematic effort to generate gene expression count tables from thousands of RNA-seq studies. To generate the objects loaded above, GTEx heart and muscle count tables were manually downloaded from the recount2 website and processed by the recount package. The first 10 samples were then selected and a corresponding condition table generated to complete this example dataset.

For any ERSSA analysis, we need a few essential inputs. First is the RNA-seq count table that contains genes on each row and samples on each column. ERSSA expects the input count table as a dataframe object with gene names as the index and sample IDs as column names. For example, the first few cells of our example count table looks like this:

head(count_table.full[,1:2])
##                    heart_SRR598148.txt heart_SRR598509.txt
## ENSG00000000003.14                 147                 153
## ENSG00000000005.5                    0                   5
## ENSG00000000419.12                 333                 222
## ENSG00000000457.13                 226                  95
## ENSG00000000460.16                 103                  64
## ENSG00000000938.12                 835                1093

Next, we need to supply a condition table in the form of a dataframe object that contains two columns and the same number of rows as the total number of samples. Column one contains the sample IDs exactly as they appear in the count tables and column two contains the corresponding condition names. Only two conditions are supported. Our full condition table is shown below:

condition_table.full
##                    name condition
## 1   heart_SRR598148.txt     heart
## 2   heart_SRR598509.txt     heart
## 3   heart_SRR598589.txt     heart
## 4   heart_SRR599025.txt     heart
## 5   heart_SRR599086.txt     heart
## 6   heart_SRR599249.txt     heart
## 7   heart_SRR599380.txt     heart
## 8   heart_SRR600474.txt     heart
## 9   heart_SRR600829.txt     heart
## 10  heart_SRR600852.txt     heart
## 11 muscle_SRR598044.txt    muscle
## 12 muscle_SRR598452.txt    muscle
## 13 muscle_SRR600656.txt    muscle
## 14 muscle_SRR600981.txt    muscle
## 15 muscle_SRR601387.txt    muscle
## 16 muscle_SRR601671.txt    muscle
## 17 muscle_SRR601695.txt    muscle
## 18 muscle_SRR601815.txt    muscle
## 19 muscle_SRR602010.txt    muscle
## 20 muscle_SRR603116.txt    muscle

Finally, we need to specify which condition to use as the control in the DE comparison. In this case, let’s set “heart” as the control condition.

3.3 Run ERSSA

With the count and condition tables prepared, we can now call the main erssa wrapper function to start the sample size analysis:

set.seed(1) # for reproducible subsample generation

ssa = erssa(count_table.partial, condition_table.partial, DE_ctrl_cond='heart')

# Running full dataset is skipped in the interest of run time
# ssa = erssa(count_table.full, condition_table.full, DE_ctrl_cond='heart')

With this command, the erssa wrapper function calls various ERSSA functions to perform the following calculations in sequence:

  1. Filter the count table by gene expression level to remove low-expressing genes.

  2. Generate unique subsamples (sample combinations) at each replicate level.

  3. Call one of the DE packages to perform DE analysis. Identify the genes that pass test statistic and fold change cutoffs.

  4. Generate plots to visualize the results.

By default, the erssa wrapper function will save the generated plots plus all of the generated data in the current working directory. An alternative path can be set using the path argument.

Note that, under default setting, the ERSSA calculations may require runtime in order of minutes. This is because for each subsample, a DE analysis is performed by calling the DE software. Thus, runtime is scaled linearly to the number of calls to the DE software and when hundreds of comparisons (in our example dataset: 8 replicate levels * 30 subsamples per level) need to be done, the entire calculation can take some time to complete. Fortunately, this issue can be mitigated by running the DE calculations in parallel (ERSSA uses the BiocParallel package to manage this [7]). This along with other ERSSA capabilities are further explained in the next section.

3.4 ERSSA in more detail

In this section, the steps erssa take are explained in more detail. Additionally, we will also cover the parameters that can be set to optimize the analysis for each user’s specific needs. Full descriptions and usage examples can be found in each function’s manual.

3.4.1 Filter count table

First, erssa calls the function count_filter to remove low-expressing genes from the count table. Filtering away low-expressing genes before differential expression comparison is a widely accepted practice and should be performed here to maximize discovery [4]. A gene-wise average Count Per Million (CPM) calculation is performed and at default, all genes with average CPM below 1 are removed from further analysis. The default CPM cutoff can be changed by adjusting the argument filter_cutoff. Additionally, if the user prefers to perform their own gene filtering prior to ERSSA (e.g., with the genefilter package [5]), a pre-filtered count table can be supplied and gene filter by ERSSA turned off by setting the argument counts_filtered=TRUE.

3.4.2 Generate subsample combinations

Next, ERSSA runs comb_gen function to randomly generate the subsample combinations. Briefly, at each replicate level (n=2 to N-1), this function employs a random process to sample from the entire dataset to generate at most 30 (at default) unique subsample combinations per condition. Note that only unique sample combinations are kept; so in the case where we select 5 samples out of 6 total replicates, only 6 unique combinations will be generated. Finally, at most 30 unique pairs of control vs. experimental samples are randomly selected for DE analysis by combining the lists from both conditions.

The number of subsample combinations per replicate level can be set with the comb_gen_repeat argument. The default value of 30 is based on the observation that for a majority of datasets we have analyzed, 30 combinations proved sufficient to expose the trend in the DE gene discovery as a function of replicate level n. However, we also noticed that for certain datasets with particularly high biological variance, ERSSA benefits from running additional combinations at the expense of longer runtime.

Since a random process is employed, each individual erssa run will generate an unique set of subsamples as long as all of the unique combinations have not been exhausted as set by the comb_gen_repeat argument. However, deterministic results can be achieved by setting the random seed through set.seed(seed) before running erssa. For example, we can exactly reproduce the plot in section 3.1 using set.seed(1). Generally, we found it useful to run several random seeds to confirm the overall conclusions are consistent across individual runs.

3.4.3 Start DE analysis

For each subsample (selected pair of control and experiment combinations generated), ERSSA calls a DE software such as edgeR or DESeq2 to perform the DE analysis. At default, edgeR is used as it is slightly faster in runtime compare to DESeq2. Alternatively, DESeq2 can be used instead of edgeR by including the argument: DE_software='DESeq2'. The analysis is done under the hood by either erssa_edger or erssa_deseq2 function. For now, only edgeR and DESeq2 are supported as they are two of the most widely used DE software. Additional DE packages can be added in the future.

As previously noted, ERSSA runtime can be significantly shortened by running the DE comparisons in parallel. To do this, ERSSA relies on the BiocParallel package with the number of workers (CPUs) set using the num_workers argument. Running parallel DE tests in ERSSA is handled by erssa_edger_parallel or erssa_deseq2_parallel functions.

One of the main goals of ERSSA is to identify the number of DE genes in each of the sample combinations. At default, the genes with adjusted p-value (or FDR) < 0.05 and |log2(fold-change)| > 1 are considered to be differentially expressed. Alternatively, the user can specify a more stringent p-value cutoff using the DE_cutoff_stat argument. Likewise, the |log2(fold-change)| cutoff can be set with the DE_cutoff_Abs_logFC argument. The latter may be necessary when the expected gene expression differences between the conditions are relatively small (e.g., weak stimulation of cells compare to control cells).

Once the DE genes have been identified, the DE analysis tables of results such as fold change, test statistics, etc., are not saved in an effort to conserve disk space. If one wishes to retain these results, the tables can be saved as csv files in a new folder in the path specified by setting DE_save_table=TRUE.

3.4.4 Plot results

Based on the DE analyses, ERSSA generates four summary plots to help the user interpret the results. The first plot (generated by ggplot2_dotplot function) displays the number of DEGs discovered in each differential comparison grouped by the number of replicates employed. Boxplots are used to summarize the data while a red solid line represents the mean number of DEGs across the replicate levels. Lastly, a blue dashed line indicates the number of DEGs discovered with the full dataset.