Contents

R version: R version 4.0.3 (2020-10-10)

Bioconductor version: 3.12

Package: 1.14.0

Abstract

RNA sequencing (RNA-seq) has become a very widely used technology for profiling gene expression. One of the most common aims of RNA-seq profiling is to identify genes or molecular pathways that are differentially expressed (DE) between two or more biological conditions. This article demonstrates a computational workflow for the detection of DE genes and pathways from RNA-seq data by providing a complete analysis of an RNA-seq experiment profiling epithelial cell subsets in the mouse mammary gland. The workflow uses R software packages from the open-source Bioconductor project and covers all steps of the analysis pipeline, including alignment of read sequences, data exploration, differential expression analysis, visualization and pathway analysis. Read alignment and count quantification is conducted using the Rsubread package and the statistical analyses are performed using the edgeR package. The differential expression analysis uses the quasi-likelihood functionality of edgeR.

Introduction

RNA sequencing (RNA-seq) has become a very widely used technology for profiling transcriptional activity in biological systems. One of the most common aims of RNA-seq profiling is to identify genes or molecular pathways that are differentially expressed (DE) between two or more biological conditions. Changes in expression can then be associated with differences in biology, providing avenues for further investigation into potential mechanisms of action.

This article provides a detailed workflow for analyzing an RNA-seq study all the way from the raw reads through to differential expression and pathway analysis using Bioconductor packages (Huber et al. 2015). The article gives a complete analysis of RNA-seq data that were collected to study the effects of pregnancy and lactation on the luminal cell lineage in the mouse mammary gland (Fu et al. 2015). The pipeline uses the Rsubread package (Liao, Smyth, and Shi 2013) for mapping reads and assigning them to genes and the edgeR package (Robinson, McCarthy, and Smyth 2010) for statistical analyses.

A journal article version of this workflow, including referee reports and responses, is available at F1000Research (Chen, Lun, and Smyth 2016).

RNA-seq analysis involves a number of steps, including read alignment, read summarization, differential expression and pathway analysis. Here we use the Subread aligner (Liao, Smyth, and Shi 2013) for mapping and featureCounts (Liao, Smyth, and Shi 2014) for assigning reads to genes. As well as being fast and efficient, these algorithms have the advantage of having native implementations as R functions in the Rsubread package. This means that the entire analysis can be conducted efficiently within the R environment.

The workflow uses edgeR’s quasi-likelihood pipeline (edgeR-quasi) for differential expression. This statistical methodology uses negative binomial generalized linear models (McCarthy, Chen, and Smyth 2012) but with F-tests instead of likelihood ratio tests (Lund et al. 2012). This method provides stricter error rate control than other negative binomial based pipelines, including the traditional edgeR pipelines (Robinson and Smyth 2008, 2007; McCarthy, Chen, and Smyth 2012) or DESeq2 (Love, Huber, and Anders 2014). The edgeR-quasi pipeline is based on similar statistical methodology to that of the QuasiSeq package (Lund et al. 2012), which has performed well in third-party comparisons (Burden, Qureshi, and Wilson 2014). Compared to QuasiSeq, the edgeR functions offer speed improvements and some additional statistical refinements (Lun, Chen, and Smyth 2016). The RNA-seq pipelines of the limma package also offer excellent error rate control (Law et al. 2014; Ritchie et al. 2015). While the limma pipelines are recommended for large-scale datasets, because of their speed and flexibility, the edgeR-quasi pipeline gives better performance in low-count situations (Lun and Smyth 2014, 2015). For the data analyzed here, the edgeR-quasi, limma-voom and limma-trend pipelines are all equally suitable and give similar results.

The analysis approach illustrated in this article can be applied to any RNA-seq study that includes some replication, but it is especially appropriate for designed experiments with multiple treatment factors and with small numbers of biological replicates. The approach assumes that RNA samples have been extracted from cells of interest under two or more treatment conditions, that RNA-seq profiling has been applied to each RNA sample and that there are independent biological replicates for at least one of the treatment conditions. The Rsubread part of the workflow takes FASTQ files of raw sequence reads as input, while the edgeR part of the pipeline takes a matrix of genewise read counts as input.

Description of the biological experiment

This workflow demonstrates a complete bioinformatics analysis of an RNA-seq study that is available from the GEO repository as series GSE60450. The RNA-seq data were collected to study the lineage of luminal cells in the mouse mammary gland and in particular how the expression profiles of the members of the lineage change upon pregnancy and lactation (Fu et al. 2015). Specifically, the study examined the expression profiles of basal stem-cell enriched cells (B) and committed luminal cells (L) in the mammary glands of virgin, pregnant and lactating mice. There are therefore six groups of RNA samples, one for each combination of cell type and mouse status. Two biological replicates were collected for each group.

This study used an Illumina Hiseq sequencer to generate about 30 million 100bp single-end reads for each sample. Subread version 1.4.4 (http://subread.sourceforge.net) was used to align the reads to the mouse mm10 genome and featureCounts was used to assign reads to Entrez Genes using RefSeq gene annotation. The FASTQ files containing the raw sequence reads were deposited to the Sequence Read Archive (SRA) repository and the read counts were deposited to GEO.

This experimental design is summarized in the table below, where the basal and luminal cell types are abbreviated with B and L respectively. The GEO and SRA identifiers for each RNA sample are also shown:

> library(RnaSeqGeneEdgeRQL)
> targetsFile <- system.file("extdata", "targets.txt",
+                            package="RnaSeqGeneEdgeRQL")
> targets <- read.delim(targetsFile, stringsAsFactors=FALSE)
> targets
               GEO        SRA CellType    Status
MCL1.DG GSM1480297 SRR1552450        B    virgin
MCL1.DH GSM1480298 SRR1552451        B    virgin
MCL1.DI GSM1480299 SRR1552452        B  pregnant
MCL1.DJ GSM1480300 SRR1552453        B  pregnant
MCL1.DK GSM1480301 SRR1552454        B lactating
MCL1.DL GSM1480302 SRR1552455        B lactating
MCL1.LA GSM1480291 SRR1552444        L    virgin
MCL1.LB GSM1480292 SRR1552445        L    virgin
MCL1.LC GSM1480293 SRR1552446        L  pregnant
MCL1.LD GSM1480294 SRR1552447        L  pregnant
MCL1.LE GSM1480295 SRR1552448        L lactating
MCL1.LF GSM1480296 SRR1552449        L lactating

The experiment can be viewed as a one-way layout with six groups. For later use, we combine the treatment factors into a single grouping factor:

> group <- paste(targets$CellType, targets$Status, sep=".")
> group <- factor(group)
> table(group)
group
B.lactating  B.pregnant    B.virgin L.lactating  L.pregnant    L.virgin 
          2           2           2           2           2           2 

Note. This study isolated carefully sorted cell populations obtained from genetically indentical mice under controlled laboratory conditions. As will be shown during the analysis below, the relatively small sample size (\(n=2\) in each group) here is justified by the low background variability and by the fact that the expression profiles of the different groups are distinctly different. Studies with more variability (for example on human subjects) or with smaller effect sizes may well require more biological replicates for reliable results.

Preliminary analysis

Downloading the read counts

Readers wishing to reproduce the analysis presented in this article can either download the matrix of read counts from GEO or recreate the read count matrix from the raw sequence counts. We will present first the analysis using the downloaded matrix of counts. At the end of this article we will present the R commands needed to recreate this matrix.

The following commands download the genewise read counts for the GEO series GSE60450. The zipped tab-delimited text file GSE60450_Lactation-GenewiseCounts.txt.gz will be downloaded to the working R directory:

> FileURL <- paste(
+   "http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE60450",
+   "format=file",
+   "file=GSE60450_Lactation-GenewiseCounts.txt.gz",
+   sep="&")
> download.file(FileURL, "GSE60450_Lactation-GenewiseCounts.txt.gz")

The counts can then be read into a data.frame in R:

> GenewiseCounts <- read.delim("GSE60450_Lactation-GenewiseCounts.txt.gz",
+                              row.names="EntrezGeneID")
> colnames(GenewiseCounts) <- substring(colnames(GenewiseCounts),1,7)
> dim(GenewiseCounts)
[1] 27179    13
> head(GenewiseCounts)
          Length MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA
497097      3634     438     300      65     237     354     287       0
100503874   3259       1       0       1       1       0       4       0
100038431   1634       0       0       0       0       0       0       0
19888       9747       1       1       0       0       0       0      10
20671       3130     106     182      82     105      43      82      16
27395       4203     309     234     337     300     290     270     560
          MCL1.LB MCL1.LC MCL1.LD MCL1.LE MCL1.LF
497097          0       0       0       0       0
100503874       0       0       0       0       0
100038431       0       0       0       0       0
19888           3      10       2       0       0
20671          25      18       8       3      10
27395         464     489     328     307     342

The row names of GenewiseCounts are the Entrez Gene Identifiers. The first column contains the length of each gene, being the total number of bases in exons and UTRs for that gene. The remaining 12 columns contain read counts and correspond to rows of targets.

The edgeR package stores data in a simple list-based data object called a DGEList. This object is easy to use as it can be manipulated like an ordinary list in R, and it can also be subsetted like a matrix. The main components of a DGEList object are a matrix of read counts, sample information in the data.frame format and optional gene annotation. We enter the counts into a DGEList object using the function DGEList in edgeR:

> library(edgeR)
> y <- DGEList(GenewiseCounts[,-1], group=group,
+              genes=GenewiseCounts[,1,drop=FALSE])
> options(digits=3)
> y$samples
              group lib.size norm.factors
MCL1.DG    B.virgin 23227641            1
MCL1.DH    B.virgin 21777891            1
MCL1.DI  B.pregnant 24100765            1
MCL1.DJ  B.pregnant 22665371            1
MCL1.DK B.lactating 21529331            1
MCL1.DL B.lactating 20015386            1
MCL1.LA    L.virgin 20392113            1
MCL1.LB    L.virgin 21708152            1
MCL1.LC  L.pregnant 22241607            1
MCL1.LD  L.pregnant 21988240            1
MCL1.LE L.lactating 24723827            1
MCL1.LF L.lactating 24657293            1

Adding gene annotation

The Entrez Gene Ids link to gene information in the NCBI database. The org.Mm.eg.db package can be used to complement the gene annotation information. Here, a column of gene symbols is added to y$genes:

> library(org.Mm.eg.db)
> y$genes$Symbol <- mapIds(org.Mm.eg.db, rownames(y),
+                          keytype="ENTREZID", column="SYMBOL")
> head(y$genes)
          Length  Symbol
497097      3634    Xkr4
100503874   3259 Gm19938
100038431   1634 Gm10568
19888       9747     Rp1
20671       3130   Sox17
27395       4203  Mrpl15

Entrez Ids that no longer have official gene symbols are dropped from the analysis. The whole DGEList object, including annotation as well as counts, can be subsetted by rows as if it was a matrix:

> y <- y[!is.na(y$genes$Symbol), ]
> dim(y)
[1] 25904    12

Design matrix

Linear modeling and differential expression analysis in edgeR requires a design matrix to be specified. The design matrix records which treatment conditions were applied to each samples, and it also defines how the experimental effects are parametrized in the linear models. The experimental design for this study can be viewed as a one-way layout and the design matrix can be constructed in a simple and intuitive way by:

> design <- model.matrix(~0+group)
> colnames(design) <- levels(group)
> design
   B.lactating B.pregnant B.virgin L.lactating L.pregnant L.virgin
1            0          0        1           0          0        0
2            0          0        1           0          0        0
3            0          1        0           0          0        0
4            0          1        0           0          0        0
5            1          0        0           0          0        0
6            1          0        0           0          0        0
7            0          0        0           0          0        1
8            0          0        0           0          0        1
9            0          0        0           0          1        0
10           0          0        0           0          1        0
11           0          0        0           1          0        0
12           0          0        0           1          0        0
attr(,"assign")
[1] 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"

This design matrix simply links each group to the samples that belong to it. Each row of the design matrix corresponds to a sample whereas each column represents a coefficient corresponding to one of the six groups.

Filtering to remove low counts

Genes that have very low counts across all the libraries should be removed prior to downstream analysis. This is justified on both biological and statistical grounds. From biological point of view, a gene must be expressed at some minimal level before it is likely to be translated into a protein or to be considered biologically important. From a statistical point of view, genes with consistently low counts are very unlikely be assessed as significantly DE because low counts do not provide enough statistical evidence for a reliable judgement to be made. Such genes can therefore be removed from the analysis without any loss of information.

As a rule of thumb, we require that a gene have a count of at least 10 or so in at least some libraries before it is considered to be expressed in the study. In practice, the filtering is actually based on count-per-million (CPM) values so as to avoid favoring genes that are expressed in larger libraries over those expressed in smaller libraries. In edgeR, the filtering can be accomplished using a function that takes into account the library sizes and the experimental design:

> keep <- filterByExpr(y, design)
> table(keep)
keep
FALSE  TRUE 
10226 15678 

The DGEList object is then subsetted to retain only the non-filtered genes:

> y <- y[keep, , keep.lib.sizes=FALSE]

The option keep.lib.sizes=FALSE causes the library sizes to be recomputed after the filtering. This is generally recommended, although the effect on the downstream analysis is usually small.

The filterByExpr function attempts to keep the maximum number of interesting genes in the analysis, but other sensible filtering criteria are also possible. For example keep <- rowSums(y$counts) > 50 is a very simple criterion that would keep genes with a total read count of more than 50. This would give similar downstream results for this dataset to the filtering actually used.

Another sensible filter rule would be to compute the average-log-CPM for each gene, and to choose a cutoff value heuristically:

> AveLogCPM <- aveLogCPM(y)
> hist(AveLogCPM)
Histogram of average log2 CPM.

Figure 1: Histogram of average log2 CPM

Again this might give similar downstream results for this data to the filtering actually used.

Whatever the filtering rule, it should be independent of the information in the targets file. It should not make any reference to which RNA libraries belong to which group, because doing so would bias the subsequent differential expression analysis.

Library size normalization

Normalization by trimmed mean of M values (TMM) (Robinson and Oshlack 2010) is performed by using the calcNormFactors function, which returns the DGEList argument with only the norm.factors changed. It calculates a set of normalization factors, one for each sample, to eliminate composition biases between libraries. The product of these factors and the library sizes defines the effective library size, which replaces the original library size in all downstream analyses.

> y <- calcNormFactors(y)
> y$samples
              group lib.size norm.factors
MCL1.DG    B.virgin 23066768        1.237
MCL1.DH    B.virgin 21608730        1.214
MCL1.DI  B.pregnant 23901234        1.126
MCL1.DJ  B.pregnant 22474773        1.071
MCL1.DK B.lactating 21370542        1.038
MCL1.DL B.lactating 19873971        1.088
MCL1.LA    L.virgin 20188333        1.370
MCL1.LB    L.virgin 21483723        1.368
MCL1.LC  L.pregnant 22053355        1.005
MCL1.LD  L.pregnant 21819338        0.923
MCL1.LE L.lactating 24630306        0.527
MCL1.LF L.lactating 24572846        0.533

The normalization factors of all the libraries multiply to unity. A normalization factor below one indicates that a small number of high count genes are monopolizing the sequencing, causing the counts for other genes to be lower than would be usual given the library size. As a result, the effective library size will be scaled down for that sample. Here we see that the luminal-lactating samples have low normalization factors. This is a sign that these samples contain a number of very highly upregulated genes.

Note. In general, we find TMM normalization to be satisfactory for almost all well-designed bulk mRNA gene expression experiments, although TMMwsp normalization may be preferable if the proportion of zero counts is unusually large. Specialized normalization methods are generally need for single-cell RNA-seq (Lun, Bach, and Marioni 2016) and we do not recommend TMM in that context. Another, less common, type of study requiring special treatment is that with global differential expression, with more than half of the genome differentially expressed between experimental conditions in the same direction (Wu et al. 2013). Global differential expression should generally be avoided in well designed experiments. When it can’t be avoided, then some normalization reference such as spike-ins needs to be built into the experiment for reliable normalization to be done (Risso et al. 2014).

Exploring differences between libraries

The RNA samples can be clustered in two dimensions using multi-dimensional scaling (MDS) plots. This is both an analysis step and a quality control step to explore the overall differences between the expression profiles of the different samples. Here we decorate the MDS plot to indicate the cell groups:

> pch <- c(0,1,2,15,16,17)
> colors <- rep(c("darkgreen", "red", "blue"), 2)
> plotMDS(y, col=colors[group], pch=pch[group])
> legend("topleft", legend=levels(group), pch=pch, col=colors, ncol=2)
MDS plot showing distances between expression profiles. This provides a type of unsupervised clustering of the samples. In this case, the first dimension separates samples by cell type while the second dimension corresponds to mouse status.

Figure 2: MDS plot showing distances between expression profiles
This provides a type of unsupervised clustering of the samples. In this case, the first dimension separates samples by cell type while the second dimension corresponds to mouse status.

In the MDS plot, the distance between each pair of samples can be interpreted as the leading log-fold change between the samples for the genes that best distinguish that pair of samples. By default, leading fold-change is defined as the root-mean-square of the largest 500 log2-fold changes between that pair of samples. The figure above shows that replicate samples from the same group cluster together while samples from different groups are well separated. In other words, differences between groups are much larger than those within groups, meaning that there are likely to be statistically significant differences between the groups. The distance between basal cells on the left and luminal cells on the right is about six units on the x-axis, corresponding to a leading fold change of about 64-fold between the two cell types. The differences between the virgin, pregnant and lactating expression profiles appear to be magnified in luminal cells compared to basal.

The expression profiles of individual samples can be explored more closely with mean-difference (MD) plots. An MD plot visualizes the library size-adjusted log-fold change between two libraries (the difference) against the average log-expression across those libraries (the mean). The following command produces an MD plot that compares sample 1 to an artificial reference library constructed from the average of all the other samples:

> plotMD(y, column=1)
> abline(h=0, col="red", lty=2, lwd=2)
Mean-difference plot of log2-expression in sample 1 versus the average log2-expression across all other samples. Each point represents a gene, and the red line indicates a log-ratio of zero. The majority of points cluster around the red line.

Figure 3: Mean-difference plot of log2-expression in sample 1 versus the average log2-expression across all other samples
Each point represents a gene, and the red line indicates a log-ratio of zero. The majority of points cluster around the red line.

The bulk of the genes are centered around the line of zero log-fold change. The diagonal lines in the lower left of the plot correspond to genes with counts of 0, 1, 2 and so on in the first sample.

It is good practice to make MD plots for all the samples as a quality check. We now look at one of the luminal-lactating samples that was observed have a low normalization factor:

> plotMD(y, column=11)
> abline(h=0, col="red", lty=2, lwd=2)
Mean-difference plot of log2-expression in sample 11 versus the average log2-expression across all other samples. The plot shows a number of genes that are both highly expressed and highly up-regulated.

Figure 4: Mean-difference plot of log2-expression in sample 11 versus the average log2-expression across all other samples
The plot shows a number of genes that are both highly expressed and highly up-regulated.

For this sample, the log-ratios show noticeable positive skew, with a number of very highly upregulated genes. In particular, there are a number of points in the upper right of the plot, corresponding to genes that are both highly expressed and highly up-regulated in this sample compared to others. These genes explain why the normalization factor for this sample is well below one. By contrast, the log-ratios for sample 1 were somewhat negatively skewed, corresponding to a normalization factor above one.

Dispersion estimation

edgeR uses the negative binomial (NB) distribution to model the read counts for each gene in each sample. The dispersion parameter of the NB distribution accounts for variability between biological replicates (McCarthy, Chen, and Smyth 2012). edgeR estimates an empirical Bayes moderated dispersion for each individual gene. It also estimates a common dispersion, which is a global dispersion estimate averaged over all genes, and a trended dispersion where the dispersion of a gene is predicted from its abundance. Dispersion estimates are most easily obtained from the estimateDisp function:

> y <- estimateDisp(y, design, robust=TRUE)

This returns a DGEList object with additional components (common.dispersion, trended.dispersion and tagwise.dispersion) added to hold the estimated dispersions. Here robust=TRUE has been used to protect the empirical Bayes estimates against the possibility of outlier genes with exceptionally large or small individual dispersions (Phipson et al. 2016).

The dispersion estimates can be visualized with plotBCV:

> plotBCV(y)
Scatterplot of the biological coefficient of variation (BCV) against the average abundance of each gene. The plot shows the square-root estimates of the common, trended and tagwise NB dispersions.

Figure 5: Scatterplot of the biological coefficient of variation (BCV) against the average abundance of each gene
The plot shows the square-root estimates of the common, trended and tagwise NB dispersions.

The vertical axis of the plotBCV plot shows square-root dispersion, also known as biological coefficient of variation (BCV) (McCarthy, Chen, and Smyth 2012). For RNA-seq studies, the NB dispersions tend to be higher for genes with very low counts. The dispersion trend tends to decrease smoothly with abundance and to asymptotic to a constant value for genes with larger counts. From our past experience, the asymptotic value for the BCV tends to be in range from 0.05 to 0.2 for genetically identical mice or cell lines, whereas somewhat larger values (\(>0.3\)) are observed for human subjects.

The NB model can be extended with quasi-likelihood (QL) methods to account for gene-specific variability from both biological and technical sources (Lund et al. 2012; Lun, Chen, and Smyth 2016). Under the QL framework, the NB dispersion trend is used to describe the overall biological variability across all genes, and gene-specific variability above and below the overall level is picked up by the QL dispersion. In the QL approach, the individual (tagwise) NB dispersions are not used.

The estimation of QL dispersions is performed using the glmQLFit function:

> fit <- glmQLFit(y, design, robust=TRUE)
> head(fit$coefficients)
       B.lactating B.pregnant B.virgin L.lactating L.pregnant L.virgin
497097      -11.13     -12.01   -11.22       -19.0     -19.03    -19.0
20671       -12.76     -12.51   -12.14       -14.5     -14.30    -14.1
27395       -11.27     -11.29   -11.52       -10.6     -10.86    -10.9
18777       -10.15     -10.20   -10.76       -10.1     -10.38    -10.4
21399        -9.89      -9.73    -9.78       -10.2      -9.96    -10.0
58175       -16.15     -14.85   -15.98       -13.3     -12.28    -12.0

This returns a DGEGLM object with the estimated values of the GLM coefficients for each gene. It also contains a number of empirical Bayes (EB) statistics including the QL dispersion trend, the squeezed QL dispersion estimates and the prior degrees of freedom (df). The QL dispersions can be visualized by plotQLDisp:

> plotQLDisp(fit)