Report issues on https://github.com/lgeistlinger/EnrichmentBrowser/issues

1 Introduction

The package EnrichmentBrowser implements essential functionality for the enrichment analysis of gene expression data. The analysis combines the advantages of set-based and network-based enrichment analysis to derive high-confidence gene sets and biological pathways that are differentially regulated in the expression data under investigation. Besides, the package facilitates the visualization and exploration of such sets and pathways. The following instructions will guide you through an end-to-end expression data analysis workflow including:

  1. Preparing the data

  2. Preprocessing of the data

  3. Differential expression (DE) analysis

  4. Defining gene sets of interest

  5. Executing individual enrichment methods

  6. Combining the results of different methods

  7. Visualize and explore the results

All of these steps are modular, i.e. each step can be executed individually and fine-tuned with several parameters. In case you are interested in a particular step, you can directly move on to the respective section. For example, if you have differential expression already calculated for each gene, and you are now interested in whether certain gene functions are enriched for differential expression, section 7.2 would be the one you should go for. The last section 9 also demonstrates how to wrap the whole workflow into a single function, making use of suitably chosen defaults.

2 Reading expression data from file

Typically, the expression data is not already available in R but rather has to be read in from a file. This can be done using the function readSE, which reads the expression data (exprs) along with the phenotype data (colData) and feature data (rowData) into a SummarizedExperiment.

library(EnrichmentBrowser)
data.dir <- system.file("extdata", package = "EnrichmentBrowser")
exprs.file <- file.path(data.dir, "exprs.tab")
cdat.file <- file.path(data.dir, "colData.tab")
rdat.file <- file.path(data.dir, "rowData.tab")
se <- readSE(exprs.file, cdat.file, rdat.file)

The man pages provide details on file format and the SummarizedExperiment data structure.

?readSE 
?SummarizedExperiment

Note: Previous versions of the EnrichmentBrowser used the ExpressionSet data structure. The migration to SummarizedExperiment in the current release of the EnrichmentBrowser is done to reflect recent developments in Bioconductor, which discourages the use of ExpressionSet in favor of SummarizedExperiment. The major reasons are the compatibility of SummarizedExperiment with operations on genomic regions as well as efficient handling of big data.

To enable a smooth transition, all functions of the EnrichmentBrowser are still accepting also an ExpressionSet as input, but are consistently returning a SummarizedExperiment as output.

Furthermore, users can always coerce the SummarizedExperiment to ExpressionSet via

eset <- as(se, "ExpressionSet")

and vice versa

se <- as(eset, "SummarizedExperiment")

3 Types of expression data

The two major data types processed by the EnrichmentBrowser are microarray (intensity measurements) and RNA-seq (read counts) data.

Although RNA-seq has become the de facto standard for transcriptomic profiling, it is important to know that many methods for differential expression and gene set enrichment analysis have been originally developed for microarray data.

However, differences in data distribution assumptions (microarray: quasi-normal, RNA-seq: negative binomial) made adaptations in differential expression analysis and, to some extent, also in gene set enrichment analysis if necessary.

Thus, we consider two example data sets – a microarray and an RNA-seq data set, and discuss similarities and differences between the respective analysis steps.

3.1 Microarray data

To demonstrate the package’s functionality for microarray data, we consider expression measurements of patients with acute lymphoblastic leukemia [1]. A frequent chromosomal defect found among these patients is a translocation, in which parts of chromosome 9 and 22 swap places. This results in the oncogenic fusion gene BCR/ABL created by positioning the ABL1 gene on chromosome 9 to a part of the BCR gene on chromosome 22.

We load the ALL dataset

library(ALL) 
data(ALL)

and select B-cell ALL patients with and without the BCR/ABL fusion as described previously [2].

ind.bs <- grep("^B", ALL$BT)
ind.mut <- which(ALL$mol.biol %in% c("BCR/ABL", "NEG"))
sset <- intersect(ind.bs, ind.mut)
all.eset <- ALL[, sset]

We can now access the expression values, which are intensity measurements on a log-scale for 12,625 probes (rows) across 79 patients (columns).

dim(all.eset) 
#> Features  Samples 
#>    12625       79
exprs(all.eset)[1:4,1:4]
#>              01005    01010    03002    04007
#> 1000_at   7.597323 7.479445 7.567593 7.905312
#> 1001_at   5.046194 4.932537 4.799294 4.844565
#> 1002_f_at 3.900466 4.208155 3.886169 3.416923
#> 1003_s_at 5.903856 6.169024 5.860459 5.687997

As we often have more than one probe per gene, we summarize the gene expression values as the average of the corresponding probe values.

allSE <- probe2gene(all.eset)
head(rownames(allSE))
#> [1] "5595" "7075" "1557" "643"  "1843" "4319"

Note, that the mapping from the probe to gene is done automatically as long as you have the corresponding annotation package, here the hgu95av2.db package, installed. Otherwise, the mapping can be manually defined in the rowData slot.

rowData(se)
#> DataFrame with 1000 rows and 1 column
#>          ENTREZID
#>       <character>
#> 3075         3075
#> 572           572
#> 4267         4267
#> 26             26
#> 51384       51384
#> ...           ...
#> 5295         5295
#> 2966         2966
#> 9140         9140
#> 5558         5558
#> 1956         1956

3.2 RNA-seq data

To demonstrate the functionality of the package for RNA-seq data, we consider transcriptome profiles of four primary human airway smooth muscle cell lines in two conditions: control and treatment with dexamethasone [3].

We load the airway dataset

library(airway) 
data(airway)

For further analysis, we remove genes with very low read counts and measurements that are not mapped to an ENSEMBL gene ID.

airSE <- airway[grep("^ENSG", rownames(airway)),] 
airSE <- airSE[rowSums(assay(airSE)) > 4,]
dim(airSE) 
#> [1] 25133     8
assay(airSE)[1:4,1:4]
#>                 SRR1039508 SRR1039509 SRR1039512 SRR1039513
#> ENSG00000000003        679        448        873        408
#> ENSG00000000419        467        515        621        365
#> ENSG00000000457        260        211        263        164
#> ENSG00000000460         60         55         40         35

4 Normalization

Normalization of high-throughput expression data is essential to make results within and between experiments comparable. Microarray (intensity measurements) and RNA-seq (read counts) data typically show distinct features that need to be normalized for. The function normalize wraps commonly used functionality from limma for microarray normalization and from EDASeq for RNA-seq normalization. For specific needs that deviate from these standard normalizations, the user should always refer to more specific functions/packages.

Microarray data is expected to be single-channel. For two-color arrays, it is expected that normalization within arrays has been already carried out, e.g. using from normalizeWithinArrays from limma.

A default quantile normalization based on normalizeBetweenArrays from limma can be carried out via

allSE <- normalize(allSE, norm.method = "quantile")
par(mfrow=c(1,2))
boxplot(assay(allSE, "raw")) 
boxplot(assay(allSE, "norm"))