Pathway Expression Profiles (PEPs) are based on the expression of all the pathways (or generic gene sets) belonging to a collection under a given experimental condition, as opposed to individual genes. gep2pep
supports the conversion of gene expression profiles (GEPs) to PEPs and performs enrichment analysis of both pathways and conditions.
gep2pep
creates a local repository of gene sets, which can also be imported from the MSigDB database [1]. The local repository is in the repo
format. When a GEP, defined as a ranked list of genes, is passed to buildPEPs
, the stored database of pathways is used to convert the GEP to a PEP and permanently store the latter.
One type of analysis that can be performed on PEPs and that is directly supported by gep2pep
is the Drug-Set Enrichment Analysis (DSEA, see reference below). It finds pathways that are consistently dysregulated by a set of drugs, as opposed to a background of other drugs. Of course PEPs may refer to non-pharmacological conditions (genetic perturbations, disease states, etc.) for analogous analyses. See the CondSEA
function.
A complementary approach is that of finding conditions under which a set of pathways is consistently UP- or DOWN-regulated. This is the pathway-based version of the Gene Set Enrichment Analysis (GSEA). As an application example, this approach can be used to find drugs mimicking the dysregulation of a gene by looking for drugs dysregulating the pathways involving the gene (this has been published as the gene2drug
tool [3]). See the PathSEA
function.
This vignette uses toy data, as real data can be computationally expensive. Connectivity Map data [4] (drug induced gene expression profiles) pre-converted to PEPs can be downloaded from http://dsea.tigem.it in the gep2pep
format. At the end of this vignette, a precomputed example is reported in which that data is used.
In order to use the package, it must be loaded as follows (the GSEABase package will also be used in this vignette to access gene set data):
The MSigDB is a curated database of gene set collections. The entire database can be downloaded as a single XML file and used by gep2pep
. The following commented code would import the database once downloaded:
However, for this vignette a small excerpt will be used. Gep2pep uses a slight variation of the BroadCollection
type used by MSigDB, named CategorizedCollection
. Thus a conversion is necessary:
The database is in GSEABase::GeneSetCollection
format and includes 30 pathways, each of which is included in one of 3 different collections. Following MSigDB conventions, each collection is identified by a “category” and “subCategory” fields. But the CategorizedCollection type allows them to be arbitrary strings, as opposed to MSigDB categories. This allows for the creation of custom collection with categories. gep2pep
puts together into a single collection identifier using makeCollectionIDs
.
colltypes <- sapply(db, collectionType)
cats <- sapply(colltypes, attr, "category")
subcats <- sapply(colltypes, attr, "subCategory")
print(cats)
## [1] "c3" "c3" "c3" "c3" "c3" "c3" "c3" "c3" "c3" "c3" "c3" "c3" "c3" "c3"
## [15] "c3" "c3" "c3" "c3" "c3" "c3" "c4" "c4" "c4" "c4" "c4" "c4" "c4" "c4"
## [29] "c4" "c4"
print(subcats)
## [1] "TFT" "TFT" "TFT" "TFT" "TFT" "TFT" "TFT" "TFT" "TFT" "TFT" "MIR"
## [12] "MIR" "MIR" "MIR" "MIR" "MIR" "MIR" "MIR" "MIR" "MIR" "CGN" "CGN"
## [23] "CGN" "CGN" "CGN" "CGN" "CGN" "CGN" "CGN" "CGN"
makeCollectionIDs(db)
## [1] "c3_TFT" "c3_TFT" "c3_TFT" "c3_TFT" "c3_TFT" "c3_TFT" "c3_TFT"
## [8] "c3_TFT" "c3_TFT" "c3_TFT" "c3_MIR" "c3_MIR" "c3_MIR" "c3_MIR"
## [15] "c3_MIR" "c3_MIR" "c3_MIR" "c3_MIR" "c3_MIR" "c3_MIR" "c4_CGN"
## [22] "c4_CGN" "c4_CGN" "c4_CGN" "c4_CGN" "c4_CGN" "c4_CGN" "c4_CGN"
## [29] "c4_CGN" "c4_CGN"
In order to build a local gep2pep
repository containing pathway data, createRepository
is used:
repoRoot <- file.path(tempdir(), "gep2pep_data")
rp <- createRepository(repoRoot, db)
## Repo root created.
## Repo created.
## [21:40:22] Storing pathway data for collection: c3_TFT
## [21:40:22] Storing pathway data for collection: c3_MIR
## [21:40:22] Storing pathway data for collection: c4_CGN
The repository is in repo
format (see later in this document how to access the repository directly). However, knowing repo
is not necessary to use gep2pep
. The following lists the contents of the repository, loads the GeneSetCollection
object containing all the TFT database gene sets and finally shows the description of the pathway “E47_01”:
rp
## ID Dims Size
## gep2pep repository 2 20.51 kB
## c3_TFT_sets 10 18.83 kB
## c3_MIR_sets 10 17.87 kB
## c4_CGN_sets 10 7.42 kB
## conditions 1 40 B
TFTsets <- rp$get("c3_TFT_sets")
TFTsets
## GeneSetCollection
## names: AAANWWTGC_UNKNOWN, AAAYRNCTG_UNKNOWN, ..., SP1_01 (10 total)
## unique identifiers: MEF2C, ATP1B1, ..., MEX3B (3156 total)
## types in collection:
## geneIdType: SymbolIdentifier (1 total)
## collectionType: CategorizedCollection (1 total)
description(TFTsets[["E47_01"]])
## [1] "Genes having at least one occurence of the transcription factor binding site V$E47_01 (v7.4 TRANSFAC) in the regions spanning up to 4 kb around their transcription starting sites."
rp$get
is a repo
command, see later in this vignette.
Pathway Expression Profiles (PEPs) are created from Gene Expression Profiles (GEPs) using pathway information from the repository. GEPs must be provided as a matrix with rows corresponding to genes and columns corresponding to conditions (conditions). Genes and conditions must be specified through row and column names respectively. The values must be ranks: for each condition, the genes must be ranked from that being most UP-regulated (rank 1) to that being most DOWN-regulated (rank equal to the number of rows of the matrix).
One well known database that can be obtained in this format is for example the Connectivty Map. A small excerpt (after further processing) is included with the gep2pep
. The excerpt must be considered as a dummy example, as it only includes 500 genes for 5 conditions. It can be loaded as follows:
geps <- loadSampleGEP()
dim(geps)
## [1] 485 5
geps[1:5, 1:3]
## (+)_chelidonine (+)_isoprenaline (+/_)_catechin
## AKT3 87 115 404
## MED6 347 398 33
## NR2E3 373 119 440
## NAALAD2 227 317 377
## CDKN2B-AS1 216 79 21
The GEPs can be converted to PEPs using the buildPEPs
function. They are stored as repository items by the names “category_subcategory”.
buildPEPs(rp, geps)
## [21:40:23] Working on collection: c3_TFT (1/3)
## ===========================================================================
## [21:40:23] Storing pathway expression profiles
## [21:40:23] Storing condition information...
## [21:40:23] Done.
## [21:40:23] Working on collection: c3_MIR (2/3)
## ===========================================================================
## [21:40:23] Storing pathway expression profiles
## [21:40:23] Storing condition information...
## [21:40:23] Done.
## [21:40:23] Working on collection: c4_CGN (3/3)
## ===========================================================================
## [21:40:23] Storing pathway expression profiles
## [21:40:23] Storing condition information...
## [21:40:23] Done.
Each PEP is composed of an Enrichment Score (ES) – p-value (PV) pair associated to each pathway. ESs and PVs are stored in two separated matrices. For each condition, the p-value reports wether a pathway is significantly dysregulated and the sign of the corresponding ES indicates the direction (UP- or DOWN-regulation).
loadESmatrix(rp, "c3_TFT")[1:3, 1:3]
## (+)_chelidonine (+)_isoprenaline (+/_)_catechin
## M3128 -0.3732815 0.4763897 0.2029289
## M11607 0.3762838 0.2299253 0.2882820
## M12599 0.4585062 0.2365145 -0.2372061
loadPVmatrix(rp, "c3_TFT")[1:3, 1:3]
## (+)_chelidonine (+)_isoprenaline (+/_)_catechin
## M3128 0.2267728 0.05876085 0.8885505
## M11607 0.1248594 0.65717027 0.3809141
## M12599 0.4407217 0.98408373 0.9836051
Suppose the stored PEPs correspond to pharmacological perturbations. Then gep2pep
can perform Drug-Set Enrichment Analysis (DSEA, see Napolitano et al., 2016, Bioinformatics). It finds pathways that are consistently dysregulated by a set of drugs, as opposed to a background of other drugs. Of course PEPs may refer to non-pharmacological conditions (genetic perturbations, disease states, etc.) for analogous analyses (Condition-Set Enrichment Analysis, CondSEA). Given a set pgset
of drugs of interest, CondSEA (which in this case is a DSEA) is performed as follows:
pgset <- c("(+)_chelidonine", "(+/_)_catechin")
psea <- CondSEA(rp, pgset)
## [21:40:23] Working on collection: c3_TFT
## [21:40:23] Common conditions removed from bgset
## [21:40:23] Row-ranking collection
## [21:40:23] Computing enrichments
## [21:40:23] done
## [21:40:23] Working on collection: c3_MIR
## [21:40:23] Row-ranking collection
## [21:40:23] Computing enrichments
## [21:40:23] done
## [21:40:23] Working on collection: c4_CGN
## [21:40:23] Row-ranking collection
## [21:40:23] Computing enrichments
## [21:40:23] done
The result is a list of of 2 elements, named “CondSEA” and “details”, the most important of which is the former. Per-collection results can be accessed as follows:
getResults(psea, "c3_TFT")
## ES PV
## M5067 1.0000000 0.2
## M2501 1.0000000 0.2
## M3128 -0.6666667 0.6
## M11607 0.6666667 0.6
## M16694 -0.6666667 0.6
## M14463 0.6666667 0.6
## M14686 -0.6666667 0.6
## M12599 0.5000000 0.9
## M10817 -0.5000000 0.9
## M4831 0.3333333 1.0
In this dummy example the statistical background is made of only 3 GEPs (we added 5 in total), thus, as expected, there are no significant p-values. For the c3_MIR collection, the pathway most UP-regulated by the chosen set of two drugs is M5012, while the most DOWN-regulated is M18759. They are respectively described as:
sets <- rp$get("c3_MIR_sets")
wM5012 <- which(sapply(sets, setIdentifier)=="M5012")
wM18759 <- which(sapply(sets, setIdentifier)=="M18759")
description(sets[[wM5012]])
## [1] "Genes having at least one occurrence of the motif TGCACTG in their 3' untranslated region. The motif represents putative target (i.e., seed match) of human mature miRNAs hsa-miR-148a, hsa-miR-152, hsa-miR-148b (v7.1 miRBase)."
description(sets[[wM18759]])
## [1] "Genes having at least one occurrence of the motif GCACTTT in their 3' untranslated region. The motif represents putative target (i.e., seed match) of human mature miRNAs hsa-miR-17-5p, hsa-miR-20a, hsa-miR-106a, hsa-miR-106b, hsa-miR-20b, hsa-miR-519d (v7.1 miRBase)."
The analysis can be exported in XLS format as follows:
Performing CondSEA
using Pathway Expression Profiles derived from drug-induced gene expression profiles yields Drug Set Enrichment Analysis (DSEA [2]).
A complementary approach to CondSEA is Pathway-Set Enrichment Analysis (PathSEA). PathSEA searches for conditions that consistently dysregulate a set of pathways. It can be seen as a pathway-based version of the popular Gene Set Enrichment Analysis (GSEA). The PathSEA is run independently in each pathway collection.
pathways <- c("M11607", "M10817", "M16694", ## from c3_TFT
"M19723", "M5038", "M13419", "M1094") ## from c4_CGN
w <- sapply(db, setIdentifier) %in% pathways
subdb <- db[w]
psea <- PathSEA(rp, subdb)
## [21:40:23] Working on collection: c3_TFT
## [21:40:23] Common pathway sets removed from bgset
## [21:40:23] Column-ranking collection
## [21:40:23] Computing enrichments
## [21:40:23] done
## [21:40:23] Working on collection: c4_CGN
## [21:40:23] Common pathway sets removed from bgset
## [21:40:23] Column-ranking collection
## [21:40:23] Computing enrichments
## [21:40:23] done
getResults(psea, "c3_TFT")
## ES PV
## (_)_mk_801 0.7142857 0.1666667
## (_)_atenolol 0.7142857 0.1666667
## (+)_isoprenaline 0.5714286 0.4000000
## (+/_)_catechin 0.5714286 0.4000000
## (+)_chelidonine 0.3333333 0.9333333
PathSEA results are analogous to those of CondSEA, but condition-wise. A set of pathways con also be obtained starting from a gene of interest, for example:
pathways <- gene2pathways(rp, "FAM126A")
pathways
## GeneSetCollection
## names: MYOD_01, E47_01, CMYB_01, AAAYWAACM_HFH4_01 (4 total)
## unique identifiers: KCNE1L, FAM126A, ..., MGST1 (2362 total)
## types in collection:
## geneIdType: SymbolIdentifier (1 total)
## collectionType: CategorizedCollection (1 total)
Using a gene to obtain the pathways and performing PathSEA
with drug-induced Pathway Expression Profiles yields “gene2drug” analysis, see the following reference:
Precomputed Pathway Expression Profiles of the Connectivity Map data in the gep2pep format can be downloaded, unpacked and opened as follows:
download.file("http://dsea.tigem.it/data/Cmap_MSigDB_v6.1_PEPs.tar.gz",
"Cmap_MSigDB_v6.1_PEPs.tar.gz")
untar("Cmap_MSigDB_v6.1_PEPs.tar.gz")
rpBig <- openRepository("Cmap_MSigDB_v6.1_PEPs")
Using these data, two kinds of analysis can be performed:
Drug Set Enrichment Analysis, which looks for commond pathways shared by a set of drugs.
Gene2drug analysis, which looks for drugs dysregulating a gene of interest.
The analyses below are not built at runtime with this document and could become outdated.
Drug Set Enrichment Analysis for a set of HDAC inhibitors using the Gene Ontology collections can be performed as follows:
csea <- CondSEA(rpBig, c("scriptaid", "trichostatin_a", "valproic_acid",
"vorinostat", "hc_toxin", "bufexamac"),
collections=c("C5_BP", "C5_MF", "C5_CC"))
## [16:41:40] Working on collection: C5_BP
## [16:41:42] Common conditions removed from bgset
## [16:41:42] Row-ranking collection
## [16:41:48] Computing enrichments
## [16:41:58] done
## [16:41:58] Working on collection: C5_MF
## [16:41:58] Row-ranking collection
## [16:42:00] Computing enrichments
## [16:42:02] done
## [16:42:02] Working on collection: C5_CC
## [16:42:02] Row-ranking collection
## [16:42:03] Computing enrichments
## [16:42:04] done
The following code retrieves information about the top 10 pathways ranked by CondSEA in GO-MF.
library(GSEABase)
setids <- sapply(loadCollection(rpBig, "C5_MF"), setIdentifier)
MFresults <- getResults(csea, "C5_MF")
w <- match(rownames(MFresults)[1:10], setids)
top10 <- loadCollection(rpBig, "C5_MF")[w]
sapply(top10, setName)
## [1] "GO_TRANSCRIPTION_FACTOR_ACTIVITY_PROTEIN_BINDING"
## [2] "GO_TRANSCRIPTION_COACTIVATOR_ACTIVITY"
## [3] "GO_PHOSPHATIDYLCHOLINE_1_ACYLHYDROLASE_ACTIVITY"
## [4] "GO_RETINOIC_ACID_RECEPTOR_BINDING"
## [5] "GO_PRE_MRNA_BINDING"
## [6] "GO_N_ACETYLTRANSFERASE_ACTIVITY"
## [7] "GO_CYTOSKELETAL_PROTEIN_BINDING"
## [8] "GO_PEPTIDE_N_ACETYLTRANSFERASE_ACTIVITY"
## [9] "GO_ACETYLTRANSFERASE_ACTIVITY"
## [10] "GO_HYDROGEN_EXPORTING_ATPASE_ACTIVITY"
Note that 2 main effects of HDAC inhibitors have been correctly identified: regulation of transcription, and alteration of the acetylation/deacetylation homeostasis. The full analysis can be exported to the Excel format with:
A Gene2drug analysis can be performed starting by a gene of interest, for example the TFEB gene. Pathways including the gene are found as follows:
The following code runs the PathSEA analysis on the pathways involving TFEB. Also in this case the analysis is performed on Gene Ontology collections. Note that a warning is thrown as the GO-CC category has no annotation for TFEB (this is ok).
psea <- PathSEA(rpBig, pws, collections=c("C5_BP", "C5_MF", "C5_CC"))
## Warning: [17:17:13] There is at least one selected collections for
## which no pathway has been provided
## [17:17:13] Removing pathways not in specified collections
## [17:17:13] Working on collection: C5_BP
## [17:17:13] Common pathway sets removed from bgset
## [17:17:15] Column-ranking collection
## [17:17:22] Computing enrichments
## [17:17:29] done
## [17:17:29] Working on collection: C5_MF
## [17:17:29] Common pathway sets removed from bgset
## [17:17:29] Column-ranking collection
## [17:17:30] Computing enrichments
## [17:17:32] done
Thus the top 10 drugs causing (or mimicking) TFEB upregulation are:
getResults(psea, "C5_BP")[1:10,]
## ES PV
## loperamide 0.7324720 1.504075e-11
## proadifen 0.7278256 2.079448e-11
## hydroquinine 0.7220082 3.110434e-11
## bepridil 0.6904276 2.616027e-10
## clomipramine 0.6891810 2.839879e-10
## alexidine 0.6741085 7.574142e-10
## digitoxigenin 0.6737685 7.741670e-10
## lanatoside_c 0.6651556 1.342553e-09
## helveticoside 0.6642112 1.425479e-09
## ouabain 0.6631157 1.527949e-09
Note that the top drug, loperamide, has been demonstrated to induce TFEB translocation at very low concentrations [3]. As before, the full analysis can be exported to the Excel format with:
gep2pep
users don’t need to understand how to interact with a gep2pep
repository, however it can be useful in some cases. gep2pep
repositories are in the repo
format (see the repo
package), so they can be accessed as any other repo
repository. However item tags should not be changed, as they are used by gep2pep
to identify data types. Each gep2pep
repository always contains a special project item including repository and session information, which can be shown as follows:
rp$info("gep2pep repository")
## Project name: gep2pep repository
## Description: This repository contains pathway information and possibly Pathway Expression Profiles created with the gep2pep package.
## Resources: c3_TFT_sets
## c3_MIR_sets
## c4_CGN_sets
## conditions
## c3_TFT
## c3_MIR
## c4_CGN
## Platform: x86_64-pc-linux-gnu (64-bit)
## OS: Ubuntu 16.04.4 LTS
## R version: 3.5.0 (2018-04-23)
## Packages: GSEABase 1.42.0
## graph 1.58.0
## annotate 1.58.0
## XML 3.98.1.11
## AnnotationDbi 1.42.0
## IRanges 2.14.0
## S4Vectors 0.18.0
## Biobase 2.40.0
## BiocGenerics 0.26.0
## gep2pep 1.0.0
## knitr 1.20
Poject name and description can be provided when creating the repository (createRepository
), or edited with rp$set("gep2pep repository", newname="my project name", description="my project description")
. The repository will also contain an item for each pathway collection, and possibly an item for each corresponding PEP collection, as in this example:
rp
## ID Dims Size
## gep2pep repository 2 20.51 kB
## c3_TFT_sets 10 18.83 kB
## c3_MIR_sets 10 17.87 kB
## c4_CGN_sets 10 7.42 kB
## conditions 4 162 B
## c3_TFT 2 1.04 kB
## c3_MIR 2 1.06 kB
## c4_CGN 2 1.05 kB
In order to look at the space that the repository is using, the following command can be used:
put
, set
and other repo
commands can be used to alter repository contents directly, however this could leave the repository in an inconsistent state. The following code checks a repository for possible problems:
checkRepository(rp)
## [21:40:23] Checking repository consistency...
## Checking: gep2pep repository... ok.
## Checking: c3_TFT_sets... ok.
## Checking: c3_MIR_sets... ok.
## Checking: c4_CGN_sets... ok.
## Checking: conditions... ok.
## Checking: c3_TFT... ok.
## Checking: c3_MIR... ok.
## Checking: c4_CGN... ok.
##
## Checking for extraneous files in repo root... ok.
##
## [21:40:23] Checing for gep2pep data consistency...
##
## [21:40:23] Checking conditions list...
## [21:40:23] Ok.
## [21:40:23] Checking PEPs...
## [21:40:23] Checking collection: c3_TFT...
## [21:40:23] ok.
## [21:40:23] Checking collection: c3_MIR...
## [21:40:23] ok.
## [21:40:23] Checking collection: c4_CGN...
## [21:40:23] ok.
## [21:40:23] Summary of common conditions across collections:
## c3_TFT c3_MIR c4_CGN
## c3_TFT 5 5 5
## c3_MIR 5 5 5
## c4_CGN 5 5 5
The last check (summary of commond conditions) ensures that the same conditions have been computed for all the pathway collections, which is however not mandatory.
Additional methodological help can be found at:
[1] Subramanian A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102, 15545-15550 (2005).
[2] Napolitano F. et al, Drug-set enrichment analysis: a novel tool to investigate drug mode of action. Bioinformatics 32, 235-241 (2016).
[3] Napolitano F. et al, gene2drug: a Computational Tool for Pathway-based Rational Drug Repositioning, bioRxiv (2017) 192005; doi: https://doi.org/10.1101/192005
[4] Lamb, J. et al. The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science 313, 1929-1935 (2006).
Built with gep2pep
version 1.0.0