1 Introduction

Proteins at the cell surface connect intracellular and extracellular signaling networks and largely determine a cell’s capacity to communicate and interact with its environment.

Importantly, variations in transcriptomic profiles are often observed between healthy and diseased cells, presenting distinct sets of cell-surface proteins. Indeed, cell surface proteins i) may act as biomarkers for the detection of diseased cells in tissues or body fluids and ii) are the most prevalent target of human drugs: 66% of approved human drugs listed in the DrugBank database target a cell-surface protein. The investigation of the cell surfaceome therefore could provide new possibilities for diagnosis, prognosis, treatment development, and therapy response evaluation.

When a study aims to find new biomarkers, the small number of samples often limits the ability to obtain reliable results. However, as sequencing costs continue to decrease, several follow-up experiments will likely be conducted to re-address the same biological question, suggesting a need for methods able to jointly analyze data from multiple studies.

SurfR provides a solution to these issues, generating a list of ranked surface protein-coding differentially-expressed genes starting from the raw count matrix of your own RNA-seq experiment or from bulk transcriptomic data automatically retrieved from public databases such as GEO and TCGA. GEO queries are based on the ArchS4 pipeline. TCGA repository is interrogated through TCGAbiolinks.

SurfR also offers the opportunity to increase the sample size of a cohort by integrating related datasets, therefore enhancing the power to detect differentially expressed genes of interest. Meta-analysis can be performed through metaRNASeq, taking into account inter-study variability that may arise from technical differences among studies (e.g., sample preparation, library protocols, batch effects) as well as additional biological variability.

Gene ontology (GO) and pathway annotation can also be performed within SurfR to gain further insights about surface protein candidates.

Finally, SurfR includes functions to visualize DEG and enrichment results, aiding in the interpretation of findings.

2 Installation

Install the package from Bioconductor or GitHub, ensuring correct Bioconductor dependencies.

if (!"BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager", repos = "https://cran.r-project.org")

When the package is available on Bioconductor, use

BiocManager::install("SurfR")

Use the pre-release or devel version from GitHub using devtools with

# install.packages("devtools")
devtools::install_github("auroramaurizio/SurfR", build_vignettes = TRUE)

3 Quick Start

The basic idea behind SurfR has been to create a complete framework to detect surface protein coding genes within your data, or within public datasets. This framework facilitates the simultaneous analysis and comparison of multiple studies, easily revealing functional consensus and differences among distinct conditions. To begin, let’s ask SurfR to detect surface protein coding genes among a vector of input genes. Gene ID can be provided as gene_name, ensembl, entrez or uniProt_name.

The protein classification is based on a recently developed surfaceome predictor, called SURFY (Bausch-Fluck et al. 2018), based on machine learning.

library(SurfR)
#> 

GeneNames <- c("CIITA", "EPCAM", "CD24", "CDCP1", "LYVE1")
SurfaceProteins_df <- Gene2SProtein(GeneNames, 
                                    input_type = "gene_name", 
                                    output_tsv = FALSE,
                                    Surfy_version = "new")
#> 4 out of 5 genes have a matching surface protein
#The output dataframe contains several information retrieved from Surfy.
colnames(SurfaceProteins_df)
#>  [1] "UniProt.name"                                 
#>  [2] "UniProt.accession"                            
#>  [3] "UniProt.description"                          
#>  [4] "UniProt.gene"                                 
#>  [5] "Surfaceome.Label"                             
#>  [6] "Surfaceome.Label.Source"                      
#>  [7] "Comment"                                      
#>  [8] "length"                                       
#>  [9] "TM.domains"                                   
#> [10] "signalpeptide"                                
#> [11] "topology"                                     
#> [12] "topology.source"                              
#> [13] "MachineLearning.trainingset"                  
#> [14] "MachineLearning.score"                        
#> [15] "MachineLearning.FPR.class.(1=1%,.2=5%,.3=15%)"
#> [16] "Ensembl.gene"                                 
#> [17] "Ensembl.protein"                              
#> [18] "CD.number"                                    
#> [19] "Membranome.Almen.main-class"                  
#> [20] "Membranome.Almen.sub-class"                   
#> [21] "nxst.motifs"                                  
#> [22] "noncyt..nxst.count"                           
#> [23] "peps.with.accessible.noncyt..nxst"            
#> [24] "noncyt..Trp.count"                            
#> [25] "peps.with.accessible.noncyt..Trp"             
#> [26] "noncyt..Tyr.count"                            
#> [27] "peps.with.accessible.noncyt..Tyr"             
#> [28] "glycomineN.sites"                             
#> [29] "glycomineO.sites"                             
#> [30] "glycomineC.sites"                             
#> [31] "CSPA.category"                                
#> [32] "CSPA.peptide.count"                           
#> [33] "CSPA.peptides"                                
#> [34] "CSPA.N115.sites"                              
#> [35] "CSPA.id"                                      
#> [36] "UniProt.subcellular"                          
#> [37] "UniProt.keywords"                             
#> [38] "UniProt.uniref"                               
#> [39] "COMPARTMENTS.link"                            
#> [40] "COMPARTMENTS.benchmark.pos"                   
#> [41] "COMPARTMENTS.benchmark.neg"                   
#> [42] "HPA.antibody"                                 
#> [43] "DrugBank.approved.drug.IDs"                   
#> [44] "GeneID"                                       
#> [45] "entrezID"
#These are the 5 surface protein coding genes detected by SurfR.
SurfaceProteins_df$UniProt.gene
#> [1] "EPCAM" "CD24"  "CDCP1" "LYVE1"

4 Tutorial

4.1 Start from your own data

Although SurfR provides many functions to retrieve public data you can always start from your own dataset.

Here we are going to simulate a small bulkRNA dataset with the R package SPsimSeq (Assefa, Vandesompele, and Thas 2020) starting from a subset of Zhang RNA-seq data (Zhang et al. 2015), adding 20% of Differentially Expressed genes (pDE = 0.2).

You can than decide to stick to it or combine it with other datasets (public or private).

library(SPsimSeq)

data("zhang.data.sub")
zhang.counts <- zhang.data.sub$counts
MYCN.status  <- zhang.data.sub$MYCN.status

# Simulation of bulk RNA data
sim.data.bulk <- SPsimSeq(n.sim = 1, 
                          s.data = zhang.counts,
                          group = MYCN.status, 
                          n.genes = 1000, 
                          batch.config = 1,
                          group.config = c(0.5, 0.5), 
                          tot.samples = ncol(zhang.counts),
                          pDE = 0.2, 
                          lfc.thrld = 0.5, 
                          result.format = "list",
                          return.details = TRUE)

sim.data.bulk1 <- sim.data.bulk$sim.data.list[[1]]

countMatrix <- sim.data.bulk1$counts
row.names(countMatrix) <- row.names(zhang.counts)
metadata <- sim.data.bulk1$colData
metadata$Group <- as.factor(metadata$Group)

A fundamental task in the analysis of count data from RNA-seq is the detection of differentially expressed genes. For this task we rely on the package DESeq2 (Love, Huber, and Anders 2014), starting from counts data. The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene.

We use the built-in SurfR DGE function, to perform the differential expression analysis of the simulated dataset.

Good Differential Expressed surface protein are the ones which are strongly expressed in one condition and almost not expressed in the other. To help detecting those candidates, the output data.frame of the DGE function contains information on the average expression in the two group (see Mean_CPM_T and Mean_CPM_C columns).

library(SurfR)

df_zhang <- DGE(expression = countMatrix,
                metadata = metadata,
                Nreplica = 50,
                design = "~Group",
                condition = "Group",
                alpha = 0.05,
                TEST = "1", CTRL =  "0", 
                output_tsv = FALSE)

head(df_zhang)

Once DEGS have been detected, we may want to isolate Surface protein-coding genes.

The protein classification is based on a recently developed surfaceome predictor, called SURFY, based on machine learning.

We use the built-in SurfR Gene2SProtein function to identify Surface protein-coding genes (SP).

# remove NA values
df_zhang <- df_zhang[!is.na(df_zhang$padj), ]

# all fdr
fdr_GeneID <- df_zhang[df_zhang$padj < 0.05, "GeneID"]
SP <- Gene2SProtein(genes = fdr_GeneID, input_type = "gene_name")

# upregulated fdr
fdrUP_GeneID <- df_zhang[df_zhang$padj < 0.05 & df_zhang$log2FoldChange > 0,
                         "GeneID"]
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")

# dowregulated fdr
fdrDW_GeneID <- df_zhang[df_zhang$padj < 0.05 & df_zhang$log2FoldChange < 0,
                         "GeneID"]
SPdw <- Gene2SProtein(genes = fdrDW_GeneID, input_type = "gene_name")

4.2 Explore public datasets

4.2.1 Dowload from GEO (Gene Expression Omnibus)

GEO (Edgar 2002) is a public functional genomics data repository containing high throughput gene expression data and hybridization arrays. We provide a handy interface to download experiments and curated gene expression profiles.

Here we are going to reanalyze the Cloughesy et al. (Cloughesy et al. 2019) public dataset of recurrent glioblastoma patients undergoing two different treatments: neoadjuvant pembrolizumab or adjuvant pembrolizumab..

This study is available under the GEO accession series GSE121810.

The metadata is downloaded with SurfR built-in function GEOmetadata. Note that this study has been sequenced with only one sequencing platform. If this is not the case, you have to download separately all the metadata specifying the GPL series numbers, and then merge them.

The count matrix is downloaded from ArchS4 (Lachmann et al. 2018) with the SurfR built-in function DownloadArchS4, to ensure handling not normalize data.

library(SurfR)
library(stringr)

# Download metadata from GEO
mGSE121810 <- GEOmetadata(GSE = "GSE121810")

# create new metadata column in order to remove unwanted special characters
unwanted_character <- " "
fx <- function(x) {
  str_split(string = x, pattern = unwanted_character)[[1]][1]
}
mGSE121810$condition <- sapply(mGSE121810$therapy, fx)
mGSE121810$condition <- as.factor(mGSE121810$condition)

# Preview metadata
head(mGSE121810)

# only select 3 samples per condition to save time
na_samples <- c("GSM3447013", "GSM3447018", "GSM3447019")
a_samples <- c("GSM3447023", "GSM3447024", "GSM3447026")
mGSE121810 <- mGSE121810[c(na_samples, a_samples), ]

# Download count matrix from ArchS4
cGSE121810 <- DownloadArchS4(mGSE121810$GSM, 
                             species = "human", 
                             print_tsv = FALSE, 
                             filename = NULL)

# Preview count matrix
head(cGSE121810[, ])

A fundamental objective in the analysis of RNA-seq counts data is the detection of differentially expressed genes. For this task we rely on the package DESeq2, starting from count data. Count data reports for each sample the number of sequence fragments that have been assigned to each gene.

Here, we use the built-in SurfR DGE function, to perform the differential expression analysis of the GSE121810 dataset.

Good Differential Expressed surface protein are the ones which are strongly expressed in one condition and almost not expressed in the other. To help detecting the best candidates, the output data.frame of the DGE function contains information on the average expression in the two group (see Mean_CPM_T and Mean_CPM_C columns).

# Perform DGE
df_GEO <- DGE(expression = cGSE121810,
              metadata = mGSE121810,
              Nreplica = 3,
              design = "~condition",
              condition = "condition",
              alpha = 0.05,
              TEST = "neoadjuvant", CTRL =  "adjuvant",
              output_tsv = FALSE)

# remove NA values
df_GEO <- df_GEO[!is.na(df_GEO$padj), ]

head(df_GEO)

Once DEGS have been detected, we may want to isolate Surface protein-coding genes.

The protein classification is based on a recently developed surfaceome predictor, called SURFY, based on machine learning.

We use the built-in SurfR Gene2SProtein function to identify Surface protein-coding genes (SP).

# Detect SP amoung differentially expressed genes
fdr_GeneID <- df_GEO[df_GEO$padj < 0.1, "GeneID"]

SP <- Gene2SProtein(genes = fdr_GeneID, input_type = "gene_name")

fdrUP_GeneID <- df_GEO[df_GEO$padj < 0.1 & df_GEO$log2FoldChange > 0, "GeneID"]
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")

fdrDW_GeneID <- df_GEO[df_GEO$padj < 0.1 & df_GEO$log2FoldChange < 0, "GeneID"]
SPdw <- Gene2SProtein(genes = fdrDW_GeneID, input_type = "gene_name")

4.2.2 Download from TCGA (The Cancer Genome Atlas Program)

TCGA (The Cancer Genome Atlas Research Network et al. 2013) contains data for thousands of tumor samples across more than 20 types of cancer. Navigating through all of the files manually is impossible. Therefore we provide a function based on TCGAbiolinks that automates and streamlines the retrieval of public TCGA transcriptomics data. Note that to use this function you need to install the developmental version of TCGAbiolinks (Mounir et al. 2019) (Colaprico et al. 2016).

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")

Here we reanalyze the TCGA-THYM dataset, since it is one of the smallest TCGA datasets, including normal solid tissue samples.

The TCGA count matrix and the metadata are downloaded with SurfR built-in function TCGA_download. The shortLetterCode column of the metadata allows us to distinguish Primary Solid Tumor (TP) and normal (NT) samples.

A fundamental task in the analysis of count data from RNA-seq is the detection of differentially expressed genes. For this task we rely on the package DESeq2, starting from counts data. The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene.

We use the built-in SurfR DGE function to perform the differential expression analysis of the TCGA-THYM dataset.

Good Differential Expressed surface protein are the ones which are strongly expressed in one condition and almost not expressed in the other. To help detecting those candidates, the output data.frame of the DGE function contains information on the average expression in the two group (see Mean_CPM_T and Mean_CPM_C columns).

df_TCGA <- DGE(expression = cTCGA.THYM,
               metadata = mTCGA.THYM,
               Nreplica = 2,
               design = "~shortLetterCode",
               condition = "shortLetterCode",
               alpha = 0.05,
               TEST = "TP", CTRL =  "NT",
               output_tsv = FALSE)

head(df_TCGA)

Once DEGs have been detected, we may want to isolate Surface protein-coding genes.

The protein classification takes advantage of a recently developed surfaceome predictor, called SURFY, based on machine learning.

We use the built-in SurfR Gene2SProtein function to identify Surface protein-coding genes (SP).

# remove NA values
df_TCGA <- df_TCGA[!is.na(df_TCGA$padj), ]

fdr_GeneID <- df_TCGA[df_TCGA$padj < 0.05,
                      "GeneID"]

SP <- Gene2SProtein(genes = fdr_GeneID, input_type = "gene_name")

fdrUP_GeneID <- df_TCGA[df_TCGA$padj < 0.05 & df_TCGA$log2FoldChange > 0,
                        "GeneID"]
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")

fdrDW_GeneID <- df_TCGA[df_TCGA$padj < 0.05 & df_TCGA$log2FoldChange < 0,
                        "GeneID"]
SPdw <- Gene2SProtein(genes = fdrDW_GeneID, input_type = "gene_name")

4.3 Meta-analysis

Analyzing data arising from several experiments studying the same question is a way to obtain more robust results, increasing the detection power of differentially expressed genes.

In SurfR we provide a set of functions based on MetaRNASeq (Rau, Marot, and Jaffrézic 2014) package to combine data from multiple RNAseq experiments.

Let’s suppose we want to integrate Breast cancer data from GEO and TCGA. Breast Cancer datasets are downloaded from TCGA with the SurfR built-in function TCGA_download.

cTCGA.BRCA <- TCGA.BRCA[[1]]
mTCGA.BRCA <- TCGA.BRCA[[2]]
table(mTCGA.BRCA$shortLetterCode)
#> 
#> NT TP 
#>  3  3

In GEO, we want to analyze Varley data (Varley et al. 2014), which includes samples of ER+ breast cancer, Triple Negative Breast cancer, adjacent tissues, and normal breast. These datasets can be retrieved from the GEO series GSE58135, using the SurfR built-in functions GEOmetadata and DownloadArchS4.

mGSE58135 <- GEOmetadata("GSE58135")
mGSE58135 <- mGSE58135[mGSE58135$tissue != "Breast Cancer Cell Line", ]
mGSE58135$condition <- "NT"
mGSE58135$condition[mGSE58135$tissue %in% c("ER+ Breast Cancer Primary Tumor", 
                                            "Triple Negative Breast Cancer Primary Tumor")] <- "TP"

# only select 3 samples per condition to save time
TP_samples <- c("GSM1401694", "GSM1401717", "GSM1401729")
NT_samples <- c("GSM1401799", "GSM1401813", "GSM1401814")
mGSE58135 <- mGSE58135[c(TP_samples, NT_samples), ]
cGSE58135 <- DownloadArchS4(mGSE58135$GSM, species = "human")
cGSE58135 <- cGSE58135[, row.names(mGSE58135)]

table(mGSE58135$condition)
#> 
#> NT TP 
#>  3  3

The first step in the analysis is to detect differentially expressed genes for each count data, separately. For this task we rely on the package DESeq2, starting from counts data. The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene.

We use the built-in SurfR DGE function to perform the differential expression analysis of the TCGA-BRCA dataset and GSE58135.

Good differentially expressed surface proteins should be strongly expressed in one condition and almost not expressed in the other. The output dataframe of the DGE function contains information on the average expression in the two groups (see Mean_CPM_T and Mean_CPM_C columns), to help in the detection of the best candidates.

# TCGA DGE
df.TCGA <- DGE(expression = cTCGA.BRCA,
               metadata = mTCGA.BRCA,
               Nreplica = 3,
               design = "~shortLetterCode",
               condition = "shortLetterCode",
               alpha = 0.05,
               TEST = "TP", CTRL =  "NT",
               output_tsv = FALSE)
#> Warning in DESeqDataSet(se, design = design, ignoreRank): 1233 duplicate
#> rownames were renamed by adding numbers
#> Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
#> design formula are characters, converting to factors
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
head(df.TCGA)
#> log2 fold change (MLE): shortLetterCode TP vs NT 
#> Wald test p-value: shortLetterCode TP vs NT 
#> DataFrame with 6 rows and 8 columns
#>              GeneID Mean_CPM_T Mean_CPM_C log2FoldChange     lfcSE      stat
#>         <character>  <numeric>  <numeric>      <numeric> <numeric> <numeric>
#> CST1           CST1    19.0532  0.0000000       12.33847  1.414031   8.72574
#> MMP13         MMP13    98.5584  0.0923984       10.44607  1.064577   9.81242
#> MMP1           MMP1    94.4173  0.2706227        8.69994  1.749042   4.97412
#> COL10A1     COL10A1   212.4102  1.2454422        7.69794  0.668788  11.51027
#> MMP11         MMP11   723.8036  5.3934170        7.43622  0.534409  13.91485
#> MMP10         MMP10    48.1166  0.5060880        6.92586  0.850766   8.14073
#>              pvalue        padj
#>           <numeric>   <numeric>
#> CST1    2.64455e-18 6.03219e-16
#> MMP13   9.95552e-23 3.43042e-20
#> MMP1    6.55456e-07 2.49767e-05
#> COL10A1 1.17109e-30 9.98197e-28
#> MMP11   5.14677e-44 1.66704e-40
#> MMP10   3.92895e-16 7.48581e-14
# GSE58135 DGE
df.GSE58135 <- DGE(expression = cGSE58135,
                   metadata = mGSE58135,
                   Nreplica = 3,
                   design = "~condition",
                   condition = "condition",
                   alpha = 0.05,
                   TEST = "TP", CTRL =  "NT",
                   output_tsv = FALSE)
#> Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
#> design formula are characters, converting to factors
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
head(df.GSE58135)
#> log2 fold change (MLE): condition TP vs NT 
#> Wald test p-value: condition TP vs NT 
#> DataFrame with 6 rows and 8 columns
#>                      GeneID Mean_CPM_T Mean_CPM_C log2FoldChange     lfcSE
#>                 <character>  <numeric>  <numeric>      <numeric> <numeric>
#> AL513523.1       AL513523.1   14.42919  0.0000000        12.8883   1.31949
#> AL513523.2       AL513523.2   14.42919  0.0000000        12.8883   1.31949
#> FAM204BP           FAM204BP   39.67196  0.0136062        11.7927   1.44085
#> HSFX2                 HSFX2    3.32139  0.0000000        10.8999   1.46267
#> CTD-2369P2.12 CTD-2369P2.12   32.45917  0.0224913        10.7918   1.09479
#> RP11-325P15.2 RP11-325P15.2    2.16334  0.0000000        10.2662   1.39009
#>                    stat      pvalue        padj
#>               <numeric>   <numeric>   <numeric>
#> AL513523.1      9.76762 1.55049e-22 5.46822e-21
#> AL513523.2      9.76762 1.55049e-22 5.46822e-21
#> FAM204BP        8.18454 2.73347e-16 4.00536e-15
#> HSFX2           7.45206 9.18934e-14 9.53677e-13
#> CTD-2369P2.12   9.85743 6.36555e-23 2.38817e-21
#> RP11-325P15.2   7.38527 1.52150e-13 1.53376e-12

Here we provide a function based on metaRNASeq bioconductor package to implement two p-value combination techniques (inverse normal and Fisher methods).

The meta-analysis is performed by the SurfR built-in function metaRNAseq, which requires as input:

  • a list of data.frames with the DGE results of the chosen databases to combine (ind_deg);
  • the statistic test to use to combine p.values, which can be the Fisher method (fishercomb) or the inverse normal combination technique (invnorm);
  • the Benjamini Hochberg threshold (BHth);
  • if using inverse normal combination technique, a vector of the number of replicates used in each study to calculate the previous one-sided p-values (nrep).

The function automatically produces and saves as .pdf histograms of raw p-values for each of the individual differential analyses performed using the independent filtering from DESeq2 package. You can also examine the p-value distribution after p.value combination.


L_fishercomb <- metaRNAseq(ind_deg = list(TCGA.BRCA =  df.TCGA, GEO.GSE58135 = df.GSE58135),
                           test_statistic = "fishercomb",
                           BHth = 0.05,
                           adjpval.t = 0.05)
L_invnorm <- metaRNAseq(ind_deg = list(TCGA.BRCA =  df.TCGA, GEO.GSE58135 = df.GSE58135),
                        test_statistic = "invnorm",
                        BHth = 0.05,
                        adjpval.t = 0.05,
                        nrep = c(102, 56))

Finally, we can summarize the results of the meta-analysis in a data.frame highlighting the statistical information for the common genes to all methods using the built-in SurfR function combine_fisher_invnorm and use the built-in SurfR Gene2SProtein function to identify Surface protein-coding genes (SP) among those.

Genes displaying contradictory differential expression in separate studies can be identified in the column signFC= 0 and removed from the list of differentially expressed genes via meta-analysis.

metacomb <- combine_fisher_invnorm(ind_deg = list(TCGA.BRCA =  df.TCGA, 
                                                  GEO.GSE58135 = df.GSE58135),
                                   invnorm = L_invnorm,
                                   fishercomb = L_fishercomb,
                                   adjpval = 0.05)


metacomb_GeneID <- metacomb[metacomb$signFC != 0, "GeneID"]
SP <- Gene2SProtein(genes = metacomb_GeneID, input_type = "gene_name")
#> 866 out of 7394 genes have a matching surface protein
metacombUP_GeneID <- metacomb[metacomb$signFC == 1, "GeneID"]
SPup <- Gene2SProtein(genes = metacombUP_GeneID, input_type = "gene_name")
#> 266 out of 3749 genes have a matching surface protein
metacombDW_GeneID <- metacomb[metacomb$signFC == -1, "GeneID"]
SPdw <- Gene2SProtein(genes = metacombDW_GeneID, input_type = "gene_name")
#> 600 out of 3645 genes have a matching surface protein

4.4 Functional Enrichment

After identifying the subset of genes enriched in our specific condition of interest, a range of analyses becomes useful to move beyond a mere gene list.

A general enrichment analysis allows to gain further insights about upregulated or downregulated DEGs. To do so, we use the SurfR built-in function Enrichment, based on the enrichR cran package (Kuleshov et al. 2016). You have the option to indicate the specific database you wish to utilize among the available in enrichR. The enrichR function listEnrichrDbs() allows you to navigate the options. Sporadically, network connectivity issues may arise with EnrichR server. If it happens, please, retry to run the function after a few minutes.

library(enrichR)

dfList <- list(GEO = df.GSE58135,
               TCGA = df.TCGA)

# Enrichment analysis
Enrich <- Enrichment(dfList, 
                     enrich.databases = c("GO_Biological_Process_2021", 
                                          "KEGG_2021_Human"),
                     p_adj = 0.05, logFC = 1)

head(Enrich$GEO$fdr_up$GO_Biological_Process_2021)

SurfR implements several visualization methods to help interpret enrichment results obtained through EnrichR using ggplot2, with the built-in function Enrichment_barplot.

It depicts gene count ratio and enrichment scores (- Log10 adjusted p values) as bar height and color. Users can specify the number of terms (most significant) to display.

library(ggplot2)

# barplot of the top 5 upregulated pathways
Enrichment_barplot(Enrich$GEO, 
                   enrich.databases <- c("GO_Biological_Process_2021",  
                                         "KEGG_2021_Human"), 
                   p_adj = 0.05, 
                   num_term = 5, 
                   cond = "UP")

# barplot of the top 5 downregulated pathways
Enrichment_barplot(Enrich$GEO, 
                   enrich.databases <- c("GO_Biological_Process_2021", 
                                         "KEGG_2021_Human"), 
                   p_adj = 0.05, 
                   num_term = 5, 
                   cond = "DOWN")

Moreover, we can annotate our list of genes with cross-database identifiers and descriptions (Entrezid, Uniprot, KEGG, etc.), taking advantage of one of the 35 gene-set libraries present in the Enrichr database, using the SurfR built-in function Annotate_SPID.

annotated <- Annotate_SPID(df.GSE58135, "WikiPathway_2021_Human") 
head(annotated, 10)

4.5 Results visualization

4.5.1 Bar plot

Utilizing the SurfR function Splot you can create barplots to visualize the annotation classes reported in the dataframe produced by the Gene2SProtein function. The default grouping is the Membranome Almen classification.

# upregulated genes in GEO dataset
fdrUP_GeneID <- df.GSE58135[df.GSE58135$padj < 0.05 & df.GSE58135$log2FoldChange > 0, "GeneID"]
# corresponding Surface Proteins
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")
#> 392 out of 5167 genes have a matching surface protein
# Barplot of Almen classification 
Splot(SPup,
      group.by = "Membranome.Almen.main-class",
      main = "Almen class")
#> Warning in Splot(SPup, group.by = "Membranome.Almen.main-class", main = "Almen
#> class"): NA value in your dataframe

4.5.2 Venn diagram

You can compare the list of resulting surface proteins from up to 7 different studies, using a venn diagram, with the built-in SurfR function SVenn.


S_list <- list(SP_1 = c("EPCAM", "CD24",  "DLK1",  "CDCP1", "LYVE1"),
               SP_2 = c("DLK1", "EPCAM", "EGFR", "UPK1A", "UPK2"))

SVenn(S_list,
      cols.use = c("green", "blue"),
      opacity = 0.5,
      output_intersectionFile = FALSE)

4.5.3 PCA

Principal Components Analysis (PCA) is a very useful diagnostic feature to gain insights about your datasets.

You can perform PCA and visualize the result with a customizable plot with the built-in SurfR function plotPCA.


SurfR::plotPCA(matrix = edgeR::cpm(cGSE58135), metadata = mGSE58135,               
               dims = c(1, 2),
               color.by = "condition", shape.by = "condition", 
               label = FALSE, main = "PCA GSE58135")

5 SessionInfo

sessionInfo()
#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] SurfR_1.0.0      BiocStyle_2.32.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.0               later_1.3.2                
#>   [3] filelock_1.0.3              R.oo_1.26.0                
#>   [5] tibble_3.2.1                preprocessCore_1.66.0      
#>   [7] XML_3.99-0.16.1             rpart_4.1.23               
#>   [9] lifecycle_1.0.4             httr2_1.0.1                
#>  [11] fastcluster_1.2.6           edgeR_4.2.0                
#>  [13] doParallel_1.0.17           processx_3.8.4             
#>  [15] lattice_0.22-6              MASS_7.3-60.2              
#>  [17] backports_1.4.1             magrittr_2.0.3             
#>  [19] openxlsx_4.2.5.2            limma_3.60.0               
#>  [21] Hmisc_5.1-2                 sass_0.4.9                 
#>  [23] rmarkdown_2.26              jquerylib_0.1.4            
#>  [25] yaml_2.3.8                  zip_2.3.1                  
#>  [27] chromote_0.2.0              DBI_1.2.2                  
#>  [29] ade4_1.7-22                 abind_1.4-5                
#>  [31] zlibbioc_1.50.0             rvest_1.0.4                
#>  [33] GenomicRanges_1.56.0        R.utils_2.12.3             
#>  [35] metaRNASeq_1.0.7            purrr_1.0.2                
#>  [37] BiocGenerics_0.50.0         WriteXLS_6.5.0             
#>  [39] phyloseq_1.48.0             nnet_7.3-19                
#>  [41] rappdirs_0.3.3              GenomeInfoDbData_1.2.12    
#>  [43] IRanges_2.38.0              S4Vectors_0.42.0           
#>  [45] ggrepel_0.9.5               vegan_2.6-4                
#>  [47] venn_1.12                   fitdistrplus_1.1-11        
#>  [49] permute_0.9-7               codetools_0.2-20           
#>  [51] DelayedArray_0.30.0         xml2_1.3.6                 
#>  [53] tidyselect_1.2.1            farver_2.1.1               
#>  [55] UCSC.utils_1.0.0            TCGAbiolinksGUI.data_1.23.0
#>  [57] matrixStats_1.3.0           stats4_4.4.0               
#>  [59] BiocFileCache_2.12.0        dynamicTreeCut_1.63-1      
#>  [61] base64enc_0.1-3             jsonlite_1.8.8             
#>  [63] multtest_2.60.0             Formula_1.2-5              
#>  [65] survival_3.6-4              iterators_1.0.14           
#>  [67] foreach_1.5.2               tools_4.4.0                
#>  [69] progress_1.2.3              Rcpp_1.0.12                
#>  [71] glue_1.7.0                  gridExtra_2.3              
#>  [73] SparseArray_1.4.0           admisc_0.35                
#>  [75] xfun_0.43                   mgcv_1.9-1                 
#>  [77] DESeq2_1.44.0               MatrixGenerics_1.16.0      
#>  [79] GenomeInfoDb_1.40.0         websocket_1.4.1            
#>  [81] dplyr_1.1.4                 withr_3.0.0                
#>  [83] BiocManager_1.30.22         fastmap_1.1.1              
#>  [85] rhdf5filters_1.16.0         fansi_1.0.6                
#>  [87] digest_0.6.35               SPsimSeq_1.14.0            
#>  [89] R6_2.5.1                    colorspace_2.1-0           
#>  [91] GO.db_3.19.1                biomaRt_2.60.0             
#>  [93] RSQLite_2.3.6               R.methodsS3_1.8.2          
#>  [95] utf8_1.2.4                  tidyr_1.3.1                
#>  [97] generics_0.1.3              data.table_1.15.4          
#>  [99] prettyunits_1.2.0           httr_1.4.7                 
#> [101] htmlwidgets_1.6.4           S4Arrays_1.4.0             
#> [103] pkgconfig_2.0.3             gtable_0.3.5               
#> [105] blob_1.2.4                  impute_1.78.0              
#> [107] selectr_0.4-2               SingleCellExperiment_1.26.0
#> [109] XVector_0.44.0              htmltools_0.5.8.1          
#> [111] bookdown_0.39               biomformat_1.32.0          
#> [113] scales_1.3.0                Biobase_2.64.0             
#> [115] png_0.1-8                   enrichR_3.2                
#> [117] knitr_1.46                  rstudioapi_0.16.0          
#> [119] rjson_0.2.21                tzdb_0.4.0                 
#> [121] reshape2_1.4.4              checkmate_2.3.1            
#> [123] nlme_3.1-164                curl_5.2.1                 
#> [125] cachem_1.0.8                rhdf5_2.48.0               
#> [127] stringr_1.5.1               parallel_4.4.0             
#> [129] foreign_0.8-86              AnnotationDbi_1.66.0       
#> [131] pillar_1.9.0                grid_4.4.0                 
#> [133] vctrs_0.6.5                 promises_1.3.0             
#> [135] dbplyr_2.5.0                cluster_2.1.6              
#> [137] htmlTable_2.4.2             evaluate_0.23              
#> [139] magick_2.8.3                tinytex_0.50               
#> [141] TCGAbiolinks_2.32.0         readr_2.1.5                
#> [143] mvtnorm_1.2-4               cli_3.6.2                  
#> [145] locfit_1.5-9.9              compiler_4.4.0             
#> [147] rlang_1.1.3                 crayon_1.5.2               
#> [149] labeling_0.4.3              assertr_3.0.1              
#> [151] ps_1.7.6                    plyr_1.8.9                 
#> [153] stringi_1.8.3               WGCNA_1.72-5               
#> [155] BiocParallel_1.38.0         munsell_0.5.1              
#> [157] Biostrings_2.72.0           Matrix_1.7-0               
#> [159] hms_1.1.3                   bit64_4.0.5                
#> [161] ggplot2_3.5.1               Rhdf5lib_1.26.0            
#> [163] KEGGREST_1.44.0             statmod_1.5.0              
#> [165] highr_0.10                  SummarizedExperiment_1.34.0
#> [167] igraph_2.0.3                memoise_2.0.1              
#> [169] bslib_0.7.0                 bit_4.0.5                  
#> [171] downloader_0.4              ape_5.8

References

Assefa, Alemu Takele, Jo Vandesompele, and Olivier Thas. 2020. “SPsimSeq: Semi-Parametric Simulation of Bulk and Single-Cell RNA-Sequencing Data.” Edited by Inanc Birol. Bioinformatics 36 (May): 3276–8. https://academic.oup.com/bioinformatics/article/36/10/3276/5739438.

Bausch-Fluck, Damaris, Ulrich Goldmann, Sebastian Müller, Marc Van Oostrum, Maik Müller, Olga T. Schubert, and Bernd Wollscheid. 2018. “The in Silico Human Surfaceome.” Proc. Natl. Acad. Sci. U.S.A. 115 (November). https://pnas.org/doi/full/10.1073/pnas.1808790115.

Cloughesy, Timothy F., Aaron Y. Mochizuki, Joey R. Orpilla, Willy Hugo, Alexander H. Lee, Tom B. Davidson, Anthony C. Wang, et al. 2019. “Neoadjuvant Anti-PD-1 Immunotherapy Promotes a Survival Benefit with Intratumoral and Systemic Immune Responses in Recurrent Glioblastoma.” Nat Med 25 (3): 477–86. https://www.nature.com/articles/s41591-018-0337-7.

Edgar, R. 2002. “Gene Expression Omnibus: NCBI Gene Expression and Hybridization Array Data Repository.” Nucleic Acids Research 30 (January). https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/30.1.207.

Kuleshov, Maxim V., Matthew R. Jones, Andrew D. Rouillard, Nicolas F. Fernandez, Qiaonan Duan, Zichen Wang, Simon Koplev, et al. 2016. “Enrichr: A Comprehensive Gene Set Enrichment Analysis Web Server 2016 Update.” Nucleic Acids Res 44 (W1): W90–W97. https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkw377.

Lachmann, Alexander, Denis Torre, Alexandra B. Keenan, Kathleen M. Jagodnik, Hoyjin J. Lee, Lily Wang, Moshe C. Silverstein, and Avi Ma’ayan. 2018. “Massive Mining of Publicly Available RNA-Seq Data from Human and Mouse.” Nat Commun 9 (1): 1366. https://www.nature.com/articles/s41467-018-03751-6.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biol 15 (December): 550. http://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8.

Mounir, Mohamed, Marta Lucchetta, Tiago C. Silva, Catharina Olsen, Gianluca Bontempi, Xi Chen, Houtan Noushmehr, Antonio Colaprico, and Elena Papaleo. 2019. “New Functionalities in the TCGAbiolinks Package for the Study and Integration of Cancer Data from GDC and GTEx.” PLOS Computational Biology 15 (3): 1–18. https://doi.org/10.1371/journal.pcbi.1006701.

Rau, Andrea, Guillemette Marot, and Florence Jaffrézic. 2014. “Differential Meta-Analysis of RNA-Seq Data from Multiple Studies.” BMC Bioinformatics 15 (1): 91. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-91.

The Cancer Genome Atlas Research Network, John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. 2013. “The Cancer Genome Atlas Pan-Cancer Analysis Project.” Nat Genet 45 (10): 1113–20. https://www.nature.com/articles/ng.2764.

Varley, Katherine E., Jason Gertz, Brian S. Roberts, Nicholas S. Davis, Kevin M. Bowling, Marie K. Kirby, Amy S. Nesmith, et al. 2014. “Recurrent Read-Through Fusion Transcripts in Breast Cancer.” Breast Cancer Res Treat 146 (2): 287–97. http://link.springer.com/10.1007/s10549-014-3019-2.

Zhang, Hailong, Tao Liu, Sha Yi, Lubing Gu, and Muxiang Zhou. 2015. “Targeting MYCN IRES in Mycn ‐Amplified Neuroblastoma with miR‐375 Inhibits Tumor Growth and Sensitizes Tumor Cells to Radiation.” Molecular Oncology 9 (7): 1301–11. https://febs.onlinelibrary.wiley.com/doi/10.1016/j.molonc.2015.03.005.