Version: 1.2.1

1 Overview

The transcriptogramer package is designed for transcriptional analysis based on transcriptograms, a method to analyze transcriptomes that projects expression values on a set of ordered proteins. The proteins are arranged such that the probability that two gene products participate in the same metabolic pathway decreases exponentially as the distance between them in the ordering increases. Transcriptograms are, hence, genome-wide gene expression profiles that provide a global view of the cellular metabolism, while indicating gene sets whose expression is altered (Silva and Almeida 2014; Rybarczyk-Filho and Almeida 2011).

Methods are provided to analyze topological properties of an interactome, to generate transcriptograms, to detect and to display differentially expressed gene clusters, and to perform a functional enrichment analysis on these clusters.

As a set of ordered proteins is required to run the methods, datasets are available for four species (Homo sapiens, Mus musculus, Saccharomyces cerevisiae and Rattus norvegicus). Each species has three datasets, derived from STRINGdb release 10.5 protein network data, with combined scores greater than or equal to 700, 800 and 900 (see Hs900, Hs800, Hs700, Mm900, Mm800, Mm700, Sc900, Sc800, Sc700, Rn900, Rn800 and Rn700 datasets). Custom sets of ordered proteins can be generated from protein network data using The transcriptogramer software on Windows.

2 Quick start

The first step is to create a Transcriptogram object by running the transcriptogramPreprocess() function. This example uses a subset of the Homo sapiens protein network data, from STRINGdb release 10.5, containing only associations of proteins of combined score greater than or equal to 900 (see Hs900 and association datasets).

library(transcriptogramer)
t <- transcriptogramPreprocess(association = association, ordering = Hs900)

2.1 Topological analysis

There are two methods to perform topological analysis: connectivityProperties() calculates average graph properties as a function of the node connectivity, and orderingProperties() calculates graph properties projected on the ordered proteins. Some methods, such as orderingProperties(), use a window, a region of n (radius * 2 + 1) proteins centered at a protein, so the radius changes their output. The Transcriptogram object has a radius slot that can be set during, or after, its preprocessing (see Transcriptogram-class documentation).

## during the preprocessing

## creating the object without a radius argument, which leaves the radius as 0
t <- transcriptogramPreprocess(association = association, ordering = Hs900)

## creating the object and setting the radius as 50
t <- transcriptogramPreprocess(association = association, ordering = Hs900,
                               radius = 50)
## after the preprocessing

## modifying the radius of an existing Transcriptogram object
radius(object = t) <- 25

## getting the radius of an existing Transcriptogram object
r <- radius(object = t)

The output of the orderingProperties() method is partially affected by the radius slot.

oPropertiesR25 <- orderingProperties(object = t, nCores = 1)

## slight change of radius
radius(object = t) <- 30

## this output differs partially from oPropertiesR25
oPropertiesR30 <- orderingProperties(object = t, nCores = 1)

As the connectivityProperties() method does not use a window, its output is not affected by the radius slot.

cProperties <- connectivityProperties(object = t)
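
To make the idea of "average graph properties as a function of the node connectivity" more concrete, the base R sketch below computes one such property on a toy network. The edge list, the identifiers and the choice of property (mean connectivity of a node's neighbours) are assumptions made for illustration; this text does not state which properties connectivityProperties() actually reports.

```r
## toy undirected edge list (hypothetical protein identifiers)
edges <- data.frame(p1 = c("A", "A", "A", "B"),
                    p2 = c("B", "C", "D", "C"),
                    stringsAsFactors = FALSE)

## list every edge in both directions: node -> neighbour
nodes <- c(edges$p1, edges$p2)
neighbours <- c(edges$p2, edges$p1)

## connectivity (degree) of each node
degree <- vapply(split(nodes, nodes), length, integer(1))

## illustrative property: mean connectivity of the neighbours of each node
neighbourDegree <- vapply(split(degree[neighbours], nodes), mean, numeric(1))

## average of that property over all nodes sharing the same connectivity
byConnectivity <- vapply(split(neighbourDegree,
                               degree[names(neighbourDegree)]),
                         mean, numeric(1))
byConnectivity
```

No window appears anywhere in this calculation, which is consistent with the radius slot having no effect on this kind of analysis.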

2.2 Transcriptogram

A transcriptogram is generated in two steps and requires expression values, from microarray or RNA-Seq assays (log2-counts-per-million), and a dictionary. This example uses the datasets GSE9988, which contains normalized expression values of 3 cases and 3 controls (GSM252443, GSM252444, GSM252445, GSM252465, GSM252466 and GSM252467, respectively), and GPL570, a mapping between ENSEMBL Peptide IDs and Affymetrix Human Genome U133 Plus 2.0 Array probe identifiers.

The methods to generate a transcriptogram are transcriptogramStep1() and transcriptogramStep2(). The transcriptogramStep1() method assigns to each protein, in each transcriptome sample, the average of the expression values of all the identifiers related to it.

t <- transcriptogramStep1(object = t, expression = GSE9988,
                          dictionary = GPL570, nCores = 1)
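
As a rough illustration of the averaging described above, the base R sketch below collapses identifier-level values into protein-level values. The toy `dictionary` and `expression` objects, their column names and their identifiers are made up for this example; it is a sketch of the idea, not the package's implementation.

```r
## toy dictionary: two probes map to protein P1, one probe maps to P2
## (identifiers and column names are hypothetical)
dictionary <- data.frame(Protein = c("P1", "P1", "P2"),
                         Probe = c("a_at", "b_at", "c_at"),
                         stringsAsFactors = FALSE)

## toy expression matrix: probes in rows, samples in columns
expression <- matrix(c(2, 4, 10,
                       6, 8, 20),
                     nrow = 3,
                     dimnames = list(c("a_at", "b_at", "c_at"),
                                     c("S1", "S2")))

## reorder the rows to follow the dictionary
values <- expression[dictionary$Probe, , drop = FALSE]

## sum the probes of each protein, then divide by the number of probes
sums <- rowsum(values, group = dictionary$Protein)
counts <- as.vector(table(dictionary$Protein)[rownames(sums)])
step1 <- sums / counts
step1
```

Each row of `step1` holds one protein and each column one sample, so P1 receives the mean of its two probes while P2 keeps the value of its single probe.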

To reduce random noise, the transcriptogramStep2() method assigns to each position of the ordering the average of the expression values inside a window. Periodic boundary conditions are used to deal with proteins near the ends of the ordering.

t <- transcriptogramStep2(object = t, nCores = 1)
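
The window average with periodic boundary conditions can be sketched in base R as follows. The helper `windowMean()` and the toy vector `x` are hypothetical names introduced here; the sketch assumes a window of radius * 2 + 1 positions centered at each position, as defined in the topological analysis section, and is not taken from the package source.

```r
## average of the values inside a window of 2 * radius + 1 positions,
## wrapping around the ends of the ordering (periodic boundary conditions)
windowMean <- function(x, radius) {
    n <- length(x)
    stopifnot(radius >= 0, 2 * radius + 1 <= n)
    vapply(seq_len(n), function(i) {
        ## positions i - radius, ..., i + radius, wrapped into 1..n
        idx <- ((i - 1 + (-radius:radius)) %% n) + 1
        mean(x[idx])
    }, numeric(1))
}

## toy protein-level values along an ordering of six positions
x <- c(1, 2, 3, 4, 5, 6)
smoothed <- windowMean(x, radius = 1)
smoothed
```

With a radius of 1, the first position is averaged with the second and the last positions, which is what the wrap-around means in practice. A radius of 0 returns the input unchanged.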

The Transcriptogram object has slots to store the outputs of the transcriptogramStep1() and transcriptogramStep2() methods, called transcriptogramS1 and transcriptogramS2 respectively. As the output of some methods is affected by the content of the transcriptogramS2 slot, this slot can be recalculated from the content of the transcriptogramS1 slot, for example after changing the radius.

radius(object = t) <- 80
t <- transcriptogramStep2(object = t, nCores = 1)

2.3 Functional enrichment analysis

As nearby genes of a transcriptogram have a high probability of interacting with each other, gene sets whose expression is altered can be identified using the limma package. The differentiallyExpressed() method uses the limma package to identify differentially expressed genes for the contrast “case-control” (the voom and limma-trend approaches are supported for RNA-Seq), and groups as a cluster a set of genes whose positions are within a radius range specified by the content of the radius slot.

For this example, the p-value threshold for false discovery rate will be set as 0.01. If the name of a species is provided as input, the biomaRt package is used to translate the ENSEMBL Peptide IDs to Symbols (Gene Names); alternatively, a data.frame can be provided and used instead. The levels argument classifies the columns of the transcriptogramS2 slot that refer to samples. As there are 6 such columns (see dataset GSE9988), a logical vector is created that uses TRUE to label the columns referring to control samples and FALSE to label the columns referring to case samples.

## trend = FALSE for microarray data or voom log2-counts-per-million
## the default value for trend is FALSE
levels <- c(rep(FALSE, 3), rep(TRUE, 3))
t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.01,
                             trend = FALSE)
## translating ENSEMBL Peptide IDs to Symbols using the biomaRt package
## Internet connection is required for this command
t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.01,
                             species = "Homo sapiens")

## translating ENSEMBL Peptide IDs to Symbols using the DEsymbols dataset
t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.01,
                             species = DEsymbols)

This method also produces a plot referring to its output. In this case, eleven clusters were detected, and each one is represented by a color. It is important to mention that not all the colored genes were detected as differentially expressed, but, as they were within the radius specified by the content of the radius slot, they were included in a cluster. The genes that are above the horizontal black line are upregulated, and the genes that are below are downregulated.
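
The way genes that are not themselves significant can end up inside a cluster can be sketched in base R. The positions, the radius and the merging rule below are assumptions chosen for illustration: every position within the radius of a significant position is covered, and each contiguous covered stretch is treated as one cluster. The rule actually applied by differentiallyExpressed() may differ in its details, and this simplified sketch ignores wrap-around at the ends of the ordering.

```r
## hypothetical input: positions of the ordering detected as significant
significant <- c(10, 12, 40)
radius <- 3
n <- 50  ## length of the toy ordering

## mark every position lying within `radius` of a significant position
covered <- rep(FALSE, n)
for (p in significant) {
    covered[max(1, p - radius):min(n, p + radius)] <- TRUE
}

## number the contiguous covered runs; uncovered positions stay NA
runs <- rle(covered)
ids <- rep(NA_integer_, length(runs$values))
ids[runs$values] <- seq_len(sum(runs$values))
cluster <- rep(ids, times = runs$lengths)

## positions assigned to each cluster
clusters <- split(seq_len(n), cluster)
clusters
```

In this toy case, positions 10 and 12 fall into the same cluster, which spans positions 7 to 15 even though only two of those positions were flagged as significant.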

The differentially expressed genes identified by this method are stored in the DE slot of the Transcriptogram object, and its content can be obtained using the DE method. By default, the p-values are adjusted by the Benjamini-Hochberg procedure.

DE <- DE(object = t)
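
For readers unfamiliar with the Benjamini-Hochberg procedure, base R exposes it through p.adjust(). The p-values below are made up, and comparing with `<=` is an assumption; whether the package uses a strict or non-strict comparison against the threshold is not stated here.

```r
## hypothetical raw p-values for five genes
p <- c(0.001, 0.008, 0.020, 0.040, 0.500)

## Benjamini-Hochberg adjusted p-values
adjusted <- p.adjust(p, method = "BH")

## genes kept at a false discovery rate threshold of 0.01
keep <- adjusted <= 0.01
data.frame(p, adjusted, keep)
```

The second gene has a raw p-value below 0.01 but is no longer kept after the adjustment, which shows why the threshold is described in terms of false discovery rate.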

The clusterVisualization() method uses the RedeR package to display graphs of the differentially expressed clusters and returns an object of the RedPort class, allowing interactions through functions of the RedeR package. This method may take some time, depending on the number of clusters and of nodes per cluster, and requires the Java Runtime Environment (>= 6). If the DE slot of the Transcriptogram object has a column named Symbol, its contents will be used as node aliases.

rdp <- clusterVisualization(object = t)

The clusterEnrichment() method performs a functional enrichment analysis using the topGO package. By default, the universe is composed of all the proteins present in the transcriptogramS2 slot, the ontology is set to biological process, the algorithm is set to classic, the statistic is set to fisher, and the p-values are adjusted by the Benjamini-Hochberg procedure. For this example, the p-value threshold for false discovery rate will be set as 0.005. This method uses the biomaRt package to build a gene2GO list if the name of a species is provided as input; alternatively, a data.frame can be provided and used instead.

## using the HsBPTerms dataset to create the gene2GO list
terms <- clusterEnrichment(object = t, species = HsBPTerms,
                           pValue = 0.005, nCores = 1)
## using the biomaRt package to create the gene2GO list
## Internet connection is required for this command
terms <- clusterEnrichment(object = t, species = "Homo sapiens",
                           pValue = 0.005, nCores = 1)
head(terms)
##        GO.ID                                        Term Annotated
## 1 GO:0010257         NADH dehydrogenase complex assembly        54
## 2 GO:0032981 mitochondrial respiratory chain complex ...        54
## 3 GO:0097031 mitochondrial respiratory chain complex ...        54
## 4 GO:0033108 mitochondrial respiratory chain complex ...        66
## 5 GO:0006120 mitochondrial electron transport, NADH t...        47
## 6 GO:0042775 mitochondrial ATP synthesis coupled elec...        83
##   Significant Expected       pValue ClusterNumber
## 1          52     0.95 6.638182e-28             1
## 2          52     0.95 6.638182e-28             1
## 3          52     0.95 6.638182e-28             1
## 4          52     1.16 6.638182e-28             1
## 5          42     0.83 6.638182e-28             1
## 6          43     1.46 6.638182e-28             1
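
The kind of test behind the default settings (classic algorithm, fisher statistic) can be sketched for a single GO term and a single cluster with base R's fisher.test(). All counts below are hypothetical and unrelated to the output above, and the sketch tests one term in isolation, without the multiple testing adjustment.

```r
## hypothetical counts
universe <- 1000    ## proteins in the universe
clusterSize <- 50   ## proteins in the cluster under test
annotated <- 40     ## universe proteins annotated with the GO term
significant <- 10   ## cluster proteins annotated with the GO term

## expected number of annotated proteins in the cluster under independence
expected <- annotated * clusterSize / universe

## 2 x 2 contingency table: annotation versus cluster membership
counts <- matrix(c(significant, clusterSize - significant,
                   annotated - significant,
                   universe - clusterSize - (annotated - significant)),
                 nrow = 2,
                 dimnames = list(c("annotated", "notAnnotated"),
                                 c("inCluster", "notInCluster")))

## one-sided test for over-representation of the term in the cluster
pEnrichment <- fisher.test(counts, alternative = "greater")$p.value
pEnrichment
```

Here the cluster holds 10 annotated proteins where about 2 would be expected by chance, so the resulting p-value is small. The Annotated, Significant and Expected columns of the output above can be read in the same spirit.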

3 Session info

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] transcriptogramer_1.2.1 BiocStyle_2.8.1        
## 
## loaded via a namespace (and not attached):
##  [1] progress_1.1.2       xfun_0.1             lattice_0.20-35     
##  [4] colorspace_1.3-2     doSNOW_1.0.16        htmltools_0.3.6     
##  [7] snow_0.4-2           stats4_3.5.0         yaml_2.1.19         
## [10] blob_1.1.1           XML_3.98-1.11        rlang_0.2.0         
## [13] pillar_1.2.2         DBI_1.0.0            BiocGenerics_0.26.0 
## [16] bit64_0.9-7          topGO_2.32.0         matrixStats_0.53.1  
## [19] foreach_1.4.4        plyr_1.8.4           stringr_1.3.1       
## [22] munsell_0.4.3        gtable_0.2.0         codetools_0.2-15    
## [25] memoise_1.1.0        evaluate_0.10.1      Biobase_2.40.0      
## [28] knitr_1.20           SparseM_1.77         IRanges_2.14.10     
## [31] biomaRt_2.36.1       parallel_3.5.0       AnnotationDbi_1.42.1
## [34] Rcpp_0.12.17         backports_1.1.2      scales_0.5.0        
## [37] limma_3.36.1         RedeR_1.28.0         S4Vectors_0.18.2    
## [40] graph_1.58.0         bit_1.1-13           ggplot2_2.2.1       
## [43] digest_0.6.15        stringi_1.2.2        bookdown_0.7        
## [46] rprojroot_1.3-2      grid_3.5.0           tools_3.5.0         
## [49] bitops_1.0-6         magrittr_1.5         RCurl_1.95-4.10     
## [52] lazyeval_0.2.1       RSQLite_2.1.1        tibble_1.4.2        
## [55] GO.db_3.6.0          pkgconfig_2.0.1      data.table_1.11.2   
## [58] prettyunits_1.0.2    assertthat_0.2.0     rmarkdown_1.9       
## [61] httr_1.3.1           iterators_1.0.9      R6_2.2.2            
## [64] igraph_1.2.1         compiler_3.5.0
warnings()

References

Rybarczyk-Filho, Castro, J.L., and R.M.C. de Almeida. 2011. “Towards a Genome-Wide Transcriptogram: The Saccharomyces Cerevisiae Case.” Nucleic Acids Research 39 (8):3005–16. https://doi.org/10.1093/nar/gkq1269.

Silva, Perrone da, S.R.M., and R.M.C. de Almeida. 2014. “Reproducibility Enhancement and Differential Expression of Non Predefined Functional Gene Sets in Human Genome.” BMC Genomics. https://doi.org/10.1186/1471-2164-15-1181.