pcaExplorer provides functionality for interactive visualization of RNA-seq datasets based on Principal Components Analysis. Such methods allow for quick information extraction and effective data exploration. A Shiny application encapsulates the whole analysis, and is developed to become a practical companion for any RNA-seq dataset.
Package: pcaExplorer
Authors: Federico Marini [aut, cre]
Version: 1.0.2
Compiled date: 2016-05-15
License: MIT + file LICENSE
pcaExplorer is an R package distributed as part of the Bioconductor project. To install the package, start R and enter:
source("http://bioconductor.org/biocLite.R")
biocLite("pcaExplorer")
Once pcaExplorer is installed, it can be loaded by the following command.
library("pcaExplorer")
pcaExplorer is a Bioconductor package containing a Shiny application for analyzing expression data in different conditions and experimental factors.
It is a general-purpose interactive companion tool for RNA-seq analysis, which guides the user in exploring the Principal Components of the data under inspection.
pcaExplorer provides tools and functionality to detect outlier samples, genes that show particular patterns, and additionally provides a functional interpretation of the principal components for further quality assessment and hypothesis generation on the input data.
Moreover, a novel visualization approach is presented to simultaneously assess the effect of more than one experimental factor on the expression levels.
Thanks to its interactive/reactive design, it is designed to become a practical companion to any RNA-seq dataset analysis, making exploratory data analysis accessible also to the bench biologist, while providing additional insight also for the experienced data analyst.
After loading the package, the pcaExplorer app can be launched in different modes:
pcaExplorer(dds = dds, rld = rld)
, where dds
is a DESeqDataSet
object and rld
is a DESeqTransform
object, which were created during an existing session for the analysis of an RNA-seq dataset with the DESeq2 package
pcaExplorer(dds = dds)
, where dds
is a DESeqDataSet
object. The rld
object is automatically computed upon launch.
pcaExplorer(countmatrix = countmatrix, coldata = coldata)
, where countmatrix
is a count matrix, generated after assigning reads to features such as genes via tools such as HTSeq-count
or featureCounts
, and coldata
is a data frame containing the experimental covariates of the experiments, such as condition, tissue, cell line, run batch and so on.
pcaExplorer()
, and then subsequently uploading the count matrix and the covariates data frame through the user interface. These files need to be formatted as tab separated files, which is a common format for storing such count values.
Additional parameters and objects that can be provided to the main pcaExplorer function are:
pca2go
, which is an object created by the pca2go
function, which scans the genes with high loadings in each principal component and each direction, and looks for functions (such as GO Biological Processes) that are enriched above the background. The offline pca2go
function is based on the routines and algorithms of the topGO package, but as an alternative, this object can be computed live during the execution of the app exploiting the goana
function, provided by the limma package. Although this likely provides more general (and probably less informative) functions, it is a good compromise for obtaining a further data interpretation.
annotation
, a data frame object, with row.names
as gene identifiers (e.g. ENSEMBL ids) identical to the row names of the count matrix or dds
object, and an extra column gene_name
, containing e.g. HGNC-based gene symbols. This can be used for making information extraction easier, as ENSEMBL ids (a usual choice when assigning reads to features) do not provide an immediate readout for which gene they refer to. This can be either passed as a parameter when launching the app, or also uploaded as a tab separated text file.
The pcaExplorer app is structured in different panels, each focused on a different aspect of the data exploration.
Most of the panels work extensively with click-based and brush-based interactions, to gain additional depth in the explorations, for example by zooming, subsetting, selecting. This is possible thanks to the recent developments in the shiny package/framework.
The available panels are the described in the following subsections.
Contains general information on pcaExplorer, including the developer’s contact, the link to the development version in Github, as well as the output of sessionInfo
, to use for reproducibility sake - or bug reporting.
This is where you most likely are reading this text (otherwise in the package vignette).
This panel displays information on the objects in use, either passed as parameters or generated from the count matrix provided. It also displays its metadata in an interactive table, along with an overview of the number of reads assigned to features (in the typical use case, genes) for each sample in form of a barplot and summary statistics.
This panel displays the PCA projections of sample expression profiles onto any pair of components, a scree plot, a zoomed PCA plot, a plot of the genes with top and bottom loadings. Additionally, this section presents a PCA plot where it is possible to remove samples deemed to be outliers in the analysis, which is very useful to check the effect of excluding them.
This panel displays the PCA projections of genes abundances onto any pair of components, with samples as biplot variables, to identify interesting groups of genes. Zooming is also possible, and clicking on single genes, a boxplot is returned, grouped by the factors of interest. A static and an interactive heatmap are provided, including the subset of selected genes. These are also reported in datatable
objects.
The user can search and display the expression values of a gene of interest, either by ID or gene name, as provided in the annotation
. A handy panel for quick screening of shortlisted genes, again grouped by the factors of interest. The graphic can be readily exported as it is.
This panel shows the functional annotation of the principal components, with GO functions enriched in the genes with high loadings on the selected principal components. It allows for the live computing of the object, that can otherwise provided as a parameter when launching the app. The panel displays a PCA plot for the samples, surrounded on each side by the tables with the functions enriched in each component and direction.
This panel allows for the multifactor exploration of datasets with 2 or more experimental factors. The user has to select first the two factors and the levels for each. Then, it is possible to combine samples from Factor1-Level1 in the selected order by clicking on each sample name, one for each level available in the selected Factor2. In order to build the matrix, an equal number of samples for each level of Factor 1 is required, to keep the design somehow balanced. A typical case for choosing factors 1 and 2 is for example when different conditions and tissues are present.
Once constructed, a plot is returned that tries to represent simultaneously the effect of the two factors on the data. Each gene is represented by a dot-line-dot structure, with the color that is indicating the tissue (factor 2) where the gene is mostly expressed. Each gene has two dots, one for each condition level (factor 1), and the position of the points is dictated by the scores of the principal components calculated on the matrix object. The line connecting the dots is darker when the tissue where the gene is mostly expressed varies throughout the conditions.
This representation is under active development, and it is promising for identifying interesting sets or clusters of genes according to their behavior on the Principal Components subspaces. Zooming and exporting of the underlying genes is also allowed by brushing on the main plot.
pcaExplorer
on published datasetsWe can run pcaExplorer for demonstration purpose on published datasets that are available as SummarizedExperiment in an experiment Bioconductor packages.
We will use the airway dataset, which can be installed with this command
source("https://bioconductor.org/biocLite.R")
biocLite("airway")
This package provides a RangedSummarizedExperiment
object of read counts in genes for an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone. More details such as gene models and count quantifications can be found in the airway package vignette.
To run pcaExplorer on this dataset, the following commands are required. First, prepare the objects to be passed as parameters of pcaExplorer
library(airway)
library(DESeq2)
data(airway)
dds_airway <- DESeqDataSet(airway,design= ~ cell + dex)
dds_airway
class: DESeqDataSet
dim: 64102 8
metadata(2): '' version
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
rld_airway <- rlogTransformation(dds_airway)
rld_airway
class: DESeqTransform
dim: 64102 8
metadata(2): '' version
assays(1): ''
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(6): baseMean baseVar ... dispFit rlogIntercept
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(10): SampleName cell ... BioSample sizeFactor
Then launch the app itself
pcaExplorer(dds = dds_airway,
rlt = rld_airway)
The annotation
for this dataset can be built by exploiting the org.Hs.eg.db package
library(org.Hs.eg.db)
genenames_airway <- mapIds(org.Hs.eg.db,keys = rownames(dds_airway),column = "SYMBOL",keytype="ENSEMBL")
annotation_airway <- data.frame(gene_name = genenames_airway,
row.names = rownames(dds_airway),
stringsAsFactors = FALSE)
head(annotation_airway)
gene_name
ENSG00000000003 TSPAN6
ENSG00000000005 TNMD
ENSG00000000419 DPM1
ENSG00000000457 SCYL3
ENSG00000000460 C1orf112
ENSG00000000938 FGR
Then again, the app can be launched with
pcaExplorer(dds = dds_airway,
rlt = rld_airway,
annotation = annotation_airway)
If desired, alternatives can be used. See the well written annotation workflow available at the Bioconductor site (https://bioconductor.org/help/workflows/annotation/annotation/).
pcaExplorer
on synthetic datasetsFor testing and demonstration purposes, a function is also available to generate synthetic datasets whose counts are generated based on two or more experimental factors.
This can be called with the command
dds_multifac <- makeExampleDESeqDataSet_multifac(betaSD_condition = 3,betaSD_tissue = 1)
See all the available parameters by typing ?makeExampleDESeqDataSet_multifac
. Credits are given to the initial implementation by Mike Love in the DESeq2 package.
The following steps run the app with the synthetic dataset
dds_multifac <- makeExampleDESeqDataSet_multifac(betaSD_condition = 1,betaSD_tissue = 3)
dds_multifac
class: DESeqDataSet
dim: 1000 12
metadata(1): version
assays(1): counts
rownames(1000): gene1 gene2 ... gene999 gene1000
rowData names(4): trueIntercept trueBeta_condition trueBeta_tissue
trueDisp
colnames(12): sample1 sample2 ... sample11 sample12
colData names(2): condition tissue
rld_multifac <- rlogTransformation(dds_multifac)
rld_multifac
class: DESeqTransform
dim: 1000 12
metadata(1): version
assays(1): ''
rownames(1000): gene1 gene2 ... gene999 gene1000
rowData names(10): trueIntercept trueBeta_condition ... dispFit
rlogIntercept
colnames(12): sample1 sample2 ... sample11 sample12
colData names(3): condition tissue sizeFactor
## checking how the samples cluster on the PCA plot
pcaplot(rld_multifac,intgroup = c("condition","tissue"))
Launch the app for exploring this dataset with
pcaExplorer(dds = dds_multifac,
rlt = rld_multifac)
When such a dataset is provided, the panel for multifactorial exploration is also usable at its best.
The functions exported by the pcaExplorer package can be also used in a standalone scenario, provided the required objects are in the working environment. They are listed here for an overview, but please refer to the documentation for additional details. Where possible, for each function a code snippet will be provided for its typical usage.
pcaplot
pcaplot
plots the sample PCA for DESeqTransform
objects, such as rlog-transformed data. This is the workhorse of the Samples View tab
pcaplot(rld_airway,intgroup = c("cell","dex"),ntop = 1000,
pcX = 1, pcY = 2, title = "airway dataset PCA on samples - PC1 vs PC2")
# on a different set of principal components...
pcaplot(rld_airway,intgroup = c("dex"),ntop = 1000,
pcX = 3, pcY = 4, title = "airway dataset PCA on samples - PC3 vs PC4")
pcascree
pcascree
produces a scree plot of the PC computed on the samples. A prcomp
object needs to be passed as main argument
pcaobj_airway <- prcomp(t(assay(rld_airway)))
pcascree(pcaobj_airway,type="pev",
title="Proportion of explained proportion of variance - airway dataset")
correlatePCs
and plotPCcorrs
correlatePCs
and plotPCcorrs
respectively compute and plot significance of the (cor)relation of each covariate versus a principal component. The input for correlatePCs
is a prcomp
object
res_pcairway <- correlatePCs(pcaobj_airway,colData(dds_airway))
res_pcairway
SampleName cell dex albut Run avgLength Experiment
PC_1 0.4288799 0.68227033 0.02092134 NA 0.4288799 0.2554109 0.4288799
PC_2 0.4288799 0.11161023 0.56370286 NA 0.4288799 0.1993592 0.4288799
PC_3 0.4288799 0.10377716 0.38647623 NA 0.4288799 0.1864725 0.4288799
PC_4 0.4288799 0.08331631 0.56370286 NA 0.4288799 0.4635148 0.4288799
Sample BioSample
PC_1 0.4288799 0.4288799
PC_2 0.4288799 0.4288799
PC_3 0.4288799 0.4288799
PC_4 0.4288799 0.4288799
plotPCcorrs(res_pcairway)
hi_loadings
hi_loadings
extracts and optionally plots the genes with the highest loadings
# extract the table of the genes with high loadings
hi_loadings(pcaobj_airway,topN = 10,exprTable=counts(dds_airway))
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000143127 11 108 24 485 41
ENSG00000168309 12 274 35 451 1
ENSG00000101347 1632 17126 2098 19694 1598
ENSG00000211445 916 15749 3142 24057 1627
ENSG00000096060 260 4652 381 3875 601
ENSG00000163884 70 1325 52 702 36
ENSG00000171819 4 50 19 543 1
ENSG00000127954 13 247 25 889 2
ENSG00000152583 62 2040 99 1172 100
ENSG00000109906 4 739 5 429 1
ENSG00000162692 914 62 1192 55 1359
ENSG00000178695 4746 830 4805 414 5321
ENSG00000214814 312 24 193 28 501
ENSG00000164742 1506 347 275 14 137
ENSG00000138316 1327 207 1521 118 1962
ENSG00000123610 444 136 303 36 1170
ENSG00000124766 2483 406 2057 185 2829
ENSG00000105989 562 47 1575 106 106
ENSG00000013293 268 23 435 56 558
ENSG00000146250 330 41 907 89 720
SRR1039517 SRR1039520 SRR1039521
ENSG00000143127 607 77 660
ENSG00000168309 65 4 193
ENSG00000101347 17697 1683 32036
ENSG00000211445 16274 1741 24883
ENSG00000096060 5493 154 4118
ENSG00000163884 487 34 1355
ENSG00000171819 10 14 1067
ENSG00000127954 199 20 462
ENSG00000152583 1924 79 2138
ENSG00000109906 581 12 1113
ENSG00000162692 171 646 31
ENSG00000178695 1391 4411 606
ENSG00000214814 65 789 76
ENSG00000164742 37 475 56
ENSG00000138316 618 1045 152
ENSG00000123610 195 473 37
ENSG00000124766 870 1851 301
ENSG00000105989 24 382 46
ENSG00000013293 75 562 74
ENSG00000146250 123 439 60
# or alternatively plot the values
hi_loadings(pcaobj_airway,topN = 10,annotation = annotation_airway)
genespca
genespca
computes and plots the principal components of the genes, eventually displaying the samples as in a typical biplot visualization. This is the function in action for the Genes View tab
groups_airway <- colData(dds_airway)$dex
cols_airway <- scales::hue_pal()(2)[groups_airway]
# with many genes, do not plot the labels of the genes
genespca(rld_airway,ntop=5000,
choices = c(1,2),
arrowColors=cols_airway,groupNames=groups_airway,
alpha = 0.2,
useRownamesAsLabels=FALSE,
varname.size = 5
)
# with a smaller number of genes, plot gene names included in the annotation
genespca(rld_airway,ntop=100,
choices = c(1,2),
arrowColors=cols_airway,groupNames=groups_airway,
alpha = 0.7,
varname.size = 5,
annotation = annotation_airway
)
topGOtable
topGOtable
is a convenient wrapper for extracting functional GO terms enriched in a subset of genes (such as the differentially expressed genes), based on the algorithm and the implementation in the topGO package
# example not run due to quite long runtime
dds_airway <- DESeq(dds_airway)
res_airway <- results(dds_airway)
res_airway$symbol <- mapIds(org.Hs.eg.db,
keys=row.names(res_airway),
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
res_airway$entrez <- mapIds(org.Hs.eg.db,
keys=row.names(res_airway),
column="ENTREZID",
keytype="ENSEMBL",
multiVals="first")
resOrdered <- as.data.frame(res_airway[order(res_airway$padj),])
head(resOrdered)
# extract DE genes
de_df <- resOrdered[resOrdered$padj < .05 & !is.na(resOrdered$padj),]
de_symbols <- de_df$symbol
# extract background genes
bg_ids <- rownames(dds_airway)[rowSums(counts(dds_airway)) > 0]
bg_symbols <- mapIds(org.Hs.eg.db,
keys=bg_ids,
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
# run the function
topgoDE_airway <- topGOtable(de_symbols, bg_symbols,
ontology = "BP",
mapping = "org.Hs.eg.db",
geneID = "symbol")
pca2go
pca2go
provides a functional interpretation of the principal components, by extracting the genes with the highest loadings for each PC, and then runs internally topGOtable
on them for efficient functional enrichment analysis. Needs a DESeqTransform
object as main parameter
pca2go_airway <- pca2go(rld_airway,
annotation = annotation_airway,
organism = "Hs",
ensToGeneSymbol = TRUE,
background_genes = bg_ids)
# for a smooth interactive exploration, use DT::datatable
datatable(pca2go_airway)
limmaquickpca2go
limmaquickpca2go
is an alternative to pca2go
, used in the live running app, thanks to its fast implementation based on the limma::goana
function.
goquick_airway <- limmaquickpca2go(rld_airway,
pca_ngenes = 10000,
inputType = "ENSEMBL",
organism = "Hs")
# display it in the normal R session...
head(goquick_airway$PC1$posLoad)
# ... or use it for running the app and display in the dedicated tab
pcaExplorer(dds_airway,rld_airway,
pca2go = goquick_airway,
annotation = annotation_airway)
makeExampleDESeqDataSet_multifac
makeExampleDESeqDataSet_multifac
constructs a simulated DESeqDataSet
of Negative Binomial dataset from different conditions. The fold changes between the conditions can be adjusted with the betaSD_condition
betaSD_tissue
arguments
dds_simu <- makeExampleDESeqDataSet_multifac(betaSD_condition = 3,betaSD_tissue = 0.5)
dds_simu
class: DESeqDataSet
dim: 1000 12
metadata(1): version
assays(1): counts
rownames(1000): gene1 gene2 ... gene999 gene1000
rowData names(4): trueIntercept trueBeta_condition trueBeta_tissue
trueDisp
colnames(12): sample1 sample2 ... sample11 sample12
colData names(2): condition tissue
dds2_simu <- makeExampleDESeqDataSet_multifac(betaSD_condition = 0.5,betaSD_tissue = 2)
dds2_simu
class: DESeqDataSet
dim: 1000 12
metadata(1): version
assays(1): counts
rownames(1000): gene1 gene2 ... gene999 gene1000
rowData names(4): trueIntercept trueBeta_condition trueBeta_tissue
trueDisp
colnames(12): sample1 sample2 ... sample11 sample12
colData names(2): condition tissue
rld_simu <- rlogTransformation(dds_simu)
rld2_simu <- rlogTransformation(dds2_simu)
pcaplot(rld_simu,intgroup = c("condition","tissue")) +
ggplot2::ggtitle("Simulated data - Big condition effect, small tissue effect")
pcaplot(rld2_simu,intgroup = c("condition","tissue")) +
ggplot2::ggtitle("Simulated data - Small condition effect, bigger tissue effect")
Additional functionality for the pcaExplorer will be added in the future, as it is tightly related to a topic under current development research.
Improvements, suggestions, bugs, issues and feedback of any type can be sent to marinif@uni-mainz.de.
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] org.Hs.eg.db_3.3.0 AnnotationDbi_1.34.2
[3] DESeq2_1.12.2 airway_0.106.1
[5] SummarizedExperiment_1.2.2 GenomicRanges_1.24.0
[7] GenomeInfoDb_1.8.2 IRanges_2.6.0
[9] S4Vectors_0.10.0 pcaExplorer_1.0.2
[11] bigmemory_4.5.19 bigmemory.sri_0.1.3
[13] Biobase_2.32.0 BiocGenerics_0.18.0
[15] knitr_1.13 BiocStyle_2.0.2
loaded via a namespace (and not attached):
[1] foreach_1.4.3 splines_3.3.0 topGO_2.24.0
[4] Formula_1.2-1 shiny_0.13.2 latticeExtra_0.6-28
[7] RBGL_1.48.0 Category_2.38.0 yaml_2.1.13
[10] ggrepel_0.5 RSQLite_1.0.0 lattice_0.20-33
[13] limma_3.28.4 chron_2.3-47 digest_0.6.9
[16] RColorBrewer_1.1-2 XVector_0.12.0 colorspace_1.2-6
[19] htmltools_0.3.5 httpuv_1.3.3 Matrix_1.2-6
[22] plyr_1.8.3 GSEABase_1.34.0 XML_3.98-1.4
[25] SparseM_1.7 genefilter_1.54.2 zlibbioc_1.18.0
[28] xtable_1.8-2 GO.db_3.3.0 scales_0.4.0
[31] BiocParallel_1.6.2 annotate_1.50.0 pkgmaker_0.22
[34] ggplot2_2.1.0 DT_0.1 nnet_7.3-12
[37] survival_2.39-4 magrittr_1.5 mime_0.4
[40] evaluate_0.9 doParallel_1.0.10 NMF_0.20.6
[43] foreign_0.8-66 GOstats_2.38.0 shinydashboard_0.5.1
[46] graph_1.50.0 tools_3.3.0 registry_0.3
[49] data.table_1.9.6 gridBase_0.4-7 formatR_1.4
[52] matrixStats_0.50.2 stringr_1.0.0 munsell_0.4.3
[55] locfit_1.5-9.1 cluster_2.0.4 rngtools_1.2.4
[58] grid_3.3.0 iterators_1.0.8 AnnotationForge_1.14.2
[61] htmlwidgets_0.6 labeling_0.3 shinyBS_0.61
[64] base64enc_0.1-3 rmarkdown_0.9.6 gtable_0.2.0
[67] codetools_0.2-14 d3heatmap_0.6.1.1 DBI_0.4-1
[70] reshape2_1.4.1 R6_2.1.2 gridExtra_2.2.1
[73] Hmisc_3.17-4 stringi_1.0-1 Rcpp_0.12.5
[76] geneplotter_1.50.0 rpart_4.1-10 acepack_1.3-3.3
[79] png_0.1-7