Package: pcaExplorer
Authors: Federico Marini [aut, cre] (https://orcid.org/0000-0003-3252-7758)
Version: 2.24.0
Compiled date: 2022-11-01
Last edited: 2019-02-26
License: MIT + file LICENSE

1 Getting started

pcaExplorer is an R package distributed as part of the Bioconductor project. To install the package, start R and enter:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("pcaExplorer")

To install pcaExplorer with all its dependencies (i.e. also the ones listed in the Suggests: field of the DESCRIPTION file, which include the dataset from the airway package used as a demo), use this command instead:

BiocManager::install("pcaExplorer", dependencies = TRUE)

If you prefer, you can install and use the development version, which can be retrieved via GitHub (https://github.com/federicomarini/pcaExplorer). To do so, use:

library("devtools")
install_github("federicomarini/pcaExplorer")

Once pcaExplorer is installed, it can be loaded with the following command:

library("pcaExplorer")

2 Introduction

pcaExplorer is a Bioconductor package containing a Shiny application for analyzing expression data in different conditions and experimental factors.

It is a general-purpose interactive companion tool for RNA-seq analysis, which guides the user in exploring the Principal Components of the data under inspection.

pcaExplorer provides tools and functionality to detect outlier samples, genes that show particular patterns, and additionally provides a functional interpretation of the principal components for further quality assessment and hypothesis generation on the input data.

Moreover, a novel visualization approach is presented to simultaneously assess the effect of more than one experimental factor on the expression levels.

Thanks to its interactive/reactive design, pcaExplorer is intended as a practical companion to any RNA-seq dataset analysis, making exploratory data analysis accessible to the bench biologist while also providing additional insight for the experienced data analyst.

Starting from development version 1.1.3, pcaExplorer supports reproducible research with state saving and automated report generation. Each generated plot and table can be exported by simple mouse clicks on the dedicated buttons.

2.1 Citation info

If you use pcaExplorer for your analysis, please cite it as follows:

citation("pcaExplorer")

Please cite the articles below for the 'pcaExplorer' software itself,
or its usage in combined workflows with the 'ideal' or 'GeneTonic'
software packages:

  Federico Marini, Harald Binder (2019). pcaExplorer: an R/Bioconductor
  package for interacting with RNA-seq principal components. BMC
  Bioinformatics, 20 (1), 331, <doi:10.1186/s12859-019-2879-1>,
  <doi:10.18129/B9.bioc.pcaExplorer>.

  Annekathrin Ludt, Arsenij Ustjanzew, Harald Binder, Konstantin
  Strauch, Federico Marini (2022). Interactive and Reproducible
  Workflows for Exploring and Modeling RNA-seq Data with pcaExplorer,
  ideal, and GeneTonic. Current Protocols, 2 (4), e411,
  <doi:10.1002/cpz1.411>.

To see these entries in BibTeX format, use 'print(<citation>,
bibtex=TRUE)', 'toBibtex(.)', or set
'options(citation.bibtex.max=999)'.

3 Launching the application

After loading the package, the pcaExplorer app can be launched in different modes:

  • pcaExplorer(dds = dds, dst = dst), where dds is a DESeqDataSet object and dst is a DESeqTransform object, which were created during an existing session for the analysis of an RNA-seq dataset with the DESeq2 package.

  • pcaExplorer(dds = dds), where dds is a DESeqDataSet object. The dst object can be automatically computed upon launch, choosing between rlog transformation, variance stabilizing transformations, or shifted logarithm transformation (with pseudocount = 1).

  • pcaExplorer(countmatrix = countmatrix, coldata = coldata), where countmatrix is a count matrix, generated after assigning reads to features such as genes via tools such as HTSeq-count or featureCounts, and coldata is a data frame containing the experimental covariates of the experiments, such as condition, tissue, cell line, run batch and so on. If the data is provided in this way, the user can click on the “Generate the dds and dst objects” button to complete the setup and enable the subsequent steps in the other panels.

  • pcaExplorer(), and then subsequently uploading the count matrix and the covariates data frame through the user interface. These files need to be formatted as tab, semicolon, or comma separated text files, all of which are common formats for storing such count values.
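
The first two launch modes can be sketched as follows, using the airway dataset mentioned above as demo data. This is a minimal sketch, assuming that the DESeq2 and airway packages are installed. The design formula ~ cell + dex, the choice of the variance stabilizing transformation, and the object names are illustrative. The wrapper launch_airway_demo is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper (not part of pcaExplorer): builds the demo objects
# and launches the app when running in an interactive session.
launch_airway_demo <- function(precompute_dst = TRUE) {
  needed <- c("pcaExplorer", "DESeq2", "airway")
  absent <- needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]
  if (length(absent) > 0) {
    stop("Please install: ", paste(absent, collapse = ", "))
  }

  # Load the demo SummarizedExperiment into a local environment
  env <- new.env()
  utils::data("airway", package = "airway", envir = env)

  # DESeqDataSet with an illustrative design (cell line + treatment)
  dds_airway <- DESeq2::DESeqDataSet(env$airway, design = ~ cell + dex)

  if (!interactive()) {
    message("Not an interactive session: skipping the app launch.")
    return(invisible(dds_airway))
  }

  if (precompute_dst) {
    # Mode 1: provide both the dds and a precomputed DESeqTransform object
    dst_airway <- DESeq2::vst(dds_airway)
    pcaExplorer::pcaExplorer(dds = dds_airway, dst = dst_airway)
  } else {
    # Mode 2: provide the dds only; the transformation is computed in the app
    pcaExplorer::pcaExplorer(dds = dds_airway)
  }
}
```

Calling launch_airway_demo() in an interactive R session opens the app with both objects, while launch_airway_demo(precompute_dst = FALSE) leaves the choice of transformation to the app.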

Additional parameters and objects that can be provided to the main pcaExplorer function are:

  • pca2go, an object created by the pca2go function, which scans the genes with high loadings in each principal component and each direction, and looks for functions (such as GO Biological Processes) that are enriched above the background. The offline pca2go function is based on the routines and algorithms of the topGO package. As an alternative, this object can be computed live during the execution of the app with limmaquickpca2go, which relies on the goana function provided by the limma package. Although this likely returns more general (and probably less informative) functions, it is a good compromise for quickly obtaining a further interpretation of the data.

  • annotation, a data frame object, with row.names as gene identifiers (e.g. ENSEMBL ids) identical to the row names of the count matrix or dds object, and an extra column gene_name, containing e.g. HGNC-based gene symbols. This makes information extraction easier, as ENSEMBL ids (a usual choice when assigning reads to features) do not immediately tell which gene they refer to. The annotation can either be passed as a parameter when launching the app, or be uploaded as a text file (tab, comma, or semicolon-separated). The package provides two functions, get_annotation and get_annotation_orgdb, as convenient wrappers to obtain up-to-date annotation information from biomaRt or via the org.XX.eg.db packages, respectively.
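
As a rough sketch of how these two optional objects might be prepared before launching the app, the helper below wraps get_annotation_orgdb and limmaquickpca2go. The helper name prepare_extras is hypothetical. The argument values (human data, ENSEMBL identifiers, the org.Hs.eg.db package, 10000 genes) are assumptions that need adapting to the dataset at hand, and the exact signatures should be checked against the help pages of the installed version.

```r
# Hypothetical helper (not part of pcaExplorer): prepares the optional
# annotation and pca2go objects for human data with ENSEMBL identifiers.
prepare_extras <- function(dds, dst) {
  needed <- c("pcaExplorer", "org.Hs.eg.db")
  absent <- needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]
  if (length(absent) > 0) {
    stop("Please install: ", paste(absent, collapse = ", "))
  }

  # Annotation via the org.XX.eg.db route (assumed values for human data)
  anno_df <- pcaExplorer::get_annotation_orgdb(
    dds,
    orgdb_species = "org.Hs.eg.db",
    idtype = "ENSEMBL"
  )
  # The app expects gene ids as row names plus a gene_name column
  stopifnot("gene_name" %in% colnames(anno_df))

  # Quick functional interpretation of the components, based on limma::goana
  pca2go_quick <- pcaExplorer::limmaquickpca2go(
    dst,
    pca_ngenes = 10000,
    inputType = "ENSEMBL",
    organism = "Hs"
  )

  list(annotation = anno_df, pca2go = pca2go_quick)
}
```

With extras <- prepare_extras(dds, dst), the app could then be started as pcaExplorer(dds = dds, dst = dst, annotation = extras$annotation, pca2go = extras$pca2go).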

3.1 How to provide your input data in pcaExplorer

pcaExplorer supports a number of file formats when uploading the data via the file input widgets. Starting from version 2.9.5, we added functionality to select the separator character for each of the uploadable files. An information box is also shown by clicking on the question mark icon in the Data upload panel, with detailed information (text, as well as screenshots of valid input files) on the format specification.

By default, pcaExplorer expects tab-separated files:

  • the countmatrix: contains the expression matrix, with one gene per row and one sample per column; the first column should contain the gene identifiers, and the header (first row) specifies the sample names.
  • the coldata: one sample per row, and one experimental covariate per column. Row names should be specified in the first column, and have to match the column names of the countmatrix. Column names will contain the specific experimental covariates.
  • the annotation (optional): one gene per row, and one identifier type per column. Gene identifiers in the first column are identical to the row names of the countmatrix or dds objects. At least an extra column gene_name, containing e.g. HGNC-based gene symbols, needs to be provided.
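
The layout described above can be reproduced with base R alone. The sketch below writes three small tab-separated files to a temporary directory and reads them back to check that the identifiers line up. The counts are simulated, and the gene identifiers, sample names, and covariates are made up for illustration only.

```r
set.seed(42)

# Toy objects: 3 genes x 4 samples (illustrative values only)
genes   <- c("ENSG00000141510", "ENSG00000012048", "ENSG00000146648")
samples <- c("sample1", "sample2", "sample3", "sample4")

countmatrix <- matrix(rpois(12, lambda = 100), nrow = 3,
                      dimnames = list(genes, samples))
coldata <- data.frame(condition = c("control", "control", "treated", "treated"),
                      batch     = c("b1", "b2", "b1", "b2"),
                      row.names = samples)
annotation <- data.frame(gene_name = c("TP53", "BRCA1", "EGFR"),
                         row.names = genes)

# Write a tab-separated file: identifiers in the first column, names in the header
write_tsv <- function(x, file) {
  write.table(x, file = file, sep = "\t", quote = FALSE, col.names = NA)
}

out_dir    <- tempdir()
count_file <- file.path(out_dir, "countmatrix.tsv")
cold_file  <- file.path(out_dir, "coldata.tsv")
anno_file  <- file.path(out_dir, "annotation.tsv")

write_tsv(countmatrix, count_file)
write_tsv(coldata, cold_file)
write_tsv(annotation, anno_file)

# Read the files back and verify the constraints listed above
cm_back   <- as.matrix(read.delim(count_file, row.names = 1, check.names = FALSE))
cd_back   <- read.delim(cold_file, row.names = 1, check.names = FALSE)
anno_back <- read.delim(anno_file, row.names = 1, check.names = FALSE)

stopifnot(
  identical(colnames(cm_back), rownames(cd_back)),    # samples match
  identical(rownames(cm_back), rownames(anno_back)),  # genes match
  "gene_name" %in% colnames(anno_back)                # required extra column
)
```

Running the same checks on real files before uploading them is a cheap way to catch mismatched sample names.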

3.2 Up and running with pcaExplorer

We recommend that users refer to the dedicated vignette, entitled “Up and running with pcaExplorer”.

This document describes a use case for pcaExplorer, based on the dataset in the airway package.

4 The controls sidebar

Most of the input controls are located in the sidebar; some are also found in the individual tabs of the app. By changing one or more of the input parameters, the user gains fine control over what is displayed.

4.1 App settings

Here are the parameters that set input values for most of the tabs. By hovering over a control with the mouse, the user can receive additional information on how to set the parameter, with tooltips powered by the shinyBS package.

  • x-axis PC - Select the principal component to display on the x axis
  • y-axis PC - Select the principal component to display on the y axis
  • Group/color by - Select the group of samples to stratify the analysis. Multiple values can be selected.
  • Nr of (most variant) genes - Number of genes to select for computing the principal components. The top n genes are selected, ranked by their variance across samples
  • Alpha - Color transparency for the plots. Can assume values from 0 (transparent) to 1 (opaque)
  • Labels size - Size of the labels for the samples in the principal components plots. This parameter also controls the size of the gene labels, which are displayed in the Genes View once the user has brushed an area in the main plot.
  • Points size - Size of the points to be plotted in the principal components plots
  • Variable name size - Size of the labels in the genes PCA, which correspond to the sample names
  • Scaling factor - Scale value for resizing the arrows corresponding to the variables in the PCA for the genes. It should be used for visualization purposes only
  • Color palette - Select the color palette to be used in the principal components plots. The number of colors is selected automatically according to the number of samples and to the levels of the factors of interest and their interactions
  • Plot style for gene counts - Plot either boxplots or violin plots, with jittered points superimposed
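
To make the effect of some of these settings concrete, the sketch below selects the top n genes by variance across samples, computes the principal components with base R's prcomp, and plots two chosen components. It runs on simulated data and is not the code used inside the app. The function top_variance_pca and all variable names are hypothetical.

```r
set.seed(1)

# Simulated, already transformed expression values: 200 genes x 6 samples
n_genes <- 200
group   <- rep(c("A", "B"), each = 3)
expr <- matrix(rnorm(n_genes * length(group)), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("sample", seq_along(group))))
# Shift the first 20 genes in group B so that the groups differ
expr[1:20, group == "B"] <- expr[1:20, group == "B"] + 4

# Hypothetical helper: PCA on the ntop genes with the highest variance
top_variance_pca <- function(mat, ntop = 50) {
  gene_vars <- apply(mat, 1, var)
  ntop      <- min(ntop, nrow(mat))
  selected  <- order(gene_vars, decreasing = TRUE)[seq_len(ntop)]
  # Samples become the observations, hence the transpose
  pca <- prcomp(t(mat[selected, , drop = FALSE]))
  list(pca         = pca,
       percent_var = 100 * pca$sdev^2 / sum(pca$sdev^2),
       genes       = rownames(mat)[selected])
}

res <- top_variance_pca(expr, ntop = 50)

# Counterparts of "x-axis PC", "y-axis PC", "Alpha" and "Points size"
pc_x <- 1
pc_y <- 2
point_cols <- grDevices::adjustcolor(c(A = "steelblue", B = "firebrick")[group],
                                     alpha.f = 0.8)
if (interactive()) {
  plot(res$pca$x[, pc_x], res$pca$x[, pc_y],
       col = point_cols, pch = 19, cex = 1.5,
       xlab = sprintf("PC%d (%.1f%% variance)", pc_x, res$percent_var[pc_x]),
       ylab = sprintf("PC%d (%.1f%% variance)", pc_y, res$percent_var[pc_y]))
}
```

Changing ntop shows how strongly the sample separation depends on the number of genes included in the computation.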

4.2 Plot export settings

The width and height of the exported figures are set here, in cm.

Additional controls available in the single tabs are also assisted by tooltips that appear when hovering with the mouse. They normally relate closely to the plot or output they are placed next to.

5 The task menu

The task menu, accessible by clicking on the cog icon in the upper right part of the application, provides two functionalities:

  • Exit pcaExplorer & save will close the application and store the content of the input and values reactive objects in two list objects made available in the global environment, called pcaExplorer_inputs_YYYYMMDD_HHMMSS and pcaExplorer_values_YYYYMMDD_HHMMSS
  • Save State as .RData will similarly store LiveInputs and r_data in a binary file named pcaExplorerState_YYYYMMDD_HHMMSS.Rdata, without closing the application
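
A saved state file can be inspected later from a regular R session. The sketch below looks for files following the naming pattern given above, picks the most recent one (the timestamp sorts chronologically when sorted as text), and loads it into a separate environment so that nothing in the workspace is overwritten. The function load_latest_state is a hypothetical helper. The object names found inside the file are expected to be those mentioned above, but this has not been verified here.

```r
# Hypothetical helper (not part of pcaExplorer): loads the most recent
# pcaExplorerState_YYYYMMDD_HHMMSS file found in a directory.
load_latest_state <- function(dir = ".") {
  state_files <- list.files(dir,
                            pattern = "^pcaExplorerState_.*\\.R[Dd]ata$",
                            full.names = TRUE)
  if (length(state_files) == 0) {
    stop("No pcaExplorerState_* file found in: ", dir)
  }
  # YYYYMMDD_HHMMSS timestamps sort chronologically as plain text
  latest <- sort(state_files, decreasing = TRUE)[1]

  # Load into a dedicated environment instead of the global one
  state_env <- new.env()
  loaded <- load(latest, envir = state_env)
  message("Loaded ", basename(latest), " containing: ",
          paste(loaded, collapse = ", "))
  state_env
}
```

For example, state <- load_latest_state("path/to/session_folder") followed by ls(state) lists the stored objects, where the path is a placeholder for the folder in which the app was run.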

6 The app panels

The pcaExplorer app is structured in different panels, each focused on a different aspect of the data exploration.

Most of the panels work extensively with click-based and brush-based interactions, to gain additional depth in the explorations, for example by zooming, subsetting, selecting. This is possible thanks to the recent developments in the shiny package/framework.

The available panels are described in the following subsections.

6.1 Data Upload

These file input controls are available when no dds or countmatrix + coldata are provided. Additionally, it is possible to upload the annotation data frame. If the objects are already passed as parameters, or after they have been successfully uploaded, a brief overview/summary for them can be displayed, by clicking on each respective action button.

This panel is where you can perform the preprocessing steps on the data you uploaded/provided:

  • compose the dds object (if you provided countmatrix and coldata)
  • normalize the expression values (using the robust method proposed by Anders and Huber in the original DESeq manuscript)
  • compute the variance stabilizing transformed expression values (stored in the dst object).

As a note regarding the normalization procedure: the normalization method (implemented in estimateSizeFactors) relies on the hypothesis that most genes are not differentially expressed across experimental groups, which holds true for the majority of scenarios. Should this assumption be violated, users can pre-compute sample (and gene) specific normalization factors and store them in the input dds object: the DESeqDataSet object, which pcaExplorer takes as its main data container, can accommodate them.
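
The idea behind this normalization can be illustrated in a few lines of base R. The sketch below is a simplified re-implementation of the median-of-ratios approach on a toy matrix. It is meant to show the principle only and does not replace estimateSizeFactors. The function name median_ratio_size_factors and the toy counts are made up for the example.

```r
# Simplified median-of-ratios size factors (illustration only):
# each sample is compared against a per-gene geometric mean reference.
median_ratio_size_factors <- function(counts) {
  log_counts    <- log(counts)
  log_geo_means <- rowMeans(log_counts)
  # Genes with a zero count in any sample have a -Inf log mean and are skipped
  usable <- is.finite(log_geo_means)
  if (!any(usable)) {
    stop("Every gene contains at least one zero count.")
  }
  apply(log_counts, 2, function(sample_log) {
    exp(median(sample_log[usable] - log_geo_means[usable]))
  })
}

# Toy data: sample2 was sequenced twice as deeply as sample1
counts <- matrix(c( 10,  20,
                   100, 200,
                    50, 100,
                     0,   4,
                    30,  60),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:5), c("sample1", "sample2")))

size_factors <- median_ratio_size_factors(counts)

# Divide each column by its size factor to obtain normalized counts
norm_counts <- sweep(counts, 2, size_factors, "/")

# Shifted logarithm transformation with a pseudocount of 1
shifted_log <- log2(norm_counts + 1)
```

Because the median is taken over all usable genes, a small fraction of strongly changing genes barely affects the size factors. This is why the approach depends on the assumption that most genes are not differentially expressed.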

6.2 Instructions

This is where you might be reading a version of the “Up and running with pcaExplorer” vignette. Additionally, you can easily reach the fully rendered vignettes, either installed locally, or directly from the Bioconductor package page.