To date, several packages have been developed to infer gene coexpression networks from expression data, such as WGCNA (Langfelder and Horvath 2008), CEMiTool (Russo et al. 2018) and petal (Petereit et al. 2016). However, network inference and analysis is a non-trivial task that requires solid statistical background, especially for data preprocessing and proper interpretation of results. Because of that, inexperienced researchers often struggle to choose the most suitable algorithms for their projects. Besides, different packages are required for each step of a standard network analysis, and their distinct syntaxes can hinder interoperability between packages, particularly for non-advanced R users. Here, we have developed an all-in-one R package that uses state-of-the-art algorithms to facilitate the workflow of biological network analysis, from data acquisition to analysis and interpretation. This will likely accelerate network analysis pipelines and advance systems biology research.
if(!requireNamespace('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install("BioNERO")
# Load package after installation
library(BioNERO)
##
set.seed(123) # for reproducibility
For this tutorial, we will use maize (Zea mays) gene expression data
normalized in TPM. The data were obtained from Shin et al. (2020) and were filtered
for package size issues. For more information on the data set, see ?zma.se
.
The data set is stored as a SummarizedExperiment object.1 NOTE: In case you have many tab-separated expression tables in a
directory, BioNERO
has a helper function named dfs2one()
to load all these
files and combine them into a single data frame.
The input expression data in BioNERO
can be both a SummarizedExperiment
object or a gene expression matrix or data frame with genes in rows and samples
in columns. However, we strongly recommend using SummarizedExperiment objects
for easier interoperability with other Bioconductor packages.
data(zma.se)
# Take a quick look at the data
zma.se
## class: SummarizedExperiment
## dim: 10802 28
## metadata(0):
## assays(1): ''
## rownames(10802): ZeamMp030 ZeamMp044 ... Zm00001d054106 Zm00001d054107
## rowData names(0):
## colnames(28): SRX339756 SRX339757 ... SRX2792103 SRX2792104
## colData names(1): Tissue
SummarizedExperiment::colData(zma.se)
## DataFrame with 28 rows and 1 column
## Tissue
## <character>
## SRX339756 endosperm
## SRX339757 endosperm
## SRX339758 endosperm
## SRX339762 endosperm
## SRX339763 endosperm
## ... ...
## SRX2792107 whole_seedling
## SRX2792108 whole_seedling
## SRX2792102 whole_seedling
## SRX2792103 whole_seedling
## SRX2792104 whole_seedling
This section is suitable for users who want to have more control of their data analysis, as they can inspect the data set after each preprocessing step and analyze how different options to the arguments would affect the expression data. If you want a quick start, you can skip to the next section (Automatic, one-step data preprocessing).
Step 1: Replacing missing values. By default, replace_na()
will replace
NAs with 0. Users can also replace NAs with the mean of each row
(generally not advisable, but it can be useful in very specific cases).
exp_filt <- replace_na(zma.se)
sum(is.na(zma.se))
## [1] 0
Step 2: Removing non-expressed genes. Here, for faster network
reconstruction, we will remove every gene whose median value is below 10.
The function’s default for min_exp
is 1.
For other options, see ?remove_nonexp
.
exp_filt <- remove_nonexp(exp_filt, method = "median", min_exp = 10)
dim(exp_filt)
## [1] 8529 28
Step 3 (optional): Filtering genes by variance. It is reasonable to remove genes whose expression values do not vary much across samples, since we often want to find genes that are more or less expressed in particular conditions. Here, we will keep only the top 2000 most variable genes. Users can also filter by percentile (e.g., the top 10% most variable genes).
exp_filt <- filter_by_variance(exp_filt, n = 2000)
dim(exp_filt)
## [1] 2000 28
Step 4: Removing outlying samples. There are several methods to remove
outliers. We have implemented the Z.K (standardized connectivity)
method (Oldham, Langfelder, and Horvath 2012) in ZKfiltering()
, which is a network-based approach to
remove outliers. This method has proven to be more suitable for network
analysis, since it can remove outliers that other methods
(such as hierarchical clustering) cannot identify. By default, BioNERO
considers all samples with ZK < 2 as outliers, but this parameter
is flexible if users want to change it.
exp_filt <- ZKfiltering(exp_filt, cor_method = "pearson")
## Number of removed samples: 1
dim(exp_filt)
## [1] 2000 27
Step 5: Adjusting for confounding artifacts. This is an important step to avoid spurious correlations resulting from confounders. The method was described by Parsana et al. (2019), who developed a principal component (PC)-based correction for confounders. After correction, the expression data are quantile normalized, so every gene follows an approximate normal distribution.
exp_filt <- PC_correction(exp_filt)
Alternatively, users can preprocess their data with a single function.
The function exp_preprocess()
is a wrapper for the functions replace_na()
,
remove_nonexp()
, filter_by_variance()
, ZKfiltering()
and PC_correction()
.
The arguments passed to exp_preprocess()
will be used by each of these
functions to generate a filtered expression data frame in a single step.2 NOTE: Here, we are using TPM-normalized data. If you have expression
data as raw read counts, set the argument vstransform = TRUE
in exp_preprocess()
. This will apply DESeq2’s variance stabilizing
transformation (Love, Huber, and Anders 2014) to your count data.
final_exp <- exp_preprocess(
zma.se, min_exp = 10, variance_filter = TRUE, n = 2000
)
## Number of removed samples: 1
identical(dim(exp_filt), dim(final_exp))
## [1] TRUE
# Take a look at the final data
final_exp
## class: SummarizedExperiment
## dim: 2000 27
## metadata(0):
## assays(1): ''
## rownames(2000): ZeamMp030 ZeamMp092 ... Zm00001d054093 Zm00001d054107
## rowData names(0):
## colnames(27): SRX339756 SRX339757 ... SRX2792103 SRX2792104
## colData names(1): Tissue
BioNERO
includes some functions for easy data exploration. These functions
were created to avoid having to type code chunks that, although small, will be
used many times. The idea here is to make the user experience with biological
network analysis as easy and simple as possible.
Plotting heatmaps: the function plot_heatmap()
plots heatmaps of
correlations between samples or gene expression in a single line.
Besides the arguments users can pass to parameters in plot_heatmap()
,
they can also pass additional arguments to parameters
in ComplexHeatmap::pheatmap()
to have additional control additional on
plot aesthetics (e.g., hide/show gene and sample names, activate/deactivate
clustering for rows and/or columns, etc).
# Heatmap of sample correlations
p <- plot_heatmap(final_exp, type = "samplecor", show_rownames = FALSE)
p
# Heatmap of gene expression (here, only the first 50 genes)
p <- plot_heatmap(
final_exp[1:50, ], type = "expr", show_rownames = FALSE, show_colnames = FALSE
)
p
Principal component analysis (PCA): the function plot_PCA()
performs a
PCA and plots whatever pair of PCs users choose (PC1 and PC2 by default), as
well the percentage of variance explained by each PC.
plot_PCA(final_exp)