Contents

1 Introduction

To date, several packages have been developed to infer gene coexpression networks from expression data, such as WGCNA (Langfelder and Horvath 2008), CEMiTool (Russo et al. 2018) and petal (Petereit et al. 2016). However, network inference and analysis is a non-trivial task that requires solid statistical background, especially for data preprocessing and proper interpretation of results. Because of that, inexperienced researchers often struggle to choose the most suitable algorithms for their projects. Besides, different packages are required for each step of a standard network analysis, and their distinct syntaxes can hinder interoperability between packages, particularly for non-advanced R users. Here, we have developed an all-in-one R package that uses state-of-the-art algorithms to facilitate the workflow of biological network analysis, from data acquisition to analysis and interpretation. This will likely accelerate network analysis pipelines and advance systems biology research.

2 Installation

if(!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')

BiocManager::install("BioNERO")
# Load package after installation
library(BioNERO)
## 
set.seed(123) # for reproducibility

3 Data loading and preprocessing

For this tutorial, we will use maize (Zea mays) gene expression data normalized in TPM. The data were obtained from Shin et al. (2020) and were filtered for package size issues. For more information on the data set, see ?zma.se. The data set is stored as a SummarizedExperiment object.1 NOTE: In case you have many tab-separated expression tables in a directory, BioNERO has a helper function named dfs2one() to load all these files and combine them into a single data frame.

The input expression data in BioNERO can be both a SummarizedExperiment object or a gene expression matrix or data frame with genes in rows and samples in columns. However, we strongly recommend using SummarizedExperiment objects for easier interoperability with other Bioconductor packages.

data(zma.se)

# Take a quick look at the data
zma.se
## class: SummarizedExperiment 
## dim: 10802 28 
## metadata(0):
## assays(1): ''
## rownames(10802): ZeamMp030 ZeamMp044 ... Zm00001d054106 Zm00001d054107
## rowData names(0):
## colnames(28): SRX339756 SRX339757 ... SRX2792103 SRX2792104
## colData names(1): Tissue
SummarizedExperiment::colData(zma.se)
## DataFrame with 28 rows and 1 column
##                    Tissue
##               <character>
## SRX339756       endosperm
## SRX339757       endosperm
## SRX339758       endosperm
## SRX339762       endosperm
## SRX339763       endosperm
## ...                   ...
## SRX2792107 whole_seedling
## SRX2792108 whole_seedling
## SRX2792102 whole_seedling
## SRX2792103 whole_seedling
## SRX2792104 whole_seedling

3.1 Step-by-step data preprocessing

This section is suitable for users who want to have more control of their data analysis, as they can inspect the data set after each preprocessing step and analyze how different options to the arguments would affect the expression data. If you want a quick start, you can skip to the next section (Automatic, one-step data preprocessing).

Step 1: Replacing missing values. By default, replace_na() will replace NAs with 0. Users can also replace NAs with the mean of each row (generally not advisable, but it can be useful in very specific cases).

exp_filt <- replace_na(zma.se)
sum(is.na(zma.se))
## [1] 0

Step 2: Removing non-expressed genes. Here, for faster network reconstruction, we will remove every gene whose median value is below 10. The function’s default for min_exp is 1. For other options, see ?remove_nonexp.

exp_filt <- remove_nonexp(exp_filt, method = "median", min_exp = 10)
dim(exp_filt)
## [1] 8529   28

Step 3 (optional): Filtering genes by variance. It is reasonable to remove genes whose expression values do not vary much across samples, since we often want to find genes that are more or less expressed in particular conditions. Here, we will keep only the top 2000 most variable genes. Users can also filter by percentile (e.g., the top 10% most variable genes).

exp_filt <- filter_by_variance(exp_filt, n = 2000)
dim(exp_filt)
## [1] 2000   28

Step 4: Removing outlying samples. There are several methods to remove outliers. We have implemented the Z.K (standardized connectivity) method (Oldham, Langfelder, and Horvath 2012) in ZKfiltering(), which is a network-based approach to remove outliers. This method has proven to be more suitable for network analysis, since it can remove outliers that other methods (such as hierarchical clustering) cannot identify. By default, BioNERO considers all samples with ZK < 2 as outliers, but this parameter is flexible if users want to change it.

exp_filt <- ZKfiltering(exp_filt, cor_method = "pearson")
## Number of removed samples: 1
dim(exp_filt)
## [1] 2000   27

Step 5: Adjusting for confounding artifacts. This is an important step to avoid spurious correlations resulting from confounders. The method was described by Parsana et al. (2019), who developed a principal component (PC)-based correction for confounders. After correction, the expression data are quantile normalized, so every gene follows an approximate normal distribution.

exp_filt <- PC_correction(exp_filt)

3.2 Automatic, one-step data preprocessing

Alternatively, users can preprocess their data with a single function. The function exp_preprocess() is a wrapper for the functions replace_na(), remove_nonexp(), filter_by_variance(), ZKfiltering() and PC_correction(). The arguments passed to exp_preprocess() will be used by each of these functions to generate a filtered expression data frame in a single step.2 NOTE: Here, we are using TPM-normalized data. If you have expression data as raw read counts, set the argument vstransform = TRUE in exp_preprocess(). This will apply DESeq2’s variance stabilizing transformation (Love, Huber, and Anders 2014) to your count data.

final_exp <- exp_preprocess(
    zma.se, min_exp = 10, variance_filter = TRUE, n = 2000
)
## Number of removed samples: 1
identical(dim(exp_filt), dim(final_exp))
## [1] TRUE

# Take a look at the final data
final_exp
## class: SummarizedExperiment 
## dim: 2000 27 
## metadata(0):
## assays(1): ''
## rownames(2000): ZeamMp030 ZeamMp092 ... Zm00001d054093 Zm00001d054107
## rowData names(0):
## colnames(27): SRX339756 SRX339757 ... SRX2792103 SRX2792104
## colData names(1): Tissue

4 Exploratory data analysis

BioNERO includes some functions for easy data exploration. These functions were created to avoid having to type code chunks that, although small, will be used many times. The idea here is to make the user experience with biological network analysis as easy and simple as possible.

Plotting heatmaps: the function plot_heatmap() plots heatmaps of correlations between samples or gene expression in a single line. Besides the arguments users can pass to parameters in plot_heatmap(), they can also pass additional arguments to parameters in ComplexHeatmap::pheatmap() to have additional control additional on plot aesthetics (e.g., hide/show gene and sample names, activate/deactivate clustering for rows and/or columns, etc).

# Heatmap of sample correlations
p <- plot_heatmap(final_exp, type = "samplecor", show_rownames = FALSE)
p


# Heatmap of gene expression (here, only the first 50 genes)
p <- plot_heatmap(
    final_exp[1:50, ], type = "expr", show_rownames = FALSE, show_colnames = FALSE
)
p

Principal component analysis (PCA): the function plot_PCA() performs a PCA and plots whatever pair of PCs users choose (PC1 and PC2 by default), as well the percentage of variance explained by each PC.

plot_PCA(final_exp)