Introduction and basic pipeline

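The goal of fcoex is to provide a simple and intuitive way to generate co-expression networks and modules for single cell data. It is based on three steps, each covered in this vignette: preparing a preprocessed dataset with cluster labels, discretizing gene expression, and detecting modules with the FCBF algorithm.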

First of all, we will load an already preprocessed single cell dataset from 10XGenomics (preprocessed according to https://osca.bioconductor.org/a-basic-analysis.html#preprocessing-import-to-r, 14/08/2019). It contains peripheral blood mononuclear cells and is restricted to the most variable genes.

library(fcoex, quietly = TRUE)
library(SingleCellExperiment, quietly = TRUE)
data("mini_pbmc3k")

This is the single cell object we will explore in this vignette:

mini_pbmc3k
## class: SingleCellExperiment 
## dim: 1700 600 
## metadata(0):
## assays(2): counts logcounts
## rownames(1700): LYZ S100A9 ... ALG10 LLGL1
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames(600): V1 V2 ... V599 V600
## colData names(14): Sample Barcode ... mod_FCER1G mod_CD3D
## reducedDimNames(2): PCA UMAP
## altExpNames(0):

Creating the fcoex object

Now let’s use the normalized data and the cluster labels to build the co-expression networks. The labels were obtained by Louvain clustering on a graph built from nearest neighbours. These labels are therefore assigned a posteriori and depend on the choices of the analyst.

The fcoex object is created from 2 different pieces: a previously normalized expression table (genes in rows) and a target factor with classes for the cells.

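# Cluster labels for each cell, used as the target factor
target <- colData(mini_pbmc3k)
target <- target$clusters
# Normalized expression table (logcounts assay), genes in rows
exprs <- as.data.frame(assay(mini_pbmc3k, 'logcounts'))

# Build the fcoex object from the expression table and the labels
fc <- new_fcoex(data.frame(exprs), target)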

The first step is the conversion of the expression matrix into a discretized dataframe. The default in fcoex is a simple binarization that works as follows:

For each gene, the maximum and minimum values are stored. This range is divided into n bins of equal width (n is a parameter to be set). The first bin is assigned to the class “low” and all the others to the class “high”.

fc <- discretize(fc, number_of_bins = 8)
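
To make the binarization rule concrete, here is a minimal base R sketch for a single gene. The helper `binarize_gene` is hypothetical and is not part of fcoex; in particular, assigning a value that sits exactly on the bin edge to “low” is our assumption, and the package may treat that boundary differently.

```r
# Hypothetical helper (not an fcoex function): binarize one gene by
# splitting its observed range into equal-width bins.
binarize_gene <- function(x, number_of_bins = 8) {
  lo <- min(x)
  hi <- max(x)
  # Upper edge of the first of the equal-width bins
  cutoff <- lo + (hi - lo) / number_of_bins
  # First bin -> "low", every other bin -> "high"
  # (values exactly on the edge go to "low": an assumption)
  factor(ifelse(x <= cutoff, "low", "high"), levels = c("low", "high"))
}

# Toy expression values for one gene across six cells
expr <- c(0, 0.5, 1, 2, 4, 8)

# Range is 0 to 8, so with 8 bins the first bin ends at 1:
# the first three cells are "low", the last three are "high"
binarize_gene(expr, number_of_bins = 8)
```

With a larger `number_of_bins` the first bin gets narrower, so fewer cells are labelled “low” for that gene.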

Note that many other discretizations are available, from the implementations in the FCBF Bioconductor package. The choice of discretization affects the final results in many ways. However, we found empirically that the default parameter often yields interesting results.

Getting the modules

After the discretization, we proceed to constructing a network and extracting modules. The co-expression adjacency matrix that is generated is modular by construction. All correlations used are calculated via Symmetrical Uncertainty. The procedure has three steps:

1 - Selection of n genes to be considered, ranked by correlation to the target variable.

2 - Detection of predominantly correlated genes, a feature selection approach defined in the FCBF algorithm.

3 - Building of modules around the selected genes. A correlation between two genes is kept if they are more correlated to each other than to the target labels.
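
The steps above can be sketched on toy data in base R. This is an illustration under stated assumptions, not the fcoex implementation: the helpers `entropy` and `symmetrical_uncertainty` are hypothetical, the seed gene is picked by hand instead of by the FCBF selection of step 2, and for step 3 we assume the gene-to-seed correlation is compared against that gene's own correlation to the target. Symmetrical Uncertainty is computed with its standard definition, SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)).

```r
# Shannon entropy (in bits) of a discrete vector
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Symmetrical uncertainty: 2 * I(X; Y) / (H(X) + H(Y)), between 0 and 1
symmetrical_uncertainty <- function(x, y) {
  hx <- entropy(x)
  hy <- entropy(y)
  hxy <- entropy(paste(x, y, sep = "|"))  # joint entropy H(X, Y)
  if (hx + hy == 0) return(0)
  2 * (hx + hy - hxy) / (hx + hy)
}

# Toy data: 8 cells, 2 cluster labels, 3 already binarized genes
target <- c("B", "B", "B", "B", "T", "T", "T", "T")
genes <- data.frame(
  gene_a = c("high", "high", "high", "low", "low", "low", "low", "low"),
  gene_b = c("high", "high", "low", "low", "low", "low", "low", "low"),
  gene_c = c("low", "low", "low", "high", "low", "low", "low", "low"),
  stringsAsFactors = TRUE
)

# Step 1: rank genes by their correlation (SU) to the target labels
su_to_target <- sapply(genes, symmetrical_uncertainty, y = target)
ranked <- sort(su_to_target, decreasing = TRUE)
ranked

# Step 2 is skipped here: the seed is chosen by hand (assumption)
seed <- "gene_a"

# Step 3: keep a gene in the seed's module if it is more correlated
# to the seed than to the target labels
others <- setdiff(names(genes), seed)
su_to_seed <- sapply(genes[others], symmetrical_uncertainty, y = genes[[seed]])
module_members <- others[su_to_seed > su_to_target[others]]
module_members
```

In this toy example gene_b follows gene_a more closely than it follows the cluster labels, so it joins the module, whereas gene_c is more informative about the labels than about gene_a and is left out.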

You can choose between non-parallel processing, which shows a progress bar, and parallel processing, which is faster but shows no progress bar. The is_parallel argument controls this.

fc <- find_cbf_modules(fc, n_genes = 200, verbose = FALSE, is_parallel = FALSE)
## [1] "Number of prospective features =  199"

There are two functions that do the same thing: both get_nets and plot_interactions take the modules and plot networks. You can pick the name you like better. These visualizations were heavily inspired by the CEMiTool package, as was much of the code in fcoex.

We will take a look at the first two networks:

fc <- get_nets(fc)

# Taking a look at the first two networks: 
show_net(fc)[["CD79A"]]