fcoex: co-expression for single-cell data

Introduction and basic pipeline
Integration to Seurat
- Plotting clusters with Seurat
- Detecting anticorrelated genes in the modules

Introduction and basic pipeline

The goal of fcoex is to provide a simple and intuitive way to generate co-expression nets and modules for single cell data. It is based in 3 steps:

Pre-processing and label assignement (prior to fcoex)
Discretization of gene expression
Correlation and module detection via the FCBF algorithm (Fast Correlation-Based Filter)

First of all, we will a already preprocessed single cell dataset from 10XGenomics ( preprocessed according to https://osca.bioconductor.org/a-basic-analysis.html#preprocessing-import-to-r, 14/08/2019). It contains peripheral blood mononuclear cells and the most variable genes.

library(fcoex, quietly = TRUE)
library(SingleCellExperiment, quietly = TRUE)
data("mini_pbmc3k")

This is the single cell object we will explore in this vignette:

mini_pbmc3k

## class: SingleCellExperiment 
## dim: 1700 600 
## metadata(0):
## assays(2): counts logcounts
## rownames(1700): LYZ S100A9 ... ALG10 LLGL1
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames(600): V1 V2 ... V599 V600
## colData names(14): Sample Barcode ... mod_FCER1G mod_CD3D
## reducedDimNames(2): PCA UMAP
## altExpNames(0):

Creating the fcoex object

Now let’s use the normalized data and the cluster labels to build the co-expresison networks. The labels were obtained by louvain clustering on a graph build from nearest neighbours. That means that these labels are a posteriori, and this depends on the choice of the analyst.

The fcoex object is created from 2 different pieces: a previously normalized expression table (genes in rows) and a target factor with classes for the cells.

target <- colData(mini_pbmc3k)
target <- target$clusters
exprs <- as.data.frame(assay(mini_pbmc3k, 'logcounts'))

fc <- new_fcoex(data.frame(exprs),target)

The first step is the conversion of the count matrix into a discretized dataframe. The standar of fcoex is a simple binarization that works as follows:

For each gene, the maximum and minimum values are stored. This range is divided in n bins of equal width (parameter to be set). The first bin is assigned to the class “low” and all the others to the class “high”.

fc <- discretize(fc, number_of_bins = 8)

Getting the modules

Note that many other discretizations are avaible, from the implementations in the FCBF Bioconductor package. This step affects the final results in many ways. However, we found empirically that the default parameter often yields interesting results.

After the discretization, we proceed to constructing a network and extracting modules. The co-expression adjacency matriz generated is modular in its inception. All correlations used are calculated via Symmetrical Uncertainty. Three steps are present:

1 - Selection of n genes to be considered, ranked by correlation to the target variable.

2 - Detection of predominantly correlated genes, a feature selection approach defined in the FCBF algorithm

3 - Building of modules around selected genes. Correlations between two genes are kept if they are more correlated to each other than to the target lables

You can choose either to have a non-parallel processing, with a progress bar, or a faster parallel processing without progress bar. Up to you.

fc <- find_cbf_modules(fc,n_genes = 200, verbose = FALSE, is_parallel = FALSE)

## [1] "Number of prospective features =  199"

There are two functions that do the same: both get_nets and plot_interactions take the modules and plot networks. You can pick the name you like better. These visualizations were heavily inspired by the CEMiTool package, as much of the code in fcoex.

We will take a look at the first two networks

fc <- get_nets(fc)

# Taking a look at the first two networks: 
show_net(fc)[["CD79A"]]

show_net(fc)[["HLA-DRB1"]]

To save the plots, you can run the save plots function, which will create a “./Plots” directory and store plots there.

save_plots(name = "fcoex_vignette", fc,force = TRUE, directory = "./Plots")

Running an enrichment analysis

You can also run over-representation analysis to see if the modules correspond to any known biological pathway. In this example we will use the reactome groups available in the CEMiTool package.

It is likely that some modules will not have any enrichment, leading to messages of the type “no gene can be mapped.”. That is not a problem

gmt_fname <- system.file("extdata", "pathways.gmt", package = "CEMiTool")

gmt_in <- pathwayPCA::read_gmt(gmt_fname)
fc <- mod_ora(fc, gmt_in)

## --> No gene can be mapped....

## --> Expected input gene ID: RPS9,DCN,RPS21,EIF3F,GNG13,PSEN2

## --> return NULL...

## --> No gene can be mapped....

## --> Expected input gene ID: RPL13A,NOTCH3,ABCA2,RPL10,ABCC1,RPL27A

## --> return NULL...

fc <- plot_ora(fc)

Now we can save the plots again. Note that we have to set the force parameter equal to TRUE now, as the “./Plots” directory was already created in the previous step.

save_plots(name = "fcoex_vignette", fc, force = TRUE, directory = "./Plots")

Reclustering the cells to find module-based populations.

There is now a folder with the correlations, to explore the data.

We will use the module assignments to subdivide the cells in populations of interest. This is a way to explore the data and look for possible novel groupings ignored in the previous clustering step.

fc <- recluster(fc)

## [1] "FCER1G"
## [1] "CD3D"
## [1] "HLA-DRB1"
## [1] "AIF1"
## [1] "CST3"
## [1] "CD79A"
## [1] "CST7"
## [1] "CTSS"
## [1] "FCGR3A"
## [1] "CD7"
## [1] "SAT1"
## [1] "S100A6"

We generated new labels based on each fcoex module. Now we will visualize them using UMAP. Let’s see the population represented in the modules CD79A and HLA-DRB1. Notably, the patterns are largely influenced by the patterns of header genes. It is interesting to see that two groups are present, header-positive (HP) and header negative (HN) clusters.

The stratification and exploration of different clustering points of view is one of the core

Plotting clusters with scater

colData(mini_pbmc3k) <- cbind(colData(mini_pbmc3k), `mod_HLA-DRB1` = idents(fc)$`HLA-DRB1`)
colData(mini_pbmc3k) <- cbind(colData(mini_pbmc3k), mod_CD79A = idents(fc)$CD79A)

library(scater)

## Loading required package: ggplot2

# Let's see the original clusters
plotReducedDim(mini_pbmc3k, dimred="UMAP", colour_by="clusters")

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:Biobase':
## 
##     combine

## The following object is masked from 'package:BiocGenerics':
## 
##     combine

p1 <- plotReducedDim(mini_pbmc3k, dimred="UMAP", colour_by="mod_HLA-DRB1")
p2 <- plotReducedDim(mini_pbmc3k, dimred="UMAP", colour_by="HLA-DRB1")
p3 <- plotReducedDim(mini_pbmc3k, dimred="UMAP", colour_by="mod_CD79A")
p4 <- plotReducedDim(mini_pbmc3k, dimred="UMAP", colour_by="CD79A")

grid.arrange(p1, p2, p3, p4, nrow=2)

Integration to Seurat

Another popular framework for single-cell analysis is Seurat. Let’s see how to integrate it with fcoex.

library(Seurat)

## 
## Attaching package: 'Seurat'

## The following object is masked from 'package:SummarizedExperiment':
## 
##     Assays

library(fcoex)
library(ggplot2)
data(pbmc_small)

exprs <- data.frame(GetAssayData(pbmc_small))
target <- Idents(pbmc_small)

fc <- new_fcoex(data.frame(exprs),target)

## Created new fcoex object.

fc <- discretize(fc)
fc <- find_cbf_modules(fc,n_genes = 70, verbose = FALSE, is_parallel = FALSE)

## Getting SU scores

## Running FCBF to find module headers

## [1] "Number of prospective features =  69"

## Calculating adjacency matrix

## Trimming and getting modules from adjacency matrix

fc <- get_nets(fc)

Once again we can use the Reactome pathways from the package CEMiTool to exemplify how to run an enrichment for human genes. It is likely that some modules will not have any enrichment, leading to messages of the type “no gene can be mapped.”. That is not a problem. If you are not working with human genes, you can just skip this part.

gmt_fname <- system.file("extdata", "pathways.gmt", package = "CEMiTool")
gmt_in <- pathwayPCA::read_gmt(gmt_fname)
fc <- mod_ora(fc, gmt_in)

# In Seurat's sample data, pbmc small, no enrichments are found. 
# That is way plot_ora is commented out.

#fc <- plot_ora(fc)

save_plots(name = "fcoex_vignette_Seurat", fc, force = TRUE, directory = "./Plots")

Plotting clusters with Seurat

Now let’s recluster fcoex and visualize the new clusters via the UMAP saved in the Seurat object.

fc <- recluster(fc) 

file_name <- "pbmc3k_recluster_plots.pdf"
directory <- "./Plots/"

pbmc_small <- RunUMAP(pbmc_small, dims = 1:10)

pdf(paste0(directory,file_name), width = 3, height = 3)
print(DimPlot(pbmc_small))
for (i in names(module_genes(fc))){
  Idents(pbmc_small ) <-   idents(fc)[[i]]
  mod_name <- paste0("M", which(names(idents(fc)) == i), " (", i,")")

  plot2 <- DimPlot(pbmc_small, reduction = 'umap', cols = c("darkgreen", "dodgerblue3")) +
    ggtitle(mod_name) 
    print(plot2)
}
dev.off()

The clusters generate by fcoex match possible matches different Seurat clusters. Looking at the HN clusters: M1 matches cluster 1 (likely monocytes), M2 and M4 match clusters 1 and 2 (likely APCs, B + monocytes), M5 matches cluster 2 (likeky B) M7 maches a subset of cluster 0, and as it includes granzymes and granulolysins, this subset of 0 is likely cytotoxic cells (either NK or CD8)

Let’s just take a look at the M2 individually:

  fc <- recluster(fc)

## Detecting clusters for the following modules:

## [1] "S100A8"

## S100A8

## [1] "HLA-DPB1"

## HLA-DPB1

## [1] "TYMP"

## TYMP

## [1] "HLA-DRB1"

## HLA-DRB1

## [1] "MS4A1"

## MS4A1

## [1] "LYZ"

## LYZ

## [1] "CD7"

## CD7

## [1] "IFI6"

## IFI6

## [1] "HLA-DQA2"

## HLA-DQA2

  pbmc_small <- RunUMAP(pbmc_small, dims = 1:10)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 20:10:17 UMAP embedding parameters a = 0.9922 b = 1.112

## 20:10:17 Read 80 rows and found 10 numeric columns

## 20:10:17 Using Annoy for neighbor search, n_neighbors = 30

## 20:10:17 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 20:10:17 Writing NN index file to temp file /tmp/RtmpHpbmmI/file48992a5c098a
## 20:10:17 Searching Annoy index using 1 thread, search_k = 3000
## 20:10:17 Annoy recall = 100%
## 20:10:18 Commencing smooth kNN distance calibration using 1 thread
## 20:10:18 7 smooth knn distance failures
## 20:10:19 Initializing from normalized Laplacian + noise
## 20:10:19 Commencing optimization for 500 epochs, with 2410 positive edges
## 20:10:20 Optimization finished

  Idents(pbmc_small ) <-   target
  p1 <- DimPlot(pbmc_small)
  Idents(pbmc_small ) <-   idents(fc)[["HLA-DRB1"]]
  
  mod_name <- paste0("M", which(names(idents(fc)) == "HLA-DRB1"), " (", "HLA-DRB1",")")

  p2 <- DimPlot(pbmc_small, cols = c("darkgreen", "dodgerblue3")) +
    ggtitle(mod_name) 
  
  # CD79A is a marker of B cells
  CD79A <- FeaturePlot(pbmc_small, "CD79A")

  # AIF1 is a marker of monocytes
  AIF1 <- FeaturePlot(pbmc_small, "AIF1")
  
  
  library(gridExtra)
  grid.arrange(p1, p2, p3,p4, ncol = 2)

Detecting anticorrelated genes in the modules

As the dataset used here is a small subset of the original, some cells might be in unexpected clusters.

The modules capture also negative correlations. These can be especially interesting, as they point to complementary cell types. Let’s look if the module genes are enriched in the cluster HP ( header positive, in blue) or HN (header negative, in green).

for (i in names(module_genes(fc))){
Idents(pbmc_small ) <-   fc@mod_idents[[i]]

# This bit prints which gene in the module belongs to each cluster. 
# HP is the header-positive cluster (containing SOX19A), HN is the header negative cluster (not containing SOX19A)
# The "features = fc@module_list[[i]]" parameter tells Seurat to compare only the genes in the module "i"
# By removing this parameter, you can potentially expand the list that was retrieved originally by fcoex

# Run only for module genes:
module_genes_in_clusters <- FindAllMarkers(pbmc_small, logfc.threshold = 1, only.pos = TRUE, features = fc@module_list[[i]] )

if("HN" %in% module_genes_in_clusters$cluster){
module_genes_in_clusters$module = i
message(paste0("anticorrelated genes found for module ", i))
print(module_genes_in_clusters) 
}
}

##                 p_val avg_logFC pct.1 pct.2    p_val_adj cluster     gene
## TUBB1    1.642052e-03  4.236709 0.225 0.000 3.776720e-01      HN    TUBB1
## GP9      1.642052e-03  3.744318 0.225 0.000 3.776720e-01      HN      GP9
## NGFRAP1  1.642052e-03  3.347207 0.225 0.000 3.776720e-01      HN  NGFRAP1
## HLA-DRB1 2.464537e-15  4.372935 0.975 0.050 5.668436e-13      HP HLA-DRB1
## HLA-DRA  2.981826e-13  3.112174 0.975 0.400 6.858199e-11      HP  HLA-DRA
## HLA-DRB5 1.046771e-12  3.417522 0.875 0.025 2.407574e-10      HP HLA-DRB5
## HLA-DPA1 1.764119e-12  3.150101 0.900 0.150 4.057473e-10      HP HLA-DPA1
## HLA-DQB1 6.587968e-10  3.178397 0.725 0.025 1.515233e-07      HP HLA-DQB1
## HLA-DMB  1.089737e-07  2.583729 0.600 0.025 2.506394e-05      HP  HLA-DMB
## HLA-DQA1 2.018550e-07  3.085995 0.600 0.050 4.642664e-05      HP HLA-DQA1
## HLA-DMA  1.073164e-06  2.229396 0.575 0.050 2.468277e-04      HP  HLA-DMA
## CFP      2.045245e-06  1.872703 0.575 0.050 4.704064e-04      HP      CFP
## LY86     2.536648e-05  2.724763 0.425 0.025 5.834291e-03      HP     LY86
## LGALS2   4.185935e-05  1.869678 0.475 0.050 9.627650e-03      HP   LGALS2
## HVCN1    4.187025e-04  3.688648 0.275 0.000 9.630157e-02      HP    HVCN1
## ASGR1    1.642052e-03  2.153305 0.225 0.000 3.776720e-01      HP    ASGR1
##            module
## TUBB1    HLA-DRB1
## GP9      HLA-DRB1
## NGFRAP1  HLA-DRB1
## HLA-DRB1 HLA-DRB1
## HLA-DRA  HLA-DRB1
## HLA-DRB5 HLA-DRB1
## HLA-DPA1 HLA-DRB1
## HLA-DQB1 HLA-DRB1
## HLA-DMB  HLA-DRB1
## HLA-DQA1 HLA-DRB1
## HLA-DMA  HLA-DRB1
## CFP      HLA-DRB1
## LY86     HLA-DRB1
## LGALS2   HLA-DRB1
## HVCN1    HLA-DRB1
## ASGR1    HLA-DRB1

There seems that only in the module HLA-DRB1 we have negative correlations. Let’s visualize then

 TUBB1 <- FeaturePlot(pbmc_small, "TUBB1")
 DRB1 <-  FeaturePlot(pbmc_small, "HLA-DRB1")
  
  
  library(gridExtra)
  grid.arrange(p1, p2, TUBB1, DRB1, ncol = 2)