fcoex: co-expression for single-cell data integrated with Seurat

Integration to Seurat

This vignette is targeted at Seurat users that want to use fcoex.

A more in depth explanation of each fcoex step is available in the general vignette, which is based on bioconductor packages.

To integrate a pre-processed Seurat object with fcoex, it is only necessary to obtain the normalized expression matrix and the cluster identities, and then run the pipeline:

library(Seurat)
library(fcoex)
library(ggplot2)

data(pbmc_small)

exprs <- data.frame(GetAssayData(pbmc_small))
target <- Idents(pbmc_small)

fc <- new_fcoex(data.frame(exprs),target)

## Created new fcoex object.

fc <- discretize(fc)
fc <- find_cbf_modules(fc,n_genes = 70, verbose = FALSE, is_parallel = FALSE)

## Getting SU scores for each gene

## Running FCBF to find module headers

## [1] "Number of features features =  230"
## [1] "Number of prospective features =  69"
## [1] "Number of final features =  9"

## Calculating adjacency matrix

## Trimming and getting modules from adjacency matrix

## 9 modules were found.

fc <- get_nets(fc)

Let’s take a look at the modules headers:

mod_names(fc)

## [1] "S100A8"   "HLA-DPB1" "TYMP"     "HLA-DRB1" "MS4A1"    "LYZ"      "CD7"     
## [8] "IFI6"     "HLA-DQA2"

S100A8 is a marker of some granulocytes and MS4A1 is a markers of B cells. Let’s take a look at those two modules.

mod_names(fc)

## [1] "S100A8"   "HLA-DPB1" "TYMP"     "HLA-DRB1" "MS4A1"    "LYZ"      "CD7"     
## [8] "IFI6"     "HLA-DQA2"

network_plots <- show_net(fc)

network_plots[["S100A8"]]

network_plots[["MS4A1"]]

Once again we can use the Reactome pathways from the package CEMiTool to exemplify how to run an enrichment for human genes. It is likely that some modules will not have any enrichment, leading to messages of the type “no gene can be mapped.”. That is not a problem. If you are not working with human genes, you can just skip this part.

gmt_fname <- system.file("extdata", "pathways.gmt", package = "CEMiTool")
gmt_in <- pathwayPCA::read_gmt(gmt_fname)
fc <- mod_ora(fc, gmt_in)

# In Seurat's sample data, pbmc small, no enrichments are found. 
# That is way plot_ora is commented out.

# fc <- plot_ora(fc)

save_plots(name = "fcoex_vignette_Seurat", fc, force = TRUE, directory = "./Plots")

Plotting clusters with Seurat

One application of fcoex is to find overlapping populations of cells in the dataset, by employing a module-based view of the gene expression landscape.

Why reclustering? Well, the algorithm behind fcoex considers at the same time inverse and direct correlations. Thus, it is nonsense to obtain a plot of the average expression of the genes in a module, for example.
The reclustering allows us to gather all the signal in the dataset, positive and negative, to see how cells in the dataset behave in relation to that subpart of the transcriptomic space.

Now let’s recluster the cells for some of the fcoex module and visualize it in Seurat .

The cells will be divided on two groups: header positive (HP) and header negative (HN). For details on why, see the next session on anti-correlated genes.

library(gridExtra)

fc <- recluster(fc)

## Detecting clusters for the following modules:

## [1] "S100A8"

## S100A8

## [1] "HLA-DPB1"

## HLA-DPB1

## [1] "TYMP"

## TYMP

## [1] "HLA-DRB1"

## HLA-DRB1

## [1] "MS4A1"

## MS4A1

## [1] "LYZ"

## LYZ

## [1] "CD7"

## CD7

## [1] "IFI6"

## IFI6

## [1] "HLA-DQA2"

## HLA-DQA2

# Running UMAP to obtain layout for cells
pbmc_small <- RunUMAP(pbmc_small, dims = 1:10)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 17:32:42 UMAP embedding parameters a = 0.9922 b = 1.112

## 17:32:42 Read 80 rows and found 10 numeric columns

## 17:32:42 Using Annoy for neighbor search, n_neighbors = 30

## 17:32:42 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 17:32:42 Writing NN index file to temp file /tmp/Rtmp69dsdH/file19cbde29e13866
## 17:32:42 Searching Annoy index using 1 thread, search_k = 3000
## 17:32:42 Annoy recall = 100%
## 17:32:44 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 17:32:44 7 smooth knn distance failures
## 17:32:46 Initializing from normalized Laplacian + noise (using irlba)
## 17:32:46 Commencing optimization for 500 epochs, with 2410 positive edges
## 17:32:48 Optimization finished

plot1 <- DimPlot(pbmc_small)

pbmc_reclustered <- pbmc_small

# S100A8 is a marker of some monocytes

Idents(pbmc_reclustered) <- idents(fc)[["S100A8"]]
plot2  <- DimPlot(pbmc_reclustered, reduction = 'umap', cols = c("darkgreen", "dodgerblue3")) +
  ggtitle("S100A8")
    

# HLA-DPB1 is a marker of Antigen Presenting Cells
Idents(pbmc_reclustered) <- idents(fc)[["HLA-DPB1"]]
plot3  <- DimPlot(pbmc_reclustered, reduction = 'umap', cols = c("darkgreen", "dodgerblue3")) +
  ggtitle("HLA-DPB1")

# MS4A1 is a marker of B cells
Idents(pbmc_reclustered) <- idents(fc)[["MS4A1"]]
plot4  <- DimPlot(pbmc_reclustered, reduction = 'umap', cols = c("darkgreen", "dodgerblue3")) +
  ggtitle("MS4A1")

grid.arrange(plot1, plot2, plot3, plot4, nrow=2)

If you notice the “HP” clusters, you will see that the “HLA-DPB1” recluster covers roughly the same cells as the combination of the “S100A8” recluster and the “MS4A1” reclusters. This is expected, as both B cells and monocytes are antigen presenting cells.

Notice that these overlapped insights are not possible looking by one single flat output.

Detecting anticorrelated genes in the modules

The fcoex modules capture also negative correlations. hese can be especially interesting, as they point to complementary cell classes.

Each module has one “header” gene. The other genes in the module are either in the same cells of the header (thus, header-positive cells, or HP), or present in the other cells of the dataset (header negative).

For all the modules, we will check to see if such anti-correlations are present .

for (i in names(module_genes(fc))) {
  Idents(pbmc_reclustered) <-   fc@mod_idents[[i]]
  
  print(paste("Checking for anticorrelation in module", i))
 
  # Identify markers only for module genes:
  module_genes_in_clusters <-
    FindAllMarkers(
      pbmc_reclustered,
      logfc.threshold = 1,
      only.pos = TRUE,
      features = fc@module_list[[i]],
      verbose = FALSE
    )
  
  # If there are markers of the HN cluster, it means that we have anticorrelation
  if ("HN" %in% module_genes_in_clusters$cluster) {
    module_genes_in_clusters$module = i
    message(paste0("anticorrelated genes found for module ", i))
  }
}

## [1] "Checking for anticorrelation in module S100A8"
## [1] "Checking for anticorrelation in module HLA-DPB1"
## [1] "Checking for anticorrelation in module TYMP"
## [1] "Checking for anticorrelation in module HLA-DRB1"
## [1] "Checking for anticorrelation in module MS4A1"
## [1] "Checking for anticorrelation in module LYZ"
## [1] "Checking for anticorrelation in module CD7"
## [1] "Checking for anticorrelation in module IFI6"
## [1] "Checking for anticorrelation in module HLA-DQA2"

There seems that only in the module HLA-DRB1 we have negative correlations. Let’s visualize then

Idents(pbmc_reclustered) <-   fc@mod_idents[["HLA-DRB1"]]

markers_for_cluster_HLA_DRB1 <-
    FindAllMarkers(
      pbmc_reclustered,
      logfc.threshold = 1,
      only.pos = TRUE,
      features = fc@module_list[["HLA-DRB1"]],
      verbose = FALSE
    )

print(markers_for_cluster_HLA_DRB1)

##                 p_val avg_log2FC pct.1 pct.2    p_val_adj cluster     gene
## TUBB1    1.642052e-03   6.112278 0.225 0.000 3.776720e-01      HN    TUBB1
## GP9      1.642052e-03   5.401909 0.225 0.000 3.776720e-01      HN      GP9
## NGFRAP1  1.642052e-03   4.828998 0.225 0.000 3.776720e-01      HN  NGFRAP1
## HLA-DRB1 2.464537e-15   6.308812 0.975 0.050 5.668436e-13      HP HLA-DRB1
## HLA-DRA  2.981826e-13   4.489918 0.975 0.400 6.858199e-11      HP  HLA-DRA
## HLA-DRB5 1.046771e-12   4.930441 0.875 0.025 2.407574e-10      HP HLA-DRB5
## HLA-DPA1 1.764119e-12   4.544636 0.900 0.150 4.057473e-10      HP HLA-DPA1
## HLA-DQB1 6.587968e-10   4.585457 0.725 0.025 1.515233e-07      HP HLA-DQB1
## HLA-DMB  1.089737e-07   3.727532 0.600 0.025 2.506394e-05      HP  HLA-DMB
## HLA-DQA1 2.018550e-07   4.452150 0.600 0.050 4.642664e-05      HP HLA-DQA1
## HLA-DMA  1.073164e-06   3.216338 0.575 0.050 2.468277e-04      HP  HLA-DMA
## CFP      2.045245e-06   2.701739 0.575 0.050 4.704064e-04      HP      CFP
## LY86     2.536648e-05   3.931002 0.425 0.025 5.834291e-03      HP     LY86
## LGALS2   4.185935e-05   2.697375 0.475 0.050 9.627650e-03      HP   LGALS2
## NFKBIA   1.147847e-04   1.123156 0.800 0.275 2.640049e-02      HP   NFKBIA
## HVCN1    4.187025e-04   5.321594 0.275 0.000 9.630157e-02      HP    HVCN1
## ASGR1    1.642052e-03   3.106563 0.225 0.000 3.776720e-01      HP    ASGR1

We can see that the HLA genes are in the “HP” direction and TUBB1, GP9 and NGFRAP1 are in the opposing direction.

Let’s look at the expression patterns in the UMAP plot.

TUBB1 <- FeaturePlot(pbmc_small, "TUBB1")
DRB1 <-  FeaturePlot(pbmc_small, "HLA-DRB1")

grid.arrange(TUBB1, DRB1, ncol = 2)

The completeness is not perfect (it is a small example dataset), but we can already see the disjoint pattern of expression.

The anticorrelations provide an additional layer of exploration of the dataset, and might indicate complementary populations of cells, like senders and receivers, or disjoint expression programs.

fcoex: co-expression for single-cell data integrated with Seurat

Tiago Lubiana

Integration to Seurat

Plotting clusters with Seurat

Detecting anticorrelated genes in the modules