CelliD Vignette

Akira Cortal1 and Antonio Rausell1*

1Paris University, Imagine Institute, 75015 Paris, France, EU

19 May 2021

Abstract

CelliD is a clustering-free multivariate statistical method for the robust extraction of per-cell gene signatures from single-cell RNA-seq. CelliD allows unbiased cell identity recognition across different donors, tissues-of-origin, model organisms and single-cell omics protocols. We present here the main functionalities of the CelliD R package through several use cases. This vignette illustrates the use of the CelliD R package v0.99 available at https://github.com/RausellLab/CelliD.

Package

CelliD 1.0.0

1 Contact
2 Installation
- 2.1 Common installation issues
3 Data and libraries
- 3.1 Load libraries
- 3.2 Load Data
4 Data input and preprocessing steps
5 CelliD dimensionality reduction through MCA
- 5.1 Alternative state-of-the-art dimensionality reduction techniques
6 CelliD automatic cell type prediction using pre-established marker lists
7 CelliD cell-to-cell matching and label transferring across datasets
8 CelliD per-cell functionnal annotation against functional ontologies and pathway databases
Session info

1 Contact

akira.cortal@institutimagine.org antonio.rausell@institutimagine.org

2 Installation

CelliD is based on R version >= 4.1. It contains dependencies with several CRAN and Biocondutor packages as described in the Description file

To install CelliD, set the repositories option to enable downloading Bioconductor dependencies:

if(!require("tidyverse")) install.packages("tidyverse")
if(!require("ggpubr")) install.packages("ggpubr")
BiocManager::install("CelliD")

2.1 Common installation issues

macOS users might experience installation issues related to Gfortran library. To solve this, download and install the appropriate gfortran dmg file from https://github.com/fxcoudert/gfortran-for-macOS

3 Data and libraries

3.1 Load libraries

library(CelliD)
library(tidyverse) # general purpose library for data handling
library(ggpubr) #library for plotting

3.2 Load Data

To illustrate CelliD usage we will use throughout this vignette two publicly available pancreas single-cell RNA-seq data sets provided in Baron et al. 2016 and Segerstolpe et al. 2016. We provide here for convenience the R objects with the raw counts gene expression matrices and associated metadata

#read data
BaronMatrix   <- readRDS(url("https://storage.googleapis.com/cellid-cbl/BaronMatrix.rds"))
BaronMetaData <- readRDS(url("https://storage.googleapis.com/cellid-cbl/BaronMetaData.rds"))
SegerMatrix   <- readRDS(url("https://storage.googleapis.com/cellid-cbl/SegerstolpeMatrix.rds"))
SegerMetaData <- readRDS(url("https://storage.googleapis.com/cellid-cbl/SegerstolpeMetaData2.rds"))

4 Data input and preprocessing steps

4.1 Restricting the analysis to protein-coding genes

While CelliD can handle all types of genes, we recommend restricting the analysis to protein-coding genes. CelliD package provides two gene lists (HgProteinCodingGenes and MgProteinCodingGenes) containing a background set of 19308 and 21914 protein-coding genes from human and mouse, respectively, obtained from BioMart Ensembl release 100, version April 2020 (GrCH38.p13 for human, and GRCm38.p6 for mouse). Gene identifiers here correspond to official gene symbols.

# Restricting to protein-coding genes
data("HgProteinCodingGenes")
BaronMatrixProt <- BaronMatrix[rownames(BaronMatrix) %in% HgProteinCodingGenes,]
SegerMatrixProt <- SegerMatrix[rownames(SegerMatrix) %in% HgProteinCodingGenes,]

4.2 Data input formats

CelliD use as input single cell data in the form of specific S4objects. Currently supported files are SingleCellExperiment from Bioconductor and Seurat Version 3 from CRAN. For downstream analyses, gene identifiers corresponding to official gene symbols are used.

4.3 Creating a Seurat object, and processing and normalization of the gene expression matrix

Library size normalization is carried out by rescaling counts to a common library size of 10000. Log transformation is performed after adding a pseudo-count of 1. Genes expressing at least one count in less than 5 cells are removed. These steps mimic the standard Seurat workflow described here.

# Create Seurat object  and remove remove low detection genes
Baron <- CreateSeuratObject(counts = BaronMatrixProt, project = "Baron", min.cells = 5, meta.data = BaronMetaData)
Seger <- CreateSeuratObject(counts = SegerMatrixProt, project = "Segerstolpe", min.cells = 5, meta.data = SegerMetaData)

# Library-size normalization, log-transformation, and centering and scaling of gene expression values
Baron <- NormalizeData(Baron)
Baron <- ScaleData(Baron, features = rownames(Baron))

While this vignette only illustrates the use of CelliD on single-cell RNA-seq data, we note that the package can handle other types of gene matrices, e.g sci-ATAC gene activity score matrices.

5 CelliD dimensionality reduction through MCA

CelliD is based on Multiple Correspondence Analysis (MCA), a multivariate method that allows the simultaneous representation of both cells and genes in the same low dimensional vector space. In such space, euclidean distances between genes and cells are computed, and per-cell gene rankings are obtained. The top ‘n’ closest genes to a given cell will be defined as its gene signature. Per-cell gene signatures will be obtained in a later section of this vignette.

To perform MCA dimensionality reduction, the command RunMCA is used:

Baron <- RunMCA(Baron)

## Computing Fuzzy Matrix

## 6.004 sec elapsed

## Computing SVD

## 129.918 sec elapsed

## Computing Coordinates

## 26.218 sec elapsed

The DimPlotMC command allows to visualize both cells and selected gene lists in MCA low dimensional space.

DimPlotMC(Baron, reduction = "mca", group.by = "cell.type", features = c("CTRL", "INS", "MYZAP", "CDH11"), as.text = TRUE) + ggtitle("MCA with some key gene markers")

## Warning: Ignoring unknown aesthetics: text

## Warning: Ignoring unknown aesthetics: text

In the previous plot we colored cells according to the pre-established cell type annotations available as metadata (“cell.type”) and as provided in Baron et al. 2016. No clustering step was performed here. In the common scenario where such annotations were not available, the DimPlotMC function would represent cells as red colored dots, and genes as black crosses. To represent all genes in the previous plot, just remove the “features” parameter from the previous command, so that the DimPlotMC function takes the default value.

5.1 Alternative state-of-the-art dimensionality reduction techniques

For the sake of comparisson, state-of-the-art dimensionality reduction techniques such as PCA, UMAP and tSNE can be obtained as follows:

Baron <- RunPCA(Baron, features = rownames(Baron))

## PC_ 1 
## Positive:  IFITM3, ZFP36L1, CDC42EP1, TMSB10, PMEPA1, YBX3, SOX4, IL32, TMSB4X, RHOC 
##     S100A11, TACSTD2, TM4SF1, KRT7, IFITM2, CD44, FLNA, MSN, S100A16, MYH9 
##     SDC4, LGALS3, SERPING1, PRSS8, SERPINA3, ITGA2, COL18A1, GSTP1, SAT1, DDIT4 
## Negative:  PCSK1N, CPE, SCG5, PTPRN, CHGA, CHGB, GNAS, SCGN, SCG2, CD99 
##     PAM, PEMT, VGF, PPP1R1A, BEX1, PCSK2, KIAA1324, TTR, PAX6, IDS 
##     NLRP1, SLC30A8, EEF1A2, UCHL1, GAD2, APLP1, SYT7, NEUROD1, ERO1B, TMOD1 
## PC_ 2 
## Positive:  CELA3A, PRSS1, CELA3B, CTRC, CTRB1, PRSS2, CLPS, CPA1, PNLIPRP1, PLA2G1B 
##     CPA2, CELA2A, CTRB2, CPB1, KLK1, SYCN, REG1A, PNLIP, REG1B, GP2 
##     CUZD1, SPINK1, PRSS3, CTRL, CELA2B, FOXD2, REG3G, MGST1, PDIA2, AMY2A 
## Negative:  PSAP, IGFBP7, COL18A1, APP, CALR, ITGB1, PKM, LGALS3BP, CD81, ITM2B 
##     GNAI2, GRN, APLP2, CTSD, CDK2AP1, LAMP1, PXDN, SPARC, ATP1A1, MMP14 
##     ITGA5, PLXNB2, TIMP1, FSTL1, COL4A1, CALD1, CTSB, NBL1, HSPG2, FLNA 
## PC_ 3 
## Positive:  SPARC, COL4A1, ENG, IGFBP4, COL6A2, PDGFRB, BGN, PXDN, COL15A1, COL3A1 
##     COL1A2, CYGB, AEBP1, ADAMTS4, COL6A3, HTRA3, EMILIN1, LRRC32, CRISPLD2, COL1A1 
##     COL5A3, C11orf96, LAMA4, MMP2, COL6A1, COL5A1, ITGA1, TIMP3, HIC1, COL5A2 
## Negative:  KRT19, TACSTD2, KRT7, SPINT2, ELF3, KRT8, ANXA4, PRSS8, SERINC2, SPINT1 
##     CDH1, LCN2, CLDN7, CLDN4, CFB, CD24, SERPINA5, CFTR, KRT18, CLDN1 
##     MMP7, LGALS4, ST14, C3, SDC4, LAD1, S100A14, SLC4A4, PTPRF, ABCC3 
## PC_ 4 
## Positive:  INS, HADH, IAPP, PCSK1, ADCYAP1, TIMP2, IGF2, DLK1, GSN, NPTX2 
##     TSPAN13, PDX1, VAT1L, RPS25, WLS, RPL7, RPL24, MAFA, SLC6A6, SYT13 
##     ELMO1, DNAJB9, UCHL1, RPL3, RGS16, RPS6, RBP4, PKIB, TGFBR3, RPL17 
## Negative:  GCG, GC, TM4SF4, CLU, TMEM176B, IRX2, TTR, TMEM176A, FEV, F10 
##     FXYD5, CRYBA2, SMIM24, SERPINA1, FXYD3, PCSK2, ARX, ALDH1A1, LOXL4, B2M 
##     PAPPA2, CD82, GPX3, FXYD6, GLS, MUC13, VSTM2L, EGFL7, RGS4, COTL1 
## PC_ 5 
## Positive:  COL6A3, COL1A2, PDGFRB, COL1A1, BGN, COL3A1, CRLF1, COL5A1, SFRP2, CYGB 
##     THBS2, EMILIN1, COL6A1, FMOD, VCAN, COL6A2, PRRX1, COL5A2, COL5A3, ADAMTS12 
##     CRISPLD2, LAMC3, AEBP1, NOTCH3, MXRA8, LUM, VSTM4, TPM2, INHBA, FN1 
## Negative:  PLVAP, PECAM1, CD93, FLT1, KDR, SOX18, PODXL, ACVRL1, RGCC, VWF 
##     ROBO4, TIE1, ESM1, ADGRL4, ESAM, ECSCR, S1PR1, SEMA3F, CXCR4, CLDN5 
##     PASK, NOTCH4, ERG, DYSF, CDH5, CYP1A1, ANGPT2, BCL6B, IFI27, PTPRB

Baron <- RunUMAP(Baron, dims = 1:30)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 17:22:28 UMAP embedding parameters a = 0.9922 b = 1.112

## 17:22:28 Read 8569 rows and found 30 numeric columns

## 17:22:28 Using Annoy for neighbor search, n_neighbors = 30

## 17:22:28 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 17:22:29 Writing NN index file to temp file /tmp/Rtmpetmb35/file2d7133103fe6c1
## 17:22:29 Searching Annoy index using 1 thread, search_k = 3000
## 17:22:32 Annoy recall = 100%
## 17:22:33 Commencing smooth kNN distance calibration using 1 thread
## 17:22:34 Initializing from normalized Laplacian + noise
## 17:22:35 Commencing optimization for 500 epochs, with 351608 positive edges
## 17:22:47 Optimization finished

Baron <- RunTSNE(Baron, dims = 1:30)
PCA  <- DimPlot(Baron, reduction = "pca", group.by = "cell.type")  + ggtitle("PCA") + theme(legend.text = element_text(size =10), aspect.ratio = 1)
tSNE <- DimPlot(Baron, reduction = "tsne", group.by = "cell.type")+ ggtitle("tSNE") + theme(legend.text = element_text(size =10), aspect.ratio = 1)
UMAP <- DimPlot(Baron, reduction = "umap", group.by = "cell.type") + ggtitle("UMAP") + theme(legend.text = element_text(size =10), aspect.ratio = 1)
MCA <- DimPlot(Baron, reduction = "mca", group.by = "cell.type")  + ggtitle("MCA") + theme(legend.text = element_text(size =10), aspect.ratio = 1)
ggarrange(PCA, MCA, common.legend = TRUE, legend = "top")

ggarrange(tSNE, UMAP, common.legend = TRUE, legend = "top")

6 CelliD automatic cell type prediction using pre-established marker lists

At this stage, CelliD can perform an automatic cell type prediction for each cell in the dataset. For that purpose, prototypical marker lists associated with well-characterized cell types are used as input, as obtained from third-party sources. Here we will use the Panglao database of curated gene signatures to predict the cell type of each individual cell in the Baron data.

We will illustrate the procedure with two collections of cell-type gene signatures: first restricting the assessment to known pancreatic cell types, and second, a more challenging and unbiased scenario where all cell types in the database will be evaluated. Alternative gene signature databases and/or custom made marker lists can be used by adapting their input format as described below. The quality of the predictions is obviously highly dependent on the quality of the cell type signatures.

6.1 Obtaining pancreatic cell-type gene signatures

# download all cell-type gene signatures from panglaoDB
panglao <- read_tsv("https://panglaodb.se/markers/PanglaoDB_markers_27_Mar_2020.tsv.gz")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   species = col_character(),
##   `official gene symbol` = col_character(),
##   `cell type` = col_character(),
##   nicknames = col_character(),
##   `ubiquitousness index` = col_double(),
##   `product description` = col_character(),
##   `gene type` = col_character(),
##   `canonical marker` = col_double(),
##   `germ layer` = col_character(),
##   organ = col_character(),
##   sensitivity_human = col_double(),
##   sensitivity_mouse = col_double(),
##   specificity_human = col_double(),
##   specificity_mouse = col_double()
## )

# restricting the analysis to pancreas specific gene signatues
panglao_pancreas <- panglao %>% filter(organ == "Pancreas")

# restricting to human specific genes
panglao_pancreas <- panglao_pancreas %>%  filter(str_detect(species,"Hs"))

# converting dataframes into a list of vectors, which is the format needed as input for CelliD
panglao_pancreas <- panglao_pancreas %>%  
  group_by(`cell type`) %>%  
  summarise(geneset = list(`official gene symbol`))
pancreas_gs <- setNames(panglao_pancreas$geneset, panglao_pancreas$`cell type`)

6.2 Obtaining gene signatures for all cell types in the Panglao database

#filter to get human specific genes
panglao_all <- panglao %>%  filter(str_detect(species,"Hs"))

# convert dataframes to a list of named vectors which is the format for CelliD input
panglao_all <- panglao_all %>%  
  group_by(`cell type`) %>%  
  summarise(geneset = list(`official gene symbol`))
all_gs <- setNames(panglao_all$geneset, panglao_all$`cell type`)

#remove very short signatures
all_gs <- all_gs[sapply(all_gs, length) >= 10]

6.3 Assessing per-cell gene signature enrichments against pre-established marker lists

A per-cell assessment is performed, where the enrichment of each cell’s gene signature against each cell-type marker lists is evaluated through hypergeometric tests. No intermediate clustering steps are used here. By default, the size n of the cell’s gene signature is set to n.features = 200

By default, only reference gene sets of size ≥10 are considered. In addition, hypergeometric test p-values are corrected by multiple testing for the number of gene sets evaluated. A cell is considered as enriched in those gene sets for which the hypergeometric test p-value is <1e-02 (-log10 corrected p-value >2), after Benjamini Hochberg multiple testing correction. Default settings can be modified within the RunCellHGT function.

The RunCellHGT function will provide the -log10 corrected p-value for each cell and each signature evaluated, so a multi-class evaluation is enabled. When a disjointed classification is required, a cell will be assigned to the gene set with the lowest significant corrected p-value. If no significant hits are found, a cell will remain unassigned.

# Performing per-cell hypergeometric tests against the gene signature collection
HGT_pancreas_gs <- RunCellHGT(Baron, pathways = pancreas_gs, dims = 1:50, n.features = 200)

## 
## calculating distance

## ranking genes

## 10 pathways kept for hypergeometric test out of 10, 0 filtered as less than 10 features was present in the data

## 
## calculating features overlap

## performing hypergeometric test

# For each cell, assess the signature with the lowest corrected p-value (max -log10 corrected p-value)
pancreas_gs_prediction <- rownames(HGT_pancreas_gs)[apply(HGT_pancreas_gs, 2, which.max)]

# For each cell, evaluate if the lowest p-value is significant
pancreas_gs_prediction_signif <- ifelse(apply(HGT_pancreas_gs, 2, max)>2, yes = pancreas_gs_prediction, "unassigned")

# Save cell type predictions as metadata within the Seurat object
Baron$pancreas_gs_prediction <- pancreas_gs_prediction_signif

The previous cell type predictions can be visualized on any low-dimensionality representation of choice, as illustrated here using tSNE plots

# Comparing the original labels with CelliD cell-type predictions based on pancreas-specific gene signatures
color <- c("#F8766D", "#E18A00", "#BE9C00", "#8CAB00", "#24B700", "#00BE70", "#00C1AB", "#00BBDA", "#00ACFC", "#8B93FF", "#D575FE", "#F962DD", "#FF65AC", "grey")
ggcolor <- setNames(color,c(sort(unique(Baron$cell.type)), "unassigned"))
OriginalPlot <- DimPlot(Baron, reduction = "tsne", group.by = "cell.type") + 
  scale_color_manual(values = ggcolor) + 
  theme(legend.text = element_text(size =10), aspect.ratio = 1)
Predplot1 <- DimPlot(Baron, reduction = "tsne", group.by = "pancreas_gs_prediction") + 
  scale_color_manual(values = ggcolor) + 
  theme(legend.text = element_text(size =10), aspect.ratio = 1)
ggarrange(OriginalPlot, Predplot1, legend = "top",common.legend = TRUE)

From an unbiased perspective, CelliD cell type prediction can be performed using as input a comprehensive set of cell types that are not necessarily restricted to the organ or tissue under study. To illustrate this scenario all cell types in the Panglao database can be evaluated at once.

HGT_all_gs <- RunCellHGT(Baron, pathways = all_gs, dims = 1:50)

## 
## calculating distance

## ranking genes

## 141 pathways kept for hypergeometric test out of 151, 10 filtered as less than 10 features was present in the data

## 
## calculating features overlap

## performing hypergeometric test

all_gs_prediction <- rownames(HGT_all_gs)[apply(HGT_all_gs, 2, which.max)]
all_gs_prediction_signif <- ifelse(apply(HGT_all_gs, 2, max)>2, yes = all_gs_prediction, "unassigned")

# For the sake of visualization, we group under the label "other" diverse cell types for which significant enrichments were found:
Baron$all_gs_prediction <- ifelse(all_gs_prediction_signif %in% c(names(pancreas_gs), "Schwann cells", "Endothelial cells", "Macrophages", "Mast cells", "T cells","Fibroblasts", "unassigned"), all_gs_prediction_signif,"other")

color <- c("#F8766D", "#E18A00", "#BE9C00", "#8CAB00", "#24B700", "#00BE70", "#00C1AB", "#00BBDA", "#00ACFC", "#8B93FF", "#D575FE", "#F962DD", "#FF65AC", "#D575FE", "#F962DD", "grey", "black")
ggcolor <- setNames(color,c(sort(unique(Baron$cell.type)), "Fibroblasts", "Schwann cells", "unassigned", "other"))
Baron$pancreas_gs_prediction <- factor(Baron$pancreas_gs_prediction,c(sort(unique(Baron$cell.type)), "Fibroblasts", "Schwann cells", "unassigned", "other"))
PredPlot2 <- DimPlot(Baron, group.by = "all_gs_prediction", reduction = "tsne") + 
  scale_color_manual(values = ggcolor, drop = FALSE) + 
  theme(legend.text = element_text(size = 10), aspect.ratio = 1)
ggarrange(OriginalPlot, PredPlot2, legend = "top", common.legend = TRUE)

7 CelliD cell-to-cell matching and label transferring across datasets

CelliD performs cell-to-cell matching across independent single-cell datasets. Datasets can originate from the same or from a different tissue / organ. Cell matching can be performed either within (e.g. human-to-human) or across species (e.g. mouse-to-human). CelliD cell matching across datasets is performed by a per-cell assessment in the query dataset evaluating the replication of gene signatures extracted from the reference dataset. Gene signatures from the reference dataset can be automatically derived either from individual cells (CelliD(c)), or from previously-established groups of cells (CelliD(g)).

7.1 CelliD(c) and CelliD(g): per-cell and per-group gene signature extraction from a reference dataset

In CelliD(c), the gene signatures extracted for each cell n in a dataset D can be assessed through their enrichment against the gene signatures extracted for each cell n’ in a reference dataset D’. Alternatively, CelliD(g) takes advantage of a grouping of cells in D, where per-group gene signatures are extracted and evaluated against the gene signatures for each cell n in the query dataset D. We note that the groupings used in CelliD(g) should be provided as input and tipically originate from a manual annotation process.

Analogous to the previous section, CelliD(c) and CelliD(g) evaluate such enrichments through hypergeometric tests, and p-values are corrected by multiple testing for the number of cells or the number of groups against which they are evaluated. Best hits can be used for cell-to-cell matching (CelliD(c)) or group-to-cell matching (CelliD(g) and subsequent label transferring across datasets. If no significant hits are found, a cell will remain unassigned.

Here we illustrate CelliD(c) and CelliD(g) using the Baron dataset as a reference set from which both cell and group signatures are extracted. The Segerstolpe dataset is used as the query set on which the cell-type labels previously annotated in the Baron dataset are transferred.

# Extracting per-cell gene signatures from the Baron dataset with CelliD(c)
Baron_cell_gs <- GetCellGeneSet(Baron, dims = 1:50, n.features = 200)

## 
## calculating distance

## 
## creating ranking

## 
## creating geneset

# Extracting per-group gene signatures from the Baron dataset with CelliD(g)
Baron_group_gs <- GetGroupGeneSet(Baron, dims = 1:50, n.features = 200, group.by = "cell.type")

## 
## creating ranking

7.2 Preprocessing and MCA assessment of the Segerstolpe dataset used as the query set

# Normalization, basic preprocessing and MCA dimensionality reduction assessment
Seger <- NormalizeData(Seger)
Seger <- FindVariableFeatures(Seger)
Seger <- ScaleData(Seger)
Seger <- RunMCA(Seger, nmcs = 50)

## 1.543 sec elapsed
## 29.643 sec elapsed
## 4.115 sec elapsed

Seger <- RunPCA(Seger)
Seger <- RunUMAP(Seger, dims = 1:30)
Seger <- RunTSNE(Seger, dims = 1:30)
tSNE <- DimPlot(Seger, reduction = "tsne", group.by = "cell.type", pt.size = 0.1) + ggtitle("tSNE") + theme(aspect.ratio = 1)
UMAP <- DimPlot(Seger, reduction = "umap", group.by = "cell.type", pt.size = 0.1) + ggtitle("UMAP") + theme(aspect.ratio = 1)
ggarrange(tSNE, UMAP, common.legend = TRUE, legend = "top")

7.3 CelliD(c): cell-to-cell matching and label transferring across datasets

HGT_baron_cell_gs <- RunCellHGT(Seger, pathways = Baron_cell_gs, dims = 1:50)

## 
## calculating distance

## ranking genes

## 8569 pathways kept for hypergeometric test out of 8569, 0 filtered as less than 10 features was present in the data

## 
## calculating features overlap

## performing hypergeometric test

baron_cell_gs_match <- rownames(HGT_baron_cell_gs)[apply(HGT_baron_cell_gs, 2, which.max)]
baron_cell_gs_prediction <- Baron$cell.type[baron_cell_gs_match]
baron_cell_gs_prediction_signif <- ifelse(apply(HGT_baron_cell_gs, 2, max)>2, yes = baron_cell_gs_prediction, "unassigned")
Seger$baron_cell_gs_prediction <- baron_cell_gs_prediction_signif

color <- c("#F8766D", "#E18A00", "#BE9C00", "#8CAB00", "#24B700", "#00BE70", "#00C1AB", "#00BBDA", "#00ACFC", "#8B93FF", "#D575FE", "#F962DD", "#FF65AC", "grey")
ggcolor <- setNames(color,c(sort(unique(Baron$cell.type)), "unassigned"))

ggPredictionsCellMatch <- DimPlot(Seger, group.by = "baron_cell_gs_prediction", pt.size = 0.2, reduction = "tsne") + ggtitle("Predicitons") + scale_color_manual(values = ggcolor, drop =FALSE) + 
  theme(legend.text = element_text(size =10), aspect.ratio = 1)
ggOriginal <- DimPlot(Seger, group.by = "cell.type", pt.size = 0.2, reduction = "tsne") + ggtitle("Original") + scale_color_manual(values = ggcolor) + 
  theme(legend.text = element_text(size =10), aspect.ratio = 1)
ggarrange(ggOriginal, ggPredictionsCellMatch, legend = "top", common.legend = TRUE)

7.4 CelliD(g): group-to-cell matching and label transferring across datasets

HGT_baron_group_gs <- RunCellHGT(Seger, pathways = Baron_group_gs, dims = 1:50)

## 
## calculating distance

## ranking genes

## 13 pathways kept for hypergeometric test out of 13, 0 filtered as less than 10 features was present in the data

## 
## calculating features overlap

## performing hypergeometric test

baron_group_gs_prediction <- rownames(HGT_baron_group_gs)[apply(HGT_baron_group_gs, 2, which.max)]
baron_group_gs_prediction_signif <- ifelse(apply(HGT_baron_group_gs, 2, max)>2, yes = baron_group_gs_prediction, "unassigned")
Seger$baron_group_gs_prediction <- baron_group_gs_prediction_signif

color <- c("#F8766D", "#E18A00", "#BE9C00", "#8CAB00", "#24B700", "#00BE70", "#00C1AB", "#00BBDA", "#00ACFC", "#8B93FF", "#D575FE", "#F962DD", "#FF65AC", "grey")
ggcolor <- setNames(color,c(sort(unique(Baron$cell.type)), "unassigned"))

ggPredictions <- DimPlot(Seger, group.by = "baron_group_gs_prediction", pt.size = 0.2, reduction = "tsne") + ggtitle("Predicitons") + scale_color_manual(values = ggcolor, drop =FALSE) + 
  theme(legend.text = element_text(size =10), aspect.ratio = 1)
ggOriginal <- DimPlot(Seger, group.by = "cell.type", pt.size = 0.2, reduction = "tsne") + ggtitle("Original") + scale_color_manual(values = ggcolor) + 
  theme(legend.text = element_text(size =10), aspect.ratio = 1)
ggarrange(ggOriginal, ggPredictions, legend = "top", common.legend = TRUE)

8 CelliD per-cell functionnal annotation against functional ontologies and pathway databases

Once MCA is performed, per-cell signatures can be evaluated against any custom collection of gene signatures which can, e.g. represent functional terms or biological pathways. This allows CelliD to perfome a per-cell functional enrichment analysis enabling biological interpretation of cell’s state. We illustrate here how to mine for that purpose a collection 7 sources of functional annotations: KEGG, Hallmark MSigDB, Reactome, WikiPathways, GO biological process, GO molecular function and GO cellular component. Gene sets associated to functional pathways and ontology terms can be obtained from enrichr

Here we illustrate the use of Hallmark in HyperGeometric test and integrate the results into the Seurat object to visualize the -log10 p-value of the enrichment into an UMAP.

# Downloading functional gene sets:
# For computational reasons, we just developed here the assessment on Hallmark MSigDB gene sets that is provided in the package
# Other functionnal pathways can be obtained using the msigdbr package

# library(msigdbr)
# msigdf <- msigdbr::msigdbr(species = "Homo sapiens")
# msigdf_filtered <- msigdf %>% select(gs_cat, gs_subcat, gs_name, gene_symbol, gs_description) %>%
#   filter(gs_subcat == "CP:KEGG") %>%
#   mutate(gs_subcat = ifelse(gs_subcat == "", "Hallmark", gs_subcat))
# KEGGgenesetDF <- msigdf_filtered %>%  mutate(gs_name = str_replace(sub("^.*?_{1}", "", gs_name), "_", "\\-")) %>%
#   group_by(gs_subcat, gs_name) %>%
#   summarise(geneset = list(gene_symbol)) %>%
#   group_by(gs_subcat) %>%
#   summarise(all_geneset = list(setNames(geneset, gs_name))) 
# KEGGgeneset <- setNames(KEGGgenesetDF$all_geneset, KEGGgenesetDF$gs_subcat)

# Assessing per-cell functional enrichment analyses
HGT_Hallmark <- RunCellHGT(Seger, pathways = Hallmark, dims = 1:50)

## 
## calculating distance

## ranking genes

## 50 pathways kept for hypergeometric test out of 50, 0 filtered as less than 10 features was present in the data

## 
## calculating features overlap

## performing hypergeometric test

#Integrating functional annotations as an "assay" slot in the Seurat objects

Seger@assays[["Hallmark"]] <- CreateAssayObject(HGT_Hallmark)

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

# Visualizing per-cell functional enrichment annotations in a dimensionality-reduction representation of choice (e.g. MCA, PCA, tSNE, UMAP)
ggG2Mcell <- FeaturePlot(Seger, "G2M-CHECKPOINT", order = TRUE, reduction = "tsne", min.cutoff = 2) + 
  theme(legend.text = element_text(size =10), aspect.ratio = 1)

## Warning: Could not find G2M-CHECKPOINT in the default search locations, found in
## Hallmark assay instead

ggG2Mcell

Session info

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggpubr_0.4.0                forcats_0.5.1              
##  [3] stringr_1.4.0               dplyr_1.0.6                
##  [5] purrr_0.3.4                 readr_1.4.0                
##  [7] tidyr_1.1.3                 tibble_3.1.2               
##  [9] ggplot2_3.3.3               tidyverse_1.3.1            
## [11] CelliD_1.0.0                SingleCellExperiment_1.14.0
## [13] SummarizedExperiment_1.22.0 Biobase_2.52.0             
## [15] GenomicRanges_1.44.0        GenomeInfoDb_1.28.0        
## [17] IRanges_2.26.0              S4Vectors_0.30.0           
## [19] BiocGenerics_0.38.0         MatrixGenerics_1.4.0       
## [21] matrixStats_0.58.0          SeuratObject_4.0.1         
## [23] Seurat_4.0.1                BiocStyle_2.20.0           
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.1                reticulate_1.20          
##   [3] tidyselect_1.1.1          htmlwidgets_1.5.3        
##   [5] grid_4.1.0                BiocParallel_1.26.0      
##   [7] Rtsne_0.15                munsell_0.5.0            
##   [9] ScaledMatrix_1.0.0        codetools_0.2-18         
##  [11] ica_1.0-2                 umap_0.2.7.0             
##  [13] future_1.21.0             miniUI_0.1.1.1           
##  [15] withr_2.4.2               colorspace_2.0-1         
##  [17] highr_0.9                 knitr_1.33               
##  [19] rstudioapi_0.13           ROCR_1.0-11              
##  [21] ggsignif_0.6.1            tensor_1.5               
##  [23] listenv_0.8.0             labeling_0.4.2           
##  [25] GenomeInfoDbData_1.2.6    polyclip_1.10-0          
##  [27] farver_2.1.0              parallelly_1.25.0        
##  [29] vctrs_0.3.8               generics_0.1.0           
##  [31] xfun_0.23                 R6_2.5.0                 
##  [33] ggbeeswarm_0.6.0          rsvd_1.0.5               
##  [35] bitops_1.0-7              spatstat.utils_2.1-0     
##  [37] fgsea_1.18.0              DelayedArray_0.18.0      
##  [39] RcppArmadillo_0.10.4.0.0  assertthat_0.2.1         
##  [41] promises_1.2.0.1          scales_1.1.1             
##  [43] beeswarm_0.3.1            gtable_0.3.0             
##  [45] beachmat_2.8.0            globals_0.14.0           
##  [47] goftest_1.2-2             tictoc_1.0.1             
##  [49] rlang_0.4.11              splines_4.1.0            
##  [51] rstatix_0.7.0             lazyeval_0.2.2           
##  [53] spatstat.geom_2.1-0       broom_0.7.6              
##  [55] BiocManager_1.30.15       yaml_2.2.1               
##  [57] reshape2_1.4.4            abind_1.4-5              
##  [59] modelr_0.1.8              backports_1.2.1          
##  [61] httpuv_1.6.1              tools_4.1.0              
##  [63] bookdown_0.22             ellipsis_0.3.2           
##  [65] spatstat.core_2.1-2       jquerylib_0.1.4          
##  [67] RColorBrewer_1.1-2        ggridges_0.5.3           
##  [69] Rcpp_1.0.6                plyr_1.8.6               
##  [71] sparseMatrixStats_1.4.0   zlibbioc_1.38.0          
##  [73] RCurl_1.98-1.3            ps_1.6.0                 
##  [75] rpart_4.1-15              openssl_1.4.4            
##  [77] deldir_0.2-10             pbapply_1.4-3            
##  [79] viridis_0.6.1             cowplot_1.1.1            
##  [81] zoo_1.8-9                 haven_2.4.1              
##  [83] ggrepel_0.9.1             cluster_2.1.2            
##  [85] fs_1.5.0                  magrittr_2.0.1           
##  [87] magick_2.7.2              data.table_1.14.0        
##  [89] RSpectra_0.16-0           scattermore_0.7          
##  [91] openxlsx_4.2.3            lmtest_0.9-38            
##  [93] reprex_2.0.0              RANN_2.6.1               
##  [95] fitdistrplus_1.1-3        hms_1.1.0                
##  [97] patchwork_1.1.1           mime_0.10                
##  [99] evaluate_0.14             xtable_1.8-4             
## [101] rio_0.5.26                readxl_1.3.1             
## [103] gridExtra_2.3             compiler_4.1.0           
## [105] scater_1.20.0             KernSmooth_2.23-20       
## [107] crayon_1.4.1              htmltools_0.5.1.1        
## [109] mgcv_1.8-35               later_1.2.0              
## [111] lubridate_1.7.10          DBI_1.1.1                
## [113] dbplyr_2.1.1              MASS_7.3-54              
## [115] car_3.0-10                Matrix_1.3-3             
## [117] cli_2.5.0                 igraph_1.2.6             
## [119] pkgconfig_2.0.3           foreign_0.8-81           
## [121] plotly_4.9.3              scuttle_1.2.0            
## [123] spatstat.sparse_2.0-0     xml2_1.3.2               
## [125] vipor_0.4.5               bslib_0.2.5.1            
## [127] XVector_0.32.0            rvest_1.0.0              
## [129] digest_0.6.27             sctransform_0.3.2        
## [131] RcppAnnoy_0.0.18          spatstat.data_2.1-0      
## [133] rmarkdown_2.8             cellranger_1.1.0         
## [135] leiden_0.3.7              fastmatch_1.1-0          
## [137] uwot_0.1.10               DelayedMatrixStats_1.14.0
## [139] curl_4.3.1                shiny_1.6.0              
## [141] lifecycle_1.0.0           nlme_3.1-152             
## [143] jsonlite_1.7.2            carData_3.0-4            
## [145] BiocNeighbors_1.10.0      viridisLite_0.4.0        
## [147] askpass_1.1               fansi_0.4.2              
## [149] pillar_1.6.1              lattice_0.20-44          
## [151] fastmap_1.1.0             httr_1.4.2               
## [153] survival_3.2-11           glue_1.4.2               
## [155] zip_2.1.1                 png_0.1-7                
## [157] stringi_1.6.2             sass_0.4.0               
## [159] BiocSingular_1.8.0        irlba_2.3.3              
## [161] future.apply_1.7.0