Contents

1 Contact

2 Installation

CelliD is based on R version >= 4.1. It contains dependencies with several CRAN and Biocondutor packages as described in the Description file

To install CelliD, set the repositories option to enable downloading Bioconductor dependencies:

if(!require("tidyverse")) install.packages("tidyverse")
if(!require("ggpubr")) install.packages("ggpubr")
BiocManager::install("CelliD")

2.1 Common installation issues

macOS users might experience installation issues related to Gfortran library. To solve this, download and install the appropriate gfortran dmg file from https://github.com/fxcoudert/gfortran-for-macOS

3 Data and libraries

3.1 Load libraries

library(CelliD)
library(tidyverse) # general purpose library for data handling
library(ggpubr) #library for plotting

3.2 Load Data

To illustrate CelliD usage we will use throughout this vignette two publicly available pancreas single-cell RNA-seq data sets provided in Baron et al. 2016 and Segerstolpe et al. 2016. We provide here for convenience the R objects with the raw counts gene expression matrices and associated metadata

#read data
BaronMatrix   <- readRDS(url("https://storage.googleapis.com/cellid-cbl/BaronMatrix.rds"))
BaronMetaData <- readRDS(url("https://storage.googleapis.com/cellid-cbl/BaronMetaData.rds"))
SegerMatrix   <- readRDS(url("https://storage.googleapis.com/cellid-cbl/SegerstolpeMatrix.rds"))
SegerMetaData <- readRDS(url("https://storage.googleapis.com/cellid-cbl/SegerstolpeMetaData2.rds"))

4 Data input and preprocessing steps

4.1 Restricting the analysis to protein-coding genes

While CelliD can handle all types of genes, we recommend restricting the analysis to protein-coding genes. CelliD package provides two gene lists (HgProteinCodingGenes and MgProteinCodingGenes) containing a background set of 19308 and 21914 protein-coding genes from human and mouse, respectively, obtained from BioMart Ensembl release 100, version April 2020 (GrCH38.p13 for human, and GRCm38.p6 for mouse). Gene identifiers here correspond to official gene symbols.

# Restricting to protein-coding genes
data("HgProteinCodingGenes")
BaronMatrixProt <- BaronMatrix[rownames(BaronMatrix) %in% HgProteinCodingGenes,]
SegerMatrixProt <- SegerMatrix[rownames(SegerMatrix) %in% HgProteinCodingGenes,]

4.2 Data input formats

CelliD use as input single cell data in the form of specific S4objects. Currently supported files are SingleCellExperiment from Bioconductor and Seurat Version 3 from CRAN. For downstream analyses, gene identifiers corresponding to official gene symbols are used.

4.3 Creating a Seurat object, and processing and normalization of the gene expression matrix

Library size normalization is carried out by rescaling counts to a common library size of 10000. Log transformation is performed after adding a pseudo-count of 1. Genes expressing at least one count in less than 5 cells are removed. These steps mimic the standard Seurat workflow described here.

# Create Seurat object  and remove remove low detection genes
Baron <- CreateSeuratObject(counts = BaronMatrixProt, project = "Baron", min.cells = 5, meta.data = BaronMetaData)
Seger <- CreateSeuratObject(counts = SegerMatrixProt, project = "Segerstolpe", min.cells = 5, meta.data = SegerMetaData)

# Library-size normalization, log-transformation, and centering and scaling of gene expression values
Baron <- NormalizeData(Baron)
Baron <- ScaleData(Baron, features = rownames(Baron))

While this vignette only illustrates the use of CelliD on single-cell RNA-seq data, we note that the package can handle other types of gene matrices, e.g sci-ATAC gene activity score matrices.

5 CelliD dimensionality reduction through MCA

CelliD is based on Multiple Correspondence Analysis (MCA), a multivariate method that allows the simultaneous representation of both cells and genes in the same low dimensional vector space. In such space, euclidean distances between genes and cells are computed, and per-cell gene rankings are obtained. The top ‘n’ closest genes to a given cell will be defined as its gene signature. Per-cell gene signatures will be obtained in a later section of this vignette.

To perform MCA dimensionality reduction, the command RunMCA is used:

Baron <- RunMCA(Baron)
## Computing Fuzzy Matrix
## 10.337 sec elapsed
## Computing SVD
## 133.82 sec elapsed
## Computing Coordinates
## 15.836 sec elapsed

The DimPlotMC command allows to visualize both cells and selected gene lists in MCA low dimensional space.

DimPlotMC(Baron, reduction = "mca", group.by = "cell.type", features = c("CTRL", "INS", "MYZAP", "CDH11"), as.text = TRUE) + ggtitle("MCA with some key gene markers")
## Warning: Ignoring unknown aesthetics: text
## Ignoring unknown aesthetics: text

In the previous plot we colored cells according to the pre-established cell type annotations available as metadata (“cell.type”) and as provided in Baron et al. 2016. No clustering step was performed here. In the common scenario where such annotations were not available, the DimPlotMC function would represent cells as red colored dots, and genes as black crosses. To represent all genes in the previous plot, just remove the “features” parameter from the previous command, so that the DimPlotMC function takes the default value.

5.1 Alternative state-of-the-art dimensionality reduction techniques

For the sake of comparisson, state-of-the-art dimensionality reduction techniques such as PCA, UMAP and tSNE can be obtained as follows:

Baron <- RunPCA(Baron, features = rownames(Baron))
## PC_ 1 
## Positive:  IFITM3, ZFP36L1, CDC42EP1, TMSB10, PMEPA1, YBX3, SOX4, IL32, TMSB4X, RHOC 
##     S100A11, TACSTD2, TM4SF1, KRT7, IFITM2, CD44, FLNA, MSN, S100A16, MYH9 
##     SDC4, LGALS3, SERPING1, PRSS8, SERPINA3, ITGA2, COL18A1, GSTP1, SAT1, DDIT4 
## Negative:  PCSK1N, CPE, SCG5, PTPRN, CHGA, CHGB, GNAS, SCGN, SCG2, CD99 
##     PAM, PEMT, VGF, PPP1R1A, BEX1, PCSK2, KIAA1324, TTR, PAX6, IDS 
##     NLRP1, SLC30A8, EEF1A2, UCHL1, GAD2, APLP1, SYT7, NEUROD1, ERO1B, TMOD1 
## PC_ 2 
## Positive:  CELA3A, PRSS1, CELA3B, CTRC, CTRB1, PRSS2, CLPS, CPA1, PNLIPRP1, PLA2G1B 
##     CPA2, CELA2A, CTRB2, CPB1, KLK1, SYCN, REG1A, PNLIP, REG1B, GP2 
##     CUZD1, SPINK1, PRSS3, CTRL, CELA2B, FOXD2, REG3G, MGST1, PDIA2, AMY2A 
## Negative:  PSAP, IGFBP7, COL18A1, APP, CALR, ITGB1, PKM, LGALS3BP, CD81, ITM2B 
##     GNAI2, GRN, APLP2, CTSD, CDK2AP1, LAMP1, PXDN, SPARC, ATP1A1, MMP14 
##     ITGA5, PLXNB2, TIMP1, FSTL1, COL4A1, CALD1, CTSB, NBL1, HSPG2, FLNA 
## PC_ 3 
## Positive:  SPARC, COL4A1, ENG, IGFBP4, COL6A2, PDGFRB, BGN, PXDN, COL15A1, COL3A1 
##     COL1A2, CYGB, AEBP1, ADAMTS4, COL6A3, HTRA3, EMILIN1, LRRC32, CRISPLD2, COL1A1 
##     COL5A3, C11orf96, LAMA4, MMP2, COL6A1, COL5A1, ITGA1, TIMP3, HIC1, COL5A2 
## Negative:  KRT19, TACSTD2, KRT7, SPINT2, ELF3, KRT8, ANXA4, PRSS8, SERINC2, SPINT1 
##     CDH1, LCN2, CLDN7, CLDN4, CFB, CD24, SERPINA5, CFTR, KRT18, CLDN1 
##     MMP7, LGALS4, ST14, C3, SDC4, LAD1, S100A14, SLC4A4, PTPRF, ABCC3 
## PC_ 4 
## Positive:  INS, HADH, IAPP, PCSK1, ADCYAP1, TIMP2, IGF2, DLK1, GSN, NPTX2 
##     TSPAN13, PDX1, VAT1L, RPS25, WLS, RPL7, RPL24, MAFA, SLC6A6, SYT13 
##     ELMO1, DNAJB9, UCHL1, RPL3, RGS16, RPS6, RBP4, PKIB, TGFBR3, RPL17 
## Negative:  GCG, GC, TM4SF4, CLU, TMEM176B, IRX2, TTR, TMEM176A, FEV, F10 
##     FXYD5, CRYBA2, SMIM24, SERPINA1, FXYD3, PCSK2, ARX, ALDH1A1, LOXL4, B2M 
##     PAPPA2, CD82, GPX3, FXYD6, GLS, MUC13, VSTM2L, EGFL7, RGS4, COTL1 
## PC_ 5 
## Positive:  COL6A3, COL1A2, PDGFRB, COL1A1, BGN, COL3A1, CRLF1, COL5A1, SFRP2, CYGB 
##     THBS2, EMILIN1, COL6A1, FMOD, VCAN, COL6A2, PRRX1, COL5A2, COL5A3, ADAMTS12 
##     CRISPLD2, LAMC3, AEBP1, NOTCH3, MXRA8, LUM, VSTM4, TPM2, INHBA, FN1 
## Negative:  PLVAP, PECAM1, CD93, FLT1, KDR, SOX18, PODXL, ACVRL1, RGCC, VWF 
##     ROBO4, TIE1, ESM1, ADGRL4, ESAM, ECSCR, S1PR1, SEMA3F, CXCR4, CLDN5 
##     PASK, NOTCH4, ERG, DYSF, CDH5, CYP1A1, ANGPT2, BCL6B, IFI27, PTPRB
Baron <- RunUMAP(Baron, dims = 1:30)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 16:18:30 UMAP embedding parameters a = 0.9922 b = 1.112
## 16:18:30 Read 8569 rows and found 30 numeric columns
## 16:18:30 Using Annoy for neighbor search, n_neighbors = 30
## 16:18:30 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 16:18:31 Writing NN index file to temp file /tmp/RtmprhiQBx/file2bd17850c312ad
## 16:18:31 Searching Annoy index using 1 thread, search_k = 3000
## 16:18:34 Annoy recall = 100%
## 16:18:35 Commencing smooth kNN distance calibration using 1 thread
## 16:18:37 Initializing from normalized Laplacian + noise
## 16:18:37 Commencing optimization for 500 epochs, with 351608 positive edges
## 16:18:48 Optimization finished
Baron <- RunTSNE(Baron, dims = 1:30)
PCA  <- DimPlot(Baron, reduction = "pca", group.by = "cell.type")  + ggtitle("PCA") + theme(legend.text = element_text(size =10), aspect.ratio = 1)
tSNE <- DimPlot(Baron, reduction = "tsne", group.by = "cell.type")+ ggtitle("tSNE") + theme(legend.text = element_text(size =10), aspect.ratio = 1)
UMAP <- DimPlot(Baron, reduction = "umap", group.by = "cell.type") + ggtitle("UMAP") + theme(legend.text = element_text(size =10), aspect.ratio = 1)
MCA <- DimPlot(Baron, reduction = "mca", group.by = "cell.type")  + ggtitle("MCA") + theme(legend.text = element_text(size =10), aspect.ratio = 1)
ggarrange(PCA, MCA, common.legend = TRUE, legend = "top")