The EasyCellType
package examines an input marker list against public marker gene databases
and presents annotation recommendations graphically.
The package draws on three publicly available marker gene databases
and provides two approaches to the annotation analysis:
gene set enrichment analysis (GSEA) and a modified Fisher’s exact test.
The package has been submitted to Bioconductor
to make it easily accessible to researchers.
This vignette walks through a simple workflow that illustrates how the EasyCellType package works. The data set used throughout the example is freely available from 10X Genomics.
The package can be installed using BiocManager
with the following commands:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("EasyCellType")
Alternatively, the package can be installed from GitHub using devtools
with
library(devtools)
install_github("rx-li/EasyCellType")
After the installation, the package can be loaded with
library(EasyCellType)
We use the Peripheral Blood Mononuclear Cells (PBMC) data freely available from 10X Genomics.
The data can be downloaded from
https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz.
Once downloaded, the data can be read with the Seurat function Read10X.
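As a minimal sketch of that reading step, the snippet below calls Read10X on the extracted archive. The directory path is an assumption (the folder obtained by extracting the archive in the current working directory), so adjust it to your own layout; the code skips the read if Seurat or the directory is not available.

```r
# Assumed location: the folder produced by extracting the archive in the
# current working directory; adjust to wherever the files actually live.
data_dir <- file.path("filtered_gene_bc_matrices", "hg19")

if (requireNamespace("Seurat", quietly = TRUE) && dir.exists(data_dir)) {
  # Read10X returns a sparse gene-by-cell count matrix
  pbmc_data <- Seurat::Read10X(data.dir = data_dir)
  print(dim(pbmc_data))
} else {
  message("Seurat or the directory '", data_dir, "' is not available; skipping.")
}
```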
We have included the data in our package, which can be loaded with
data(pbmc_data)
We followed the standard workflow provided by the Seurat
package (Hao et al. 2021) to process the PBMC data set.
Detailed technical explanations can be found at
https://satijalab.org/seurat/articles/pbmc3k_tutorial.html.
library(Seurat)
# Initialize the Seurat object
pbmc <- CreateSeuratObject(counts = pbmc_data, project = "pbmc3k", min.cells = 3, min.features = 200)
# Compute QC metrics and filter cells
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
# Normalize the data
pbmc <- NormalizeData(pbmc)
# Identify highly variable features
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Scale the data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
# Perform linear dimensional reduction
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Cluster the cells
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
# Find differentially expressed features
markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
We now have the differentially expressed markers for each cluster. Next, we convert the gene symbols to Entrez IDs.
library(org.Hs.eg.db)
library(AnnotationDbi)
markers$entrezid <- mapIds(org.Hs.eg.db,
                           keys = markers$gene,   # column containing gene symbols
                           column = "ENTREZID",
                           keytype = "SYMBOL",
                           multiVals = "first")
# Drop rows with missing values, e.g. symbols without an Entrez ID
markers <- na.omit(markers)
If the data come from mouse, replace the package org.Hs.eg.db
with org.Mm.eg.db
and repeat the steps above.
The input for the EasyCellType
package should be a data frame containing Entrez IDs,
clusters and expression scores, with the columns in that order.
Within each cluster, the genes should be sorted by decreasing expression score.
library(dplyr)
markers_sort <- data.frame(gene = markers$entrezid, cluster = markers$cluster,
                           score = markers$avg_log2FC) %>%
  group_by(cluster) %>%
  # Rank genes within each cluster, breaking ties at random
  mutate(rank = rank(score, ties.method = "random")) %>%
  arrange(desc(rank))
input.d <- as.data.frame(markers_sort[, 1:3])
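To make the expected input layout concrete, here is a minimal base R sketch on a toy table. The Entrez IDs and scores are invented for illustration only; the point is the column order (gene, cluster, score) and the decreasing score order within each cluster.

```r
# Toy marker table; the Entrez IDs and scores are made up for illustration.
toy <- data.frame(
  gene    = c("1001", "1002", "1003", "2001", "2002"),
  cluster = factor(c(0, 0, 0, 1, 1)),
  score   = c(0.8, 2.1, 1.4, 0.5, 3.0),
  stringsAsFactors = FALSE
)

# Order by cluster, then by decreasing score within each cluster.
toy_sorted <- toy[order(toy$cluster, -toy$score), ]
rownames(toy_sorted) <- NULL
print(toy_sorted)
```

This sketch additionally groups the rows by cluster, which is only for readability of the printed table.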
We have included the processed data in the package. It can be loaded with
data("gene_pbmc")
input.d <- gene_pbmc
Now we can call the easyct
function to run the annotation analysis.
annot.GSEA <- easyct(input.d, db="cellmarker", species="Human",
tissue=c("Blood", "Peripheral blood", "Blood vessel",
"Umbilical cord blood", "Venous blood"), p_cut=0.3,
test="GSEA")
Here we used the GSEA approach for the annotation. The package uses the GSEA
function in the clusterProfiler
package (Wu et al. 2021) to conduct the enrichment analysis.
You can replace ‘GSEA’ with ‘fisher’ if you would like to use Fisher’s exact test
for the annotation instead.
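For intuition about the Fisher-based alternative, the following base R sketch runs a standard one-sided Fisher's exact test on the overlap between one cluster's markers and one cell type's database markers. All counts are hypothetical, and this is the textbook test rather than the modified version implemented in the package.

```r
# Hypothetical counts, chosen only for illustration.
n_universe <- 2000  # genes in the background
n_cluster  <- 50    # marker genes of one cluster
n_celltype <- 40    # database markers of one cell type
n_overlap  <- 10    # genes present in both lists

# 2 x 2 contingency table (matrix() fills column by column).
tab <- matrix(
  c(n_overlap,                                        # in cluster, in cell type
    n_celltype - n_overlap,                           # not in cluster, in cell type
    n_cluster - n_overlap,                            # in cluster, not in cell type
    n_universe - n_cluster - n_celltype + n_overlap), # in neither
  nrow = 2,
  dimnames = list(in_cluster = c("yes", "no"), in_celltype = c("yes", "no"))
)

# One-sided test: is the overlap larger than expected by chance?
res <- fisher.test(tab, alternative = "greater")
print(tab)
print(res$p.value)
```

With these counts the expected overlap under independence is 50 * 40 / 2000 = 1 gene, so an observed overlap of 10 yields a small p-value.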
The candidate tissues can be seen using data(cellmarker_tissue),
data(clustermole_tissue)
and data(panglao_tissue).
The dot plot showing the overall annotation results can be created by
plot_dot(test="GSEA", annot.GSEA)
A bar plot can be created by
plot_bar(test="GSEA", annot.GSEA)
sessionInfo()
#> R version 4.3.0 RC (2023-04-13 r84269)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] dplyr_1.1.2 org.Hs.eg.db_3.17.0 AnnotationDbi_1.62.0
#> [4] IRanges_2.34.0 S4Vectors_0.38.0 Biobase_2.60.0
#> [7] BiocGenerics_0.46.0 SeuratObject_4.1.3 Seurat_4.3.0
#> [10] EasyCellType_1.1.1 devtools_2.4.5 usethis_2.1.6
#> [13] BiocStyle_2.28.0
#>
#> loaded via a namespace (and not attached):
#> [1] RcppAnnoy_0.0.20 splines_4.3.0 later_1.3.0
#> [4] bitops_1.0-7 ggplotify_0.1.0 tibble_3.2.1
#> [7] polyclip_1.10-4 lifecycle_1.0.3 rprojroot_2.0.3
#> [10] globals_0.16.2 processx_3.8.1 lattice_0.21-8
#> [13] MASS_7.3-59 magrittr_2.0.3 limma_3.56.0
#> [16] plotly_4.10.1 sass_0.4.5 rmarkdown_2.21
#> [19] jquerylib_0.1.4 yaml_2.3.7 remotes_2.4.2
#> [22] httpuv_1.6.9 sctransform_0.3.5 spatstat.sparse_3.0-1
#> [25] sp_1.6-0 sessioninfo_1.2.2 pkgbuild_1.4.0
#> [28] reticulate_1.28 pbapply_1.7-0 cowplot_1.1.1
#> [31] DBI_1.1.3 RColorBrewer_1.1-3 abind_1.4-5
#> [34] pkgload_1.3.2 zlibbioc_1.46.0 Rtsne_0.16
#> [37] purrr_1.0.1 ggraph_2.1.0 RCurl_1.98-1.12
#> [40] yulab.utils_0.0.6 tweenr_2.0.2 GenomeInfoDbData_1.2.10
#> [43] enrichplot_1.20.0 ggrepel_0.9.3 irlba_2.3.5.1
#> [46] spatstat.utils_3.0-2 listenv_0.9.0 tidytree_0.4.2
#> [49] goftest_1.2-3 spatstat.random_3.1-4 fitdistrplus_1.1-11
#> [52] parallelly_1.35.0 leiden_0.4.3 codetools_0.2-19
#> [55] DOSE_3.26.0 ggforce_0.4.1 tidyselect_1.2.0
#> [58] aplot_0.1.10 farver_2.1.1 viridis_0.6.2
#> [61] spatstat.explore_3.1-0 matrixStats_0.63.0 jsonlite_1.8.4
#> [64] progressr_0.13.0 ellipsis_0.3.2 tidygraph_1.2.3
#> [67] ggridges_0.5.4 survival_3.5-5 tools_4.3.0
#> [70] treeio_1.24.0 ica_1.0-3 Rcpp_1.0.10
#> [73] glue_1.6.2 gridExtra_2.3 xfun_0.39
#> [76] qvalue_2.32.0 GenomeInfoDb_1.36.0 withr_2.5.0
#> [79] BiocManager_1.30.20 fastmap_1.1.1 fansi_1.0.4
#> [82] callr_3.7.3 digest_0.6.31 R6_2.5.1
#> [85] mime_0.12 gridGraphics_0.5-1 colorspace_2.1-0
#> [88] scattermore_0.8 GO.db_3.17.0 tensor_1.5
#> [91] spatstat.data_3.0-1 RSQLite_2.3.1 utf8_1.2.3
#> [94] tidyr_1.3.0 generics_0.1.3 data.table_1.14.8
#> [97] prettyunits_1.1.1 graphlayouts_0.8.4 httr_1.4.5
#> [100] htmlwidgets_1.6.2 scatterpie_0.1.9 org.Mm.eg.db_3.17.0
#> [103] uwot_0.1.14 pkgconfig_2.0.3 gtable_0.3.3
#> [106] blob_1.2.4 lmtest_0.9-40 XVector_0.40.0
#> [109] clusterProfiler_4.8.0 shadowtext_0.1.2 htmltools_0.5.5
#> [112] profvis_0.3.7 bookdown_0.33 fgsea_1.26.0
#> [115] scales_1.2.1 png_0.1-8 ggfun_0.0.9
#> [118] knitr_1.42 reshape2_1.4.4 nlme_3.1-162
#> [121] curl_5.0.0 zoo_1.8-12 cachem_1.0.7
#> [124] stringr_1.5.0 KernSmooth_2.23-20 parallel_4.3.0
#> [127] miniUI_0.1.1.1 HDO.db_0.99.1 desc_1.4.2
#> [130] pillar_1.9.0 grid_4.3.0 vctrs_0.6.2
#> [133] RANN_2.6.1 urlchecker_1.0.1 promises_1.2.0.1
#> [136] cluster_2.1.4 xtable_1.8-4 evaluate_0.20
#> [139] magick_2.7.4 cli_3.6.1 compiler_4.3.0
#> [142] rlang_1.1.0 crayon_1.5.2 future.apply_1.10.0
#> [145] ps_1.7.5 plyr_1.8.8 forcats_1.0.0
#> [148] fs_1.6.2 stringi_1.7.12 deldir_1.0-6
#> [151] viridisLite_0.4.1 BiocParallel_1.34.0 munsell_0.5.0
#> [154] Biostrings_2.68.0 lazyeval_0.2.2 spatstat.geom_3.1-0
#> [157] GOSemSim_2.26.0 Matrix_1.5-4 patchwork_1.1.2
#> [160] bit64_4.0.5 future_1.32.0 ggplot2_3.4.2
#> [163] KEGGREST_1.40.0 shiny_1.7.4 highr_0.10
#> [166] ROCR_1.0-11 igraph_1.4.2 memoise_2.0.1
#> [169] bslib_0.4.2 ggtree_3.8.0 fastmatch_1.1-3
#> [172] bit_4.0.5 downloader_0.4 ape_5.7-1
#> [175] gson_0.1.0
Hao, Yuhan, Stephanie Hao, Erica Andersen-Nissen, William M. Mauck III, Shiwei Zheng, Andrew Butler, Maddie J. Lee, et al. 2021. “Integrated Analysis of Multimodal Single-Cell Data.” Cell. https://doi.org/10.1016/j.cell.2021.04.048.
Wu, Tianzhi, Erqiang Hu, Shuangbin Xu, Meijun Chen, Pingfan Guo, Zehan Dai, Tingze Feng, et al. 2021. “ClusterProfiler 4.0: A Universal Enrichment Tool for Interpreting Omics Data.” The Innovation. https://doi.org/10.1016/j.xinn.2021.100141.