scfind package vignette
SingleCellExperiment class
scfind is built on top of Bioconductor's SingleCellExperiment class. scfind operates on objects of class SingleCellExperiment and writes all of its results back to the object.
scfind Input
If you already have a SingleCellExperiment object, then proceed to the next chapter.
If you have a matrix or a data frame containing expression data, then you first need to create a SingleCellExperiment object containing your data. For illustrative purposes we will use an example expression matrix provided with scfind. The dataset (yan) represents FPKM gene expression of 90 cells derived from human embryos. The authors (Yan et al.) defined the developmental stage of every cell in the original publication (ann data frame). We will use these stages as cell type labels later.
library(SingleCellExperiment)
library(scfind)
head(ann)
## cell_type1
## Oocyte..1.RPKM. zygote
## Oocyte..2.RPKM. zygote
## Oocyte..3.RPKM. zygote
## Zygote..1.RPKM. zygote
## Zygote..2.RPKM. zygote
## Zygote..3.RPKM. zygote
yan[1:3, 1:3]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## C9orf152 0.0 0.0 0.0
## RPS11 1219.9 1021.1 931.6
## ELMO2 7.0 12.2 9.3
Note that the cell type information has to be stored in the cell_type1 column of the colData slot of the SingleCellExperiment object.
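A quick sanity check before building the object can catch a mislabelled or misordered annotation table. The sketch below uses base R only and a small made-up matrix and annotation data frame (toy_expr and toy_ann are illustrative names, not part of scfind). It assumes the annotation has one row per cell, named the same way as the columns of the expression matrix.

```r
# Toy stand-ins for an expression matrix (genes x cells) and its annotation
toy_expr <- matrix(
  c(0,  0,  5,
    10, 12, 9,
    7,  0,  3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("cell1", "cell2", "cell3"))
)
toy_ann <- data.frame(
  cell_type1 = c("zygote", "zygote", "2cell"),
  row.names = c("cell1", "cell2", "cell3"),
  stringsAsFactors = FALSE
)

# The cell type labels must live in a column called cell_type1
has_label_column <- "cell_type1" %in% colnames(toy_ann)

# Every cell (matrix column) needs one annotation row, in the same order
same_cells <- identical(rownames(toy_ann), colnames(toy_expr))

# Stop early if either condition fails
stopifnot(has_label_column, same_cells)
```

The same two checks can be run on yan and ann before they are passed to the SingleCellExperiment constructor.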
Now let’s create a SingleCellExperiment object of the yan dataset:
sce <- SingleCellExperiment(assays = list(normcounts = as.matrix(yan)), colData = ann)
# this is needed to calculate dropout rate for feature selection
# important: normcounts have the same zeros as raw counts (fpkm)
counts(sce) <- normcounts(sce)
logcounts(sce) <- log2(normcounts(sce) + 1)
# use gene names as feature symbols
rowData(sce)$feature_symbol <- rownames(sce)
isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))
# remove features with duplicated names
sce <- sce[!duplicated(rownames(sce)), ]
sce
## class: SingleCellExperiment
## dim: 20214 90
## metadata(0):
## assays(3): normcounts counts logcounts
## rownames(20214): C9orf152 RPS11 ... CTSC AQP7
## rowData names(1): feature_symbol
## colnames(90): Oocyte..1.RPKM. Oocyte..2.RPKM. ...
## Late.blastocyst..3..Cell.7.RPKM. Late.blastocyst..3..Cell.8.RPKM.
## colData names(1): cell_type1
## reducedDimNames(0):
## spikeNames(1): ERCC
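The comments in the code above stress that normcounts must have the same zeros as the raw counts, because the dropout rate is derived from them. As a rough illustration, the sketch below assumes that the dropout rate of a gene is simply the fraction of cells in which its expression is zero; scfind's internal definition may differ. The matrix toy_expr is made up for the example.

```r
# Toy expression matrix: 3 genes x 4 cells
toy_expr <- matrix(
  c(0, 0, 0, 0,
    5, 0, 3, 1,
    2, 2, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  paste0("cell", 1:4))
)

# Assumed definition: fraction of cells with zero expression, per gene
dropout_rate <- rowMeans(toy_expr == 0)

# The log2(x + 1) transform used for logcounts maps zeros to zeros,
# so the pattern of zeros is unchanged
log_expr <- log2(toy_expr + 1)
zeros_preserved <- identical(toy_expr == 0, log_expr == 0)

print(dropout_rate)
print(zeros_preserved)
```

This is why copying the FPKM values into counts is acceptable for this purpose: only the positions of the zeros matter for such a calculation, not the magnitudes.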
If you have a list of genes that you would like to check against your dataset, i.e. find the cell types that most likely represent your genes (highest expression), then scfind allows you to do that by first creating a gene index and then very quickly searching the index:
geneIndex <- buildCellTypeIndex(sce)
p_values <- -log10(findCellType(geneIndex, c("SOX6", "SNAI3")))
barplot(p_values, ylab = "-log10(pval)", las = 2)
The calculation above shows that a list of genes containing SOX6 and SNAI3 is specific for the zygote cell type.
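The most specific cell type can also be read off programmatically rather than from the bar plot. The sketch below assumes that the search returns a named numeric vector of p-values, one per cell type; the vignette only shows the result being passed to -log10 and barplot, so the names are an assumption, and the values in toy_pvals are made up.

```r
# Made-up p-values, one per cell type (names are an assumption)
toy_pvals <- c("zygote" = 1e-6, "2cell" = 0.2, "4cell" = 0.5, "blast" = 0.9)

# Same transform as in the vignette: larger score = more significant
scores <- -log10(toy_pvals)

# Rank cell types from most to least significant
ranked <- sort(scores, decreasing = TRUE)
top_type <- names(ranked)[1]

print(ranked)
print(top_type)
```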
If you are more interested in finding the cells in which all the genes from your gene list are expressed, then you can build a cell index instead of a cell type index. Use the buildCellIndex function to build the index and findCell to search it:
geneIndex <- buildCellIndex(sce)
res <- findCell(geneIndex, c("SOX6", "SNAI3"))
res$common_exprs_cells
## cell_id cell_type
## 1 2 zygote
## 2 3 zygote
## 3 5 zygote
## 4 6 zygote
## 5 23 4cell
## 6 25 8cell
## 7 27 8cell
## 8 58 16cell
## 9 68 blast
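To see how the matching cells are distributed across cell types, the table above can be tallied with base R. The sketch below re-creates the printed result by hand as a data frame (cells is an illustrative name) so that it runs without scfind; with the real object, res$common_exprs_cells would take its place.

```r
# The nine cells reported above, typed in from the printed output
cells <- data.frame(
  cell_id = c(2, 3, 5, 6, 23, 25, 27, 58, 68),
  cell_type = c("zygote", "zygote", "zygote", "zygote",
                "4cell", "8cell", "8cell", "16cell", "blast"),
  stringsAsFactors = FALSE
)

# Count matching cells per cell type, most frequent first
per_type <- sort(table(cells$cell_type), decreasing = TRUE)

print(per_type)
```

Four of the nine matching cells are zygotes, which agrees with the cell type search result.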
Cell search reports the p-values corresponding to cell types as well:
barplot(-log10(res$p_values), ylab = "-log10(pval)", las = 2)
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bindrcpp_0.2 scfind_1.0.0
## [3] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.0
## [5] DelayedArray_0.4.0 matrixStats_0.52.2
## [7] Biobase_2.38.0 GenomicRanges_1.30.0
## [9] GenomeInfoDb_1.14.0 IRanges_2.12.0
## [11] S4Vectors_0.16.0 BiocGenerics_0.24.0
## [13] knitr_1.17 BiocStyle_2.6.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.13 plyr_1.8.4 bindr_0.1
## [4] compiler_3.4.2 XVector_0.18.0 bitops_1.0-6
## [7] tools_3.4.2 zlibbioc_1.24.0 digest_0.6.12
## [10] bit_1.1-12 tibble_1.3.4 evaluate_0.10.1
## [13] lattice_0.20-35 pkgconfig_2.0.1 rlang_0.1.2
## [16] Matrix_1.2-11 yaml_2.1.14 GenomeInfoDbData_0.99.1
## [19] stringr_1.2.0 dplyr_0.7.4 rprojroot_1.2
## [22] grid_3.4.2 glue_1.2.0 R6_2.2.2
## [25] hash_2.2.6 rmarkdown_1.6 bookdown_0.5
## [28] reshape2_1.4.2 magrittr_1.5 backports_1.1.1
## [31] codetools_0.2-15 htmltools_0.3.6 assertthat_0.2.0
## [34] stringi_1.1.5 RCurl_1.95-4.8