scrapper
implements bindings to C++ code for analyzing single-cell data, mostly
from the libscran
libraries. Each function performs an individual analysis step ranging
from quality control to clustering and marker detection. Users can also
conveniently perform end-to-end analyses with a single call to the
analyze.se() function.
Let’s fetch a small single-cell RNA-seq dataset for demonstration purposes:
## class: SingleCellExperiment
## dim: 20006 3005
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
## 1772058148_F03
## colData names(9): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC
We run it through the scrapper analysis pipeline. This performs the usual steps - quality control, normalization, selection of highly variable genes, PCA, clustering, visualization and marker detection, see the OSCA book for more theoretical details.
library(scrapper)
is.mito <- grep("^mt-", rownames(sce.z))
results <- analyze.se(
sce.z,
rna.qc.subsets=list(Mito=is.mito),
num.threads=2 # limit number of threads to keep R CMD check happy.
)
results$x # most outputs are stored in a copy of the input SingleCellExperiment. ## class: SingleCellExperiment
## dim: 20006 2874
## metadata(2): qc PCA
## assays(2): counts logcounts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(6): featureType means ... residuals hvg
## colnames(2874): 1772071015_C02 1772071017_G12 ... 1772063068_D01
## 1772066098_A12
## colData names(15): tissue group # ... sizeFactor graph.cluster
## reducedDimNames(3): PCA TSNE UMAP
## mainExpName: gene
## altExpNames(2): repeat ERCC
Now we can have a look at some of the results. For example, we can make a t-SNE plot:
We can also look at the markers defining each cluster:
# Ordering by the median AUC across pairwise comparisons involving the cluster.
previewMarkers(results$markers$rna[[1]], order.by="auc.median")## DataFrame with 10 rows and 4 columns
## mean detected lfc auc.median
## <numeric> <numeric> <numeric> <numeric>
## Gad1 4.79503 1.000000 4.56949 0.997684
## Gad2 4.44192 0.996503 4.25766 0.995434
## Ndrg4 4.40310 0.996503 2.59179 0.990182
## Stmn3 4.71546 0.993007 2.64538 0.986791
## Tspyl4 3.36568 1.000000 2.15128 0.983035
## Syp 2.79635 0.993007 1.53131 0.976301
## Rab3a 3.34302 0.989510 1.82740 0.975972
## Scg5 4.74990 1.000000 2.06392 0.975421
## Snap25 5.40529 1.000000 2.61350 0.974922
## Slc6a1 3.75820 0.993007 3.08908 0.974192
analyze.se() can also be broken down to its component
functions for further customization:
test <- sce.z
test <- quickRnaQc.se(test, subsets=list(mt=is.mito), num.threads=2)
test <- test[,test$keep]
test <- normalizeRnaCounts.se(test, size.factors=test$sum)
test <- chooseRnaHvgs.se(test, num.threads=2)
test <- runPca.se(test, features=rowData(test)$hvg, num.threads=2)
test <- runAllNeighborSteps.se(test, num.threads=2)
markers <- scoreMarkers.se(test, groups=test$clusters, num.threads=2)Let’s fetch another brain dataset and combine it with the previous one.
sce.t <- TasicBrainData()
# Only using the common genes in both datasets.
common <- intersect(rownames(sce.z), rownames(sce.t))
combined <- combineCols(sce.t[common,], sce.z[common,])
block <- rep(c("tasic", "zeisel"), c(ncol(sce.t), ncol(sce.z)))We set block= in each step to account for the batch
structure. This ensures that the various calculations are not affected
by inter-block differences. It also uses MNN correction to batch effects
in the low-dimensional space prior to further analyses.
# No mitochondrial genes in the combined set, so we'll just skip it.
blocked_res <- analyze.se(combined, block=block, num.threads=2)Hopefully, the correction is able to get rid of the batch effect and merge the datasets together.
We can also check the distribution of clusters across batches. The presence of batch-specific clusters might indicate that the batch effect was not fully removed; or they might represent cell types that are unique to one of the studies, who knows.
##
## tasic zeisel
## 1 122 90
## 2 477 70
## 3 364 130
## 4 183 128
## 5 43 362
## 6 22 121
## 7 91 99
## 8 20 150
## 9 31 12
## 10 48 18
## 11 248 106
## 12 20 701
## 13 7 252
## 14 1 213
## 15 8 87
## 16 0 463
Finally, some markers for another cluster:
## DataFrame with 10 rows and 3 columns
## mean detected lfc
## <numeric> <numeric> <numeric>
## Gad1 4.99380 1.000000 3.26141
## Igf1 4.22957 0.990566 3.22416
## Synpr 5.06317 0.985849 3.61784
## Cnr1 6.27028 1.000000 4.46956
## Slc6a1 4.07537 1.000000 2.19651
## Gad2 3.45237 1.000000 2.42218
## Vstm2a 3.37646 0.995283 2.26351
## Vip 5.68657 0.863208 5.08033
## Nap1l5 3.72545 1.000000 2.35075
## Nrxn3 4.17803 1.000000 2.27554
We can also compute pseudo-bulk profiles for each cluster-dataset combination, e.g., for differential expression analyses.
aggregates <- aggregateAcrossCells.se(
blocked_res$x,
list(cluster=blocked_res$x$graph.cluster, dataset=blocked_res$x$block)
)
aggregates## class: SummarizedExperiment
## dim: 18698 31
## metadata(1): aggregated
## assays(2): sums detected
## rownames(18698): Tspan12 Tshz1 ... Uty Usp9y
## rowData names(6): featureType means ... residuals hvg
## colnames: NULL
## colData names(30): factor.cluster factor.dataset ... block
## graph.cluster
Let’s fetch a single-cell dataset with both RNA-seq and CITE-seq data. To keep the run-time short, we’ll only consider the first 5000 cells.
sce.s <- sce.s[,1:5000]
# Combining the various ADT-related alternative experiments into a single object.
altExp(sce.s, "all.adts") <- rbind(altExp(sce.s, "adt1"), altExp(sce.s, "adt2"))
altExp(sce.s, "adt1") <- altExp(sce.s, "adt2") <- NULL
sce.s## class: SingleCellExperiment
## dim: 27679 5000
## metadata(0):
## assays(1): counts
## rownames(27679): A1BG A1BG-AS1 ... hsa-mir-8072 snoU2-30
## rowData names(0):
## colnames(5000): TGCCAAATCTCTAAGG ACTGCTCAGGTGTTAA ... CGCTATCGTCCGTCAG
## GGCGACTGTAAGGGAA
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(1): all.adts
We specify adt.altexp= to indicate that one of the
alternative experiments contains ADT data. This instructs the analysis
to use data from both modalities during clustering and
visualization.
is.mito <- grepl("^MT-", rownames(sce.s))
is.igg <- grepl("^IgG", rownames(altExp(sce.s, 'all.adts')))
multi_res <- analyze.se(
sce.s,
adt.altexp="all.adts",
rna.qc.subsets=list(mito=is.mito),
adt.qc.subsets=list(igg=is.igg),
num.threads=2
)Here’s an obligatory plot:
We can inspect the top markers among the ADTs for a cluster:
# By default, markers are sorted by the mean Cohen's d.
# We just replace the column name when printing the dataframe here.
previewMarkers(multi_res$markers$adt[[5]], c(effect="cohens.d.mean"))## DataFrame with 10 rows and 1 column
## effect
## <numeric>
## CD4-TAGACAGCTG 3.6884442
## CD3-TCTTCGTCCA 0.7931371
## CD45RA-AGTCCATTGG 0.1395605
## CCR7-GTAAACCGCA 0.0754008
## IgG_control-CATGATTGGCTC -0.0190229
## IgM_control-AAGCGCTTGGCA -0.0508764
## IgG2b_2_control-CATCGGTGTACA -0.1411518
## PD-1-GTACTTCCAG -0.3935047
## IgG1_2_control-GTCTAGACTTCG -0.5329377
## CD8-AGATGAACCC -0.5452329
… along with the RNA markers, as before.
# Using the delta-detected to focus on silent-to-expressed genes.
previewMarkers(multi_res$markers$rna[[5]], order.by="delta.detected.mean")## DataFrame with 10 rows and 4 columns
## mean detected lfc delta.detected.mean
## <numeric> <numeric> <numeric> <numeric>
## ENAH 1.36986 0.94375 0.820350 0.598778
## CNN3 2.09056 0.97500 1.224571 0.597853
## WBP5 2.01220 0.98125 1.145416 0.594153
## PLS3 1.75284 0.96875 1.012275 0.591067
## CENPV 1.90134 0.96250 1.130238 0.589109
## S100A13 1.35069 0.93750 0.793204 0.588513
## PCBD1 1.70390 0.96875 0.931640 0.587040
## UCHL1 1.71843 0.96250 0.954069 0.585129
## CALD1 1.60756 0.93750 0.957796 0.579067
## RDX 1.26275 0.93125 0.739840 0.577328
The runAllNeighborSteps() fucntion will run
runUmap(), runTsne(),
buildSnnGraph() and clusterGraph() in a single
call. This runs the UMAP/t-SNE iterations and the clustering in parallel
to maximize use of multiple threads.
The scoreGeneSet() function will compute a gene set
score based on the input expression matrix. This can be used to
summarize the activity of pathways into a single per-cell score for
visualization.
The subsampleByNeighbors() function will
deterministically select a representative subset of cells based on their
local neighborhood. This can be used to reduce the compute time of the
various steps downstream of the PCA.
For CRISPR data, quality control can be performed using
computeCrisprQcMetrics(),
suggestCrisprQcThresholds() and
filterCrisprQcMetrics(). To normalize, we use size factors
defined by centering the total sum of guide counts for each cell.
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scater_1.39.2 ggplot2_4.0.2
## [3] scuttle_1.21.0 scrapper_1.5.12
## [5] scRNAseq_2.25.0 SingleCellExperiment_1.33.0
## [7] SummarizedExperiment_1.41.1 Biobase_2.71.0
## [9] GenomicRanges_1.63.1 Seqinfo_1.1.0
## [11] IRanges_2.45.0 S4Vectors_0.49.0
## [13] BiocGenerics_0.57.0 generics_0.1.4
## [15] MatrixGenerics_1.23.0 matrixStats_1.5.0
## [17] knitr_1.51 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3 jsonlite_2.0.0
## [4] magrittr_2.0.4 ggbeeswarm_0.7.3 GenomicFeatures_1.63.1
## [7] gypsum_1.7.0 farver_2.1.2 rmarkdown_2.30
## [10] BiocIO_1.21.0 vctrs_0.7.1 memoise_2.0.1
## [13] Rsamtools_2.27.0 RCurl_1.98-1.17 htmltools_0.5.9
## [16] S4Arrays_1.11.1 AnnotationHub_4.1.0 curl_7.0.0
## [19] BiocNeighbors_2.5.4 Rhdf5lib_1.33.0 SparseArray_1.11.10
## [22] rhdf5_2.55.13 sass_0.4.10 alabaster.base_1.11.2
## [25] bslib_0.10.0 alabaster.sce_1.11.0 httr2_1.2.2
## [28] cachem_1.1.0 buildtools_1.0.0 GenomicAlignments_1.47.0
## [31] lifecycle_1.0.5 pkgconfig_2.0.3 rsvd_1.0.5
## [34] Matrix_1.7-4 R6_2.6.1 fastmap_1.2.0
## [37] digest_0.6.39 AnnotationDbi_1.73.0 irlba_2.3.7
## [40] ExperimentHub_3.1.0 RSQLite_2.4.6 beachmat_2.27.2
## [43] filelock_1.0.3 labeling_0.4.3 httr_1.4.7
## [46] abind_1.4-8 compiler_4.5.2 bit64_4.6.0-1
## [49] withr_3.0.2 S7_0.2.1 BiocParallel_1.45.0
## [52] viridis_0.6.5 DBI_1.2.3 HDF5Array_1.39.0
## [55] alabaster.ranges_1.11.0 alabaster.schemas_1.11.0 rappdirs_0.3.4
## [58] DelayedArray_0.37.0 rjson_0.2.23 tools_4.5.2
## [61] vipor_0.4.7 beeswarm_0.4.0 glue_1.8.0
## [64] h5mread_1.3.1 restfulr_0.0.16 rhdf5filters_1.23.3
## [67] grid_4.5.2 gtable_0.3.6 ensembldb_2.35.0
## [70] BiocSingular_1.27.1 ScaledMatrix_1.19.0 XVector_0.51.0
## [73] ggrepel_0.9.6 BiocVersion_3.23.1 pillar_1.11.1
## [76] dplyr_1.2.0 BiocFileCache_3.1.0 lattice_0.22-9
## [79] rtracklayer_1.71.3 bit_4.6.0 tidyselect_1.2.1
## [82] maketools_1.3.2 Biostrings_2.79.4 gridExtra_2.3
## [85] ProtGenerics_1.43.0 xfun_0.56 UCSC.utils_1.7.1
## [88] lazyeval_0.2.2 yaml_2.3.12 evaluate_1.0.5
## [91] codetools_0.2-20 cigarillo_1.1.0 tibble_3.3.1
## [94] alabaster.matrix_1.11.0 BiocManager_1.30.27 cli_3.6.5
## [97] jquerylib_0.1.4 Rcpp_1.1.1 GenomeInfoDb_1.47.2
## [100] dbplyr_2.5.1 png_0.1-8 XML_3.99-0.22
## [103] parallel_4.5.2 blob_1.3.0 AnnotationFilter_1.35.0
## [106] bitops_1.0-9 alabaster.se_1.11.0 viridisLite_0.4.3
## [109] scales_1.4.0 purrr_1.2.1 crayon_1.5.3
## [112] rlang_1.1.7 KEGGREST_1.51.1