This vignette shows the packaged Microglia example for
population-scale simulation with genetic effects inside
simPIC. The example uses reusable extdata
files so the workflow can be rerun without rebuilding intermediate
objects. The comparison plots follow the later Poptrial.Rmd
pattern by using bluster to compare PCA structure,
silhouette widths, and neighborhood purity between real and simulated
cells.
The example follows the splatPop estimation pattern:
SingleCellExperiment with
Sample, Library, Batch, and
sample_batch metadatabigcounts)The first part provides a quick start for running the packaged example. The second part examines the returned objects in more detail and relates them to the splatPop-style estimation strategy.
The Microglia example ships with:
microglia_sce.rdsmicroglia_sample_map.tsvmicroglia_peak_annot_chr14.tsvmicroglia_chr14.vcfThe canonical starting point is microglia_sce.rds, which
already stores the Microglia counts and cell metadata in a package-ready
SingleCellExperiment object.
extdir.candidates <- c(
file.path("inst", "extdata"),
file.path("..", "inst", "extdata"),
system.file("extdata", package = "simPIC")
)
extdir <- extdir.candidates[
file.exists(file.path(extdir.candidates, "microglia_sce.rds"))
][1]
stopifnot(!is.na(extdir), nzchar(extdir), dir.exists(extdir))
list.files(extdir, pattern = "^microglia")
#> [1] "microglia_chr14.vcf" "microglia_peak_annot_chr14.tsv"
#> [3] "microglia_sample_map.tsv" "microglia_sce.rds"microglia_sce <- readRDS(file.path(extdir, "microglia_sce.rds"))
microglia_sce
#> class: SingleCellExperiment
#> dim: 300 240
#> metadata(0):
#> assays(1): counts
#> rownames(300): Peak61395 Peak63173 ... Peak62680 Peak62933
#> rowData names(19): seqnames start ... N peak_id
#> colnames(240): D19-13169#ACATGCATCAACGTGT-1
#> D19-13169#AACAAAGAGTTCGTTG-1 ... D19-13182#CGTAAACGTATCGCGC-1
#> D19-13182#TGCTTTAAGAGGCAGG-1
#> colData names(43): BlacklistRatio DoubletEnrichment ... Batch
#> sample_batch
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
head(colData(microglia_sce)[, c("Sample", "Library", "Batch", "sample_batch", "Pathology")])
#> DataFrame with 6 rows and 5 columns
#> Sample Library Batch
#> <character> <character> <character>
#> D19-13169#ACATGCATCAACGTGT-1 D19-13169 Library10 Library10
#> D19-13169#AACAAAGAGTTCGTTG-1 D19-13169 Library10 Library10
#> D19-13169#TACGGATTCAGCACGC-1 D19-13169 Library10 Library10
#> D19-13169#CGGACTGAGTTTACGC-1 D19-13169 Library10 Library10
#> D19-13169#ACCATCCTCCTATCCG-1 D19-13169 Library10 Library10
#> D19-13169#GCTCACTGTGAGGTCA-1 D19-13169 Library10 Library10
#> sample_batch Pathology
#> <character> <character>
#> D19-13169#ACATGCATCAACGTGT-1 D19-13169_Library10 nonAD
#> D19-13169#AACAAAGAGTTCGTTG-1 D19-13169_Library10 nonAD
#> D19-13169#TACGGATTCAGCACGC-1 D19-13169_Library10 nonAD
#> D19-13169#CGGACTGAGTTTACGC-1 D19-13169_Library10 nonAD
#> D19-13169#ACCATCCTCCTATCCG-1 D19-13169_Library10 nonAD
#> D19-13169#GCTCACTGTGAGGTCA-1 D19-13169_Library10 nonAD
peak.cols <- intersect(
c("seqnames", "start", "end", "peak_id", "original_peakid"),
colnames(rowData(microglia_sce))
)
head(rowData(microglia_sce)[, peak.cols])
#> DataFrame with 6 rows and 4 columns
#> seqnames start end peak_id
#> <character> <integer> <integer> <character>
#> Peak61395 chr14 20198863 20199363 Peak61395
#> Peak63173 chr14 76768229 76768729 Peak63173
#> Peak62246 chr14 58139043 58139543 Peak62246
#> Peak62994 chr14 74612443 74612943 Peak62994
#> Peak61749 chr14 24399419 24399919 Peak61749
#> Peak63699 chr14 92647787 92648287 Peak63699The high-level wrapper runs the packaged example end to end. For this vignette, the compact packaged data use a small number of donor-batch units and chr14 peaks so the workflow can run during package checks.
microglia_out <- simPICMicrogliaExample(
min.cells = 20,
pop.cv.bins = 10,
eqtl.n = 0.05,
similarity.scale = 2,
eqtl.dist = 1e8,
sparsify = FALSE,
pca.ntop = 100,
pca.components = 5,
plot.n.samples = 4,
seed = 101,
verbose = TRUE
)names(microglia_out)
#> [1] "sce" "bigcounts" "aggregated" "params"
#> [5] "sim" "comparison_real" "comparison_sim" "plot_samples"
#> [9] "plots" "sample_map" "peak_annot" "vcf"
microglia_out$plot_samples
#> [1] "D19-13169" "D19-13166" "D19-13157" "D19-125457"
microglia_out$params
#> A Params object of class SplatPopParams
#> Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT'
#> Secondary parameters are usually set during simulation
#>
#> Global:
#> (GENES) (CELLS) [SEED]
#> 300 239 527161
#>
#> 54 additional parameters
#>
#> Batches:
#> [BATCHES] [BATCH CELLS] [Location] [Scale]
#> 6 40, 40, 39, 40,... 0.1 0.1
#> [Remove]
#> FALSE
#>
#> Mean:
#> (RATE) (SHAPE)
#> 23.2821639801622 3.94733789609969
#>
#> Library size:
#> (LOCATION) (SCALE) (Norm)
#> 3.77419950862729 0.607742862092932 FALSE
#>
#> Exprs outliers:
#> (PROBABILITY) (LOCATION) (SCALE)
#> 0.0300751879699248 1.16252190468824 0.106161982410875
#>
#> Groups:
#> [Groups] [Group Probs]
#> 1 1
#>
#> Diff expr:
#> [Probability] [Down Prob] [Location] [Scale]
#> 0.1 0.5 0.1 0.4
#>
#> BCV:
#> (COMMON DISP) (DOF)
#> 0.685607421548063 Inf
#>
#> Dropout:
#> [Type] (MIDPOINT) (SHAPE)
#> none 1.03878641236957 -0.748977342342645
#>
#> Paths:
#> [From] [Steps] [Skew] [Non-linear] [Sigma Factor]
#> 0 100 0.5 0.1 0.8
#>
#> Population params:
#> (MEAN.SHAPE) (MEAN.RATE) [POP.QUANT.NORM] [SIMILARITY.SCALE]
#> 11.8901624639648 54.5885689409818 FALSE 2
#> [BATCH.SIZE] [nCells.sample] [nCells.shape] [nCells.rate]
#> 1 FALSE 1.5 0.015
#> [cv.bins]
#> 10
#>
#> (CV.PARAMS)
#> data.frame (10 x 4) with columns: start, end, shape, rate
#> start end shape rate
#> 1 0.000 0.150 11.87344 16.50311
#> 2 0.150 0.158 13.78051 17.24523
#> 3 0.158 0.167 12.99690 16.51243
#> 4 0.167 0.183 22.78417 31.76292
#> # ... with 6 more rows
#>
#> eQTL params:
#> [EQTL.N] [EQTL.DIST]
#> 0.05 1e+08
#> [eqtl.maf.min] [eqtl.maf.max]
#> 0.05 1
#> [eqtl.coreg] [eqtl.group.specific]
#> 0 0.2
#> [eqtl.condition.specific] (eqtl.ES.shape)
#> 0.2 3.6
#> (eqtl.ES.rate)
#> 12
#>
#> Condition params:
#> [nConditions] [condition.prob] [cde.prob] [cde.downProb]
#> 1 1 0.1 0.5
#> [cde.facLoc] [cde.facScale]
#> 0.1 0.4
microglia_out$sim
#> class: SingleCellExperiment
#> dim: 300 239
#> metadata(3): Params Simulated_Means simPIC.population.model
#> assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
#> rownames(300): gene_001 gene_002 ... gene_299 gene_300
#> rowData names(34): MAP46246604.Group1.Gene._
#> MAP46246604.Group1.GeneMean._ ... eQTL.EffectSize
#> ConditionDE.Condition1
#> colnames(239): MAP46246604:Cell1 MAP46246604:Cell2 ... SM-CTECR:Cell39
#> SM-CTECR:Cell40
#> colData names(6): Cell Batch ... Group Condition
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):The workflow returns both general PCA summaries and Poptrial-style comparison plots for the selected library-specific subset.
simPICMicrogliaExample() is a packaged example wrapper
around the lower-level splatPop-style estimation and simulation helpers.
It starts from microglia_sce.rds, prepares the data for
estimation and simulation, and constructs the comparison subsets used in
the diagnostic plots.
The most important returned components are:
sce: the filtered real-data
SingleCellExperiment used for the main population-scale
workflowbigcounts: the real donor-batch unit with the most
cells, used for the count-based part of parameter estimationaggregated: donor-batch aggregated means used for the
population-scale component of estimationparams: the estimated SplatPopParamssim: the simulated
SingleCellExperimentcomparison_real and comparison_sim:
optional library-specific real and simulated objects when a comparison
subset is requestednames(microglia_out)
#> [1] "sce" "bigcounts" "aggregated" "params"
#> [5] "sim" "comparison_real" "comparison_sim" "plot_samples"
#> [9] "plots" "sample_map" "peak_annot" "vcf"
dim(microglia_out$sce)
#> [1] 300 239
dim(microglia_out$bigcounts)
#> [1] 300 40
dim(microglia_out$aggregated)
#> [1] 300 6
dim(microglia_out$sim)
#> [1] 300 239
dim(microglia_out$comparison_real)
#> NULL
dim(microglia_out$comparison_sim)
#> NULLbigcounts?The Microglia workflow follows the splatPop estimation pattern:
Accordingly, bigcounts is not an arbitrary subset. It is
the largest retained sample_batch unit in the filtered real
data, and it is paired with aggregated during parameter
estimation.
microglia_out$bigcounts
#> class: SingleCellExperiment
#> dim: 300 40
#> metadata(0):
#> assays(1): counts
#> rownames(300): Peak61395 Peak63173 ... Peak62680 Peak62933
#> rowData names(19): peak_seqnames peak_start ... N peak_id
#> colnames(40): D19-125457#AACTGGTTCTATACCT-1
#> D19-125457#TTCGATTCACTGCTTC-1 ... D19-125457#ATAGTCGAGACCATAA-1
#> D19-125457#CCTTAATCATGGAGGT-1
#> colData names(43): BlacklistRatio DoubletEnrichment ... Batch
#> sample_batch
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
table(microglia_out$bigcounts$sample_batch)
#>
#> D19-125457_Library4
#> 40
table(microglia_out$bigcounts$Sample)
#>
#> D19-125457
#> 40The default plotting panels use the most abundant retained samples in
the compact example. A specific library or pathology subset can be
requested with comparison.batch and
comparison.pathology when the input data include the
desired groups.
microglia_out$plot_samples
#> [1] "D19-13169" "D19-13166" "D19-13157" "D19-125457"
table(microglia_out$sce$Sample)
#>
#> D19-125457 D19-13157 D19-13165 D19-13166 D19-13169 D19-13182
#> 40 40 39 40 40 40
if ("Sample" %in% colnames(colData(microglia_out$sim))) {
table(colData(microglia_out$sim)$Sample)
}
#>
#> D19-125457 D19-13157 D19-13165 D19-13166 D19-13169 D19-13182
#> 40 39 40 40 40 40The returned plots address slightly different questions:
plots$bluster$cell_pca_by_sample: whether the real and
simulated library show similar sample-level structure in PCA spaceplots$bluster$silhouette_width: how well cells are
separated by sample within the selected libraryplots$bluster$neighborhood_purity: how pure local
neighborhoods are with respect to sample labelsplots$comparison: simple side-by-side PCA panelsThe plots$bluster set is the closest to the
manuscript-style Figure 6 comparison and is the most useful starting
point for assessing whether the simulated data recover the corresponding
real-data structure.
Library
and Pathology labels from the reference Microglia object so
larger user-supplied examples can request focused comparison
panels.sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] simPIC_1.9.0 SingleCellExperiment_1.35.1
#> [3] SummarizedExperiment_1.43.0 Biobase_2.73.1
#> [5] GenomicRanges_1.65.0 Seqinfo_1.3.0
#> [7] IRanges_2.47.2 S4Vectors_0.51.3
#> [9] BiocGenerics_0.59.6 generics_0.1.4
#> [11] MatrixGenerics_1.25.0 matrixStats_1.5.0
#> [13] BiocStyle_2.41.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.3.0 bitops_1.0-9 gridExtra_2.3
#> [4] rlang_1.2.0 magrittr_2.0.5 scater_1.41.1
#> [7] compiler_4.6.0 RSQLite_3.53.1 GenomicFeatures_1.65.0
#> [10] png_0.1-9 vctrs_0.7.3 pkgconfig_2.0.3
#> [13] crayon_1.5.3 fastmap_1.2.0 backports_1.5.1
#> [16] XVector_0.53.0 labeling_0.4.3 scuttle_1.23.1
#> [19] Rsamtools_2.29.0 rmarkdown_2.31 UCSC.utils_1.9.0
#> [22] ggbeeswarm_0.7.3 preprocessCore_1.75.0 bit_4.6.0
#> [25] xfun_0.57 bluster_1.23.0 cachem_1.1.0
#> [28] beachmat_2.29.0 cigarillo_1.3.0 GenomeInfoDb_1.49.1
#> [31] jsonlite_2.0.0 blob_1.3.0 DelayedArray_0.39.3
#> [34] BiocParallel_1.47.0 irlba_2.3.7 parallel_4.6.0
#> [37] cluster_2.1.8.2 R6_2.6.1 VariantAnnotation_1.59.0
#> [40] RColorBrewer_1.1-3 bslib_0.11.0 limma_3.69.1
#> [43] rtracklayer_1.73.0 jquerylib_0.1.4 Rcpp_1.1.1-1.1
#> [46] knitr_1.51 BiocBaseUtils_1.15.1 Matrix_1.7-5
#> [49] splines_4.6.0 igraph_2.3.1 viridis_0.6.5
#> [52] abind_1.4-8 yaml_2.3.12 codetools_0.2-20
#> [55] curl_7.1.0 lattice_0.22-9 S7_0.2.2
#> [58] withr_3.0.2 KEGGREST_1.53.0 evaluate_1.0.5
#> [61] survival_3.8-6 fitdistrplus_1.2-6 Biostrings_2.81.2
#> [64] BiocManager_1.30.27 checkmate_2.3.4 RCurl_1.98-1.18
#> [67] ggplot2_4.0.3 scales_1.4.0 glue_1.8.1
#> [70] maketools_1.3.2 tools_4.6.0 BiocIO_1.23.3
#> [73] BiocNeighbors_2.7.2 sys_3.4.3 ScaledMatrix_1.21.0
#> [76] BSgenome_1.81.0 locfit_1.5-9.12 GenomicAlignments_1.49.0
#> [79] buildtools_1.0.0 XML_3.99-0.23 grid_4.6.0
#> [82] AnnotationDbi_1.75.0 edgeR_4.11.1 beeswarm_0.4.0
#> [85] BiocSingular_1.29.0 vipor_0.4.7 restfulr_0.0.16
#> [88] cli_3.6.6 rsvd_1.0.5 viridisLite_0.4.3
#> [91] S4Arrays_1.13.0 gtable_0.3.6 sass_0.4.10
#> [94] digest_0.6.39 ggrepel_0.9.8 SparseArray_1.13.2
#> [97] rjson_0.2.23 farver_2.1.2 memoise_2.0.1
#> [100] htmltools_0.5.9 lifecycle_1.0.5 httr_1.4.8
#> [103] splatter_1.37.0 statmod_1.5.2 bit64_4.8.2
#> [106] MASS_7.3-65