simPIC Microglia Population Example

Overview

This vignette shows the packaged Microglia example for population-scale simulation with genetic effects inside simPIC. The example uses reusable extdata files so the workflow can be rerun without rebuilding intermediate objects. The comparison plots follow the later Poptrial.Rmd pattern by using bluster to compare PCA structure, silhouette widths, and neighborhood purity between real and simulated cells.

The example follows the splatPop estimation pattern:

  • use a prebuilt SingleCellExperiment with Sample, Library, Batch, and sample_batch metadata
  • aggregate donor-batch means across retained units
  • estimate splatPop parameters from a high-cell donor-batch subset (bigcounts)
  • simulate population-scale peak accessibility from aligned Microglia peaks and donors

The first part provides a quick start for running the packaged example. The second part examines the returned objects in more detail and relates them to the splatPop-style estimation strategy.

suppressPackageStartupMessages({
    library(simPIC)
    library(SingleCellExperiment)
})

microglia_vignette_deps <- c(
    "splatter",
    "VariantAnnotation",
    "bluster",
    "preprocessCore"
)
microglia_vignette_ready <- all(vapply(
    microglia_vignette_deps,
    requireNamespace,
    logical(1),
    quietly = TRUE
))

Packaged example data

The Microglia example ships with:

  • microglia_sce.rds
  • microglia_sample_map.tsv
  • microglia_peak_annot_chr14.tsv
  • microglia_chr14.vcf

The canonical starting point is microglia_sce.rds, which already stores the Microglia counts and cell metadata in a package-ready SingleCellExperiment object.

extdir.candidates <- c(
    file.path("inst", "extdata"),
    file.path("..", "inst", "extdata"),
    system.file("extdata", package = "simPIC")
)
extdir <- extdir.candidates[
    file.exists(file.path(extdir.candidates, "microglia_sce.rds"))
][1]
stopifnot(!is.na(extdir), nzchar(extdir), dir.exists(extdir))
list.files(extdir, pattern = "^microglia")
#> [1] "microglia_chr14.vcf"            "microglia_peak_annot_chr14.tsv"
#> [3] "microglia_sample_map.tsv"       "microglia_sce.rds"
microglia_sce <- readRDS(file.path(extdir, "microglia_sce.rds"))
microglia_sce
#> class: SingleCellExperiment 
#> dim: 300 240 
#> metadata(0):
#> assays(1): counts
#> rownames(300): Peak61395 Peak63173 ... Peak62680 Peak62933
#> rowData names(19): seqnames start ... N peak_id
#> colnames(240): D19-13169#ACATGCATCAACGTGT-1
#>   D19-13169#AACAAAGAGTTCGTTG-1 ... D19-13182#CGTAAACGTATCGCGC-1
#>   D19-13182#TGCTTTAAGAGGCAGG-1
#> colData names(43): BlacklistRatio DoubletEnrichment ... Batch
#>   sample_batch
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
head(colData(microglia_sce)[, c("Sample", "Library", "Batch", "sample_batch", "Pathology")])
#> DataFrame with 6 rows and 5 columns
#>                                   Sample     Library       Batch
#>                              <character> <character> <character>
#> D19-13169#ACATGCATCAACGTGT-1   D19-13169   Library10   Library10
#> D19-13169#AACAAAGAGTTCGTTG-1   D19-13169   Library10   Library10
#> D19-13169#TACGGATTCAGCACGC-1   D19-13169   Library10   Library10
#> D19-13169#CGGACTGAGTTTACGC-1   D19-13169   Library10   Library10
#> D19-13169#ACCATCCTCCTATCCG-1   D19-13169   Library10   Library10
#> D19-13169#GCTCACTGTGAGGTCA-1   D19-13169   Library10   Library10
#>                                     sample_batch   Pathology
#>                                      <character> <character>
#> D19-13169#ACATGCATCAACGTGT-1 D19-13169_Library10       nonAD
#> D19-13169#AACAAAGAGTTCGTTG-1 D19-13169_Library10       nonAD
#> D19-13169#TACGGATTCAGCACGC-1 D19-13169_Library10       nonAD
#> D19-13169#CGGACTGAGTTTACGC-1 D19-13169_Library10       nonAD
#> D19-13169#ACCATCCTCCTATCCG-1 D19-13169_Library10       nonAD
#> D19-13169#GCTCACTGTGAGGTCA-1 D19-13169_Library10       nonAD
peak.cols <- intersect(
    c("seqnames", "start", "end", "peak_id", "original_peakid"),
    colnames(rowData(microglia_sce))
)
head(rowData(microglia_sce)[, peak.cols])
#> DataFrame with 6 rows and 4 columns
#>              seqnames     start       end     peak_id
#>           <character> <integer> <integer> <character>
#> Peak61395       chr14  20198863  20199363   Peak61395
#> Peak63173       chr14  76768229  76768729   Peak63173
#> Peak62246       chr14  58139043  58139543   Peak62246
#> Peak62994       chr14  74612443  74612943   Peak62994
#> Peak61749       chr14  24399419  24399919   Peak61749
#> Peak63699       chr14  92647787  92648287   Peak63699

Quick Start

The high-level wrapper runs the packaged example end to end. For this vignette, the compact packaged data use a small number of donor-batch units and chr14 peaks so the workflow can run during package checks.

microglia_out <- simPICMicrogliaExample(
    min.cells = 20,
    pop.cv.bins = 10,
    eqtl.n = 0.05,
    similarity.scale = 2,
    eqtl.dist = 1e8,
    sparsify = FALSE,
    pca.ntop = 100,
    pca.components = 5,
    plot.n.samples = 4,
    seed = 101,
    verbose = TRUE
)
names(microglia_out)
#>  [1] "sce"             "bigcounts"       "aggregated"      "params"         
#>  [5] "sim"             "comparison_real" "comparison_sim"  "plot_samples"   
#>  [9] "plots"           "sample_map"      "peak_annot"      "vcf"
microglia_out$plot_samples
#> [1] "D19-13169"  "D19-13166"  "D19-13157"  "D19-125457"
microglia_out$params
#> A Params object of class SplatPopParams 
#> Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
#> Secondary parameters are usually set during simulation
#> 
#> Global: 
#> (GENES)  (CELLS)   [SEED] 
#>     300      239   527161 
#> 
#> 54 additional parameters 
#> 
#> Batches: 
#>          [BATCHES]       [BATCH CELLS]          [Location]             [Scale] 
#>                  6  40, 40, 39, 40,...                 0.1                 0.1 
#>           [Remove] 
#>              FALSE 
#> 
#> Mean: 
#>           (RATE)           (SHAPE) 
#> 23.2821639801622  3.94733789609969 
#> 
#> Library size: 
#>        (LOCATION)            (SCALE)             (Norm) 
#>  3.77419950862729  0.607742862092932              FALSE 
#> 
#> Exprs outliers: 
#>      (PROBABILITY)          (LOCATION)             (SCALE) 
#> 0.0300751879699248    1.16252190468824   0.106161982410875 
#> 
#> Groups: 
#>      [Groups]  [Group Probs] 
#>             1              1 
#> 
#> Diff expr: 
#> [Probability]    [Down Prob]     [Location]        [Scale] 
#>           0.1            0.5            0.1            0.4 
#> 
#> BCV: 
#>     (COMMON DISP)              (DOF) 
#> 0.685607421548063                Inf 
#> 
#> Dropout: 
#>             [Type]          (MIDPOINT)             (SHAPE) 
#>               none    1.03878641236957  -0.748977342342645 
#> 
#> Paths: 
#>         [From]         [Steps]          [Skew]    [Non-linear]  [Sigma Factor] 
#>              0             100             0.5             0.1             0.8 
#> 
#> Population params: 
#>       (MEAN.SHAPE)         (MEAN.RATE)    [POP.QUANT.NORM]  [SIMILARITY.SCALE] 
#>   11.8901624639648    54.5885689409818               FALSE                   2 
#>       [BATCH.SIZE]     [nCells.sample]      [nCells.shape]       [nCells.rate] 
#>                  1               FALSE                 1.5               0.015 
#>          [cv.bins] 
#>                 10 
#> 
#> (CV.PARAMS)
#> data.frame (10 x 4) with columns: start, end, shape, rate 
#>   start   end    shape     rate
#> 1 0.000 0.150 11.87344 16.50311
#> 2 0.150 0.158 13.78051 17.24523
#> 3 0.158 0.167 12.99690 16.51243
#> 4 0.167 0.183 22.78417 31.76292
#> # ... with 6 more rows
#> 
#> eQTL params: 
#>                  [EQTL.N]                [EQTL.DIST] 
#>                      0.05                      1e+08 
#>            [eqtl.maf.min]             [eqtl.maf.max] 
#>                      0.05                          1 
#>              [eqtl.coreg]      [eqtl.group.specific] 
#>                         0                        0.2 
#> [eqtl.condition.specific]            (eqtl.ES.shape) 
#>                       0.2                        3.6 
#>            (eqtl.ES.rate) 
#>                        12 
#> 
#> Condition params: 
#>    [nConditions]  [condition.prob]        [cde.prob]    [cde.downProb] 
#>                1                 1               0.1               0.5 
#>     [cde.facLoc]    [cde.facScale] 
#>              0.1               0.4
microglia_out$sim
#> class: SingleCellExperiment 
#> dim: 300 239 
#> metadata(3): Params Simulated_Means simPIC.population.model
#> assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
#> rownames(300): gene_001 gene_002 ... gene_299 gene_300
#> rowData names(34): MAP46246604.Group1.Gene._
#>   MAP46246604.Group1.GeneMean._ ... eQTL.EffectSize
#>   ConditionDE.Condition1
#> colnames(239): MAP46246604:Cell1 MAP46246604:Cell2 ... SM-CTECR:Cell39
#>   SM-CTECR:Cell40
#> colData names(6): Cell Batch ... Group Condition
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

Quick-Start Plots

The workflow returns both general PCA summaries and Poptrial-style comparison plots for the selected library-specific subset.

microglia_out$plots$comparison$cell_pca_by_sample

microglia_out$plots$comparison$sample_pca


microglia_out$plots$bluster$cell_pca_by_sample

microglia_out$plots$bluster$silhouette_width

microglia_out$plots$bluster$neighborhood_purity

Detailed Look

simPICMicrogliaExample() is a packaged example wrapper around the lower-level splatPop-style estimation and simulation helpers. It starts from microglia_sce.rds, prepares the data for estimation and simulation, and constructs the comparison subsets used in the diagnostic plots.

The most important returned components are:

  • sce: the filtered real-data SingleCellExperiment used for the main population-scale workflow
  • bigcounts: the real donor-batch unit with the most cells, used for the count-based part of parameter estimation
  • aggregated: donor-batch aggregated means used for the population-scale component of estimation
  • params: the estimated SplatPopParams
  • sim: the simulated SingleCellExperiment
  • comparison_real and comparison_sim: optional library-specific real and simulated objects when a comparison subset is requested
names(microglia_out)
#>  [1] "sce"             "bigcounts"       "aggregated"      "params"         
#>  [5] "sim"             "comparison_real" "comparison_sim"  "plot_samples"   
#>  [9] "plots"           "sample_map"      "peak_annot"      "vcf"
dim(microglia_out$sce)
#> [1] 300 239
dim(microglia_out$bigcounts)
#> [1] 300  40
dim(microglia_out$aggregated)
#> [1] 300   6
dim(microglia_out$sim)
#> [1] 300 239
dim(microglia_out$comparison_real)
#> NULL
dim(microglia_out$comparison_sim)
#> NULL

Why bigcounts?

The Microglia workflow follows the splatPop estimation pattern:

  • aggregated donor-batch means capture the population-scale structure
  • a single high-cell donor-batch subset provides the count-based information for single-cell estimation

Accordingly, bigcounts is not an arbitrary subset. It is the largest retained sample_batch unit in the filtered real data, and it is paired with aggregated during parameter estimation.

microglia_out$bigcounts
#> class: SingleCellExperiment 
#> dim: 300 40 
#> metadata(0):
#> assays(1): counts
#> rownames(300): Peak61395 Peak63173 ... Peak62680 Peak62933
#> rowData names(19): peak_seqnames peak_start ... N peak_id
#> colnames(40): D19-125457#AACTGGTTCTATACCT-1
#>   D19-125457#TTCGATTCACTGCTTC-1 ... D19-125457#ATAGTCGAGACCATAA-1
#>   D19-125457#CCTTAATCATGGAGGT-1
#> colData names(43): BlacklistRatio DoubletEnrichment ... Batch
#>   sample_batch
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
table(microglia_out$bigcounts$sample_batch)
#> 
#> D19-125457_Library4 
#>                  40
table(microglia_out$bigcounts$Sample)
#> 
#> D19-125457 
#>         40

Plotting Subset

The default plotting panels use the most abundant retained samples in the compact example. A specific library or pathology subset can be requested with comparison.batch and comparison.pathology when the input data include the desired groups.

microglia_out$plot_samples
#> [1] "D19-13169"  "D19-13166"  "D19-13157"  "D19-125457"
table(microglia_out$sce$Sample)
#> 
#> D19-125457  D19-13157  D19-13165  D19-13166  D19-13169  D19-13182 
#>         40         40         39         40         40         40
if ("Sample" %in% colnames(colData(microglia_out$sim))) {
    table(colData(microglia_out$sim)$Sample)
}
#> 
#> D19-125457  D19-13157  D19-13165  D19-13166  D19-13169  D19-13182 
#>         40         39         40         40         40         40

Interpreting The Plots

The returned plots address slightly different questions:

  • plots$bluster$cell_pca_by_sample: whether the real and simulated library show similar sample-level structure in PCA space
  • plots$bluster$silhouette_width: how well cells are separated by sample within the selected library
  • plots$bluster$neighborhood_purity: how pure local neighborhoods are with respect to sample labels
  • plots$comparison: simple side-by-side PCA panels

The plots$bluster set is the closest to the manuscript-style Figure 6 comparison and is the most useful starting point for assessing whether the simulated data recover the corresponding real-data structure.

Notes

  • This example is intentionally focused on donor- and batch-aware population simulation.
  • Pathology is not used in the core population estimation and simulation model, but it can be used to define a comparison subset when the packaged or user-supplied data contain the desired pathology groups.
  • The packaged chr14 peak annotation is treated as the peak-level source of truth for the example.
  • The packaged example SCE preserves the original Library and Pathology labels from the reference Microglia object so larger user-supplied examples can request focused comparison panels.

Session information

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] simPIC_1.9.0                SingleCellExperiment_1.35.1
#>  [3] SummarizedExperiment_1.43.0 Biobase_2.73.1             
#>  [5] GenomicRanges_1.65.0        Seqinfo_1.3.0              
#>  [7] IRanges_2.47.2              S4Vectors_0.51.3           
#>  [9] BiocGenerics_0.59.6         generics_0.1.4             
#> [11] MatrixGenerics_1.25.0       matrixStats_1.5.0          
#> [13] BiocStyle_2.41.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.3.0                bitops_1.0-9             gridExtra_2.3           
#>   [4] rlang_1.2.0              magrittr_2.0.5           scater_1.41.1           
#>   [7] compiler_4.6.0           RSQLite_3.53.1           GenomicFeatures_1.65.0  
#>  [10] png_0.1-9                vctrs_0.7.3              pkgconfig_2.0.3         
#>  [13] crayon_1.5.3             fastmap_1.2.0            backports_1.5.1         
#>  [16] XVector_0.53.0           labeling_0.4.3           scuttle_1.23.1          
#>  [19] Rsamtools_2.29.0         rmarkdown_2.31           UCSC.utils_1.9.0        
#>  [22] ggbeeswarm_0.7.3         preprocessCore_1.75.0    bit_4.6.0               
#>  [25] xfun_0.57                bluster_1.23.0           cachem_1.1.0            
#>  [28] beachmat_2.29.0          cigarillo_1.3.0          GenomeInfoDb_1.49.1     
#>  [31] jsonlite_2.0.0           blob_1.3.0               DelayedArray_0.39.3     
#>  [34] BiocParallel_1.47.0      irlba_2.3.7              parallel_4.6.0          
#>  [37] cluster_2.1.8.2          R6_2.6.1                 VariantAnnotation_1.59.0
#>  [40] RColorBrewer_1.1-3       bslib_0.11.0             limma_3.69.1            
#>  [43] rtracklayer_1.73.0       jquerylib_0.1.4          Rcpp_1.1.1-1.1          
#>  [46] knitr_1.51               BiocBaseUtils_1.15.1     Matrix_1.7-5            
#>  [49] splines_4.6.0            igraph_2.3.1             viridis_0.6.5           
#>  [52] abind_1.4-8              yaml_2.3.12              codetools_0.2-20        
#>  [55] curl_7.1.0               lattice_0.22-9           S7_0.2.2                
#>  [58] withr_3.0.2              KEGGREST_1.53.0          evaluate_1.0.5          
#>  [61] survival_3.8-6           fitdistrplus_1.2-6       Biostrings_2.81.2       
#>  [64] BiocManager_1.30.27      checkmate_2.3.4          RCurl_1.98-1.18         
#>  [67] ggplot2_4.0.3            scales_1.4.0             glue_1.8.1              
#>  [70] maketools_1.3.2          tools_4.6.0              BiocIO_1.23.3           
#>  [73] BiocNeighbors_2.7.2      sys_3.4.3                ScaledMatrix_1.21.0     
#>  [76] BSgenome_1.81.0          locfit_1.5-9.12          GenomicAlignments_1.49.0
#>  [79] buildtools_1.0.0         XML_3.99-0.23            grid_4.6.0              
#>  [82] AnnotationDbi_1.75.0     edgeR_4.11.1             beeswarm_0.4.0          
#>  [85] BiocSingular_1.29.0      vipor_0.4.7              restfulr_0.0.16         
#>  [88] cli_3.6.6                rsvd_1.0.5               viridisLite_0.4.3       
#>  [91] S4Arrays_1.13.0          gtable_0.3.6             sass_0.4.10             
#>  [94] digest_0.6.39            ggrepel_0.9.8            SparseArray_1.13.2      
#>  [97] rjson_0.2.23             farver_2.1.2             memoise_2.0.1           
#> [100] htmltools_0.5.9          lifecycle_1.0.5          httr_1.4.8              
#> [103] splatter_1.37.0          statmod_1.5.2            bit64_4.8.2             
#> [106] MASS_7.3-65