1 Introduction

The MouseThymusAgeing package provides convenient access to the single-cell RNA sequencing (scRNA-seq) datasets from Baran-Gale et al. (2020). The study used single-cell transcriptomic profiling to resolve how the epithelial composition of the mouse thymus changes with ageing. The datasets from the paper are provided as count matrices with relevant sample-level and feature-level meta-data. All data are provided post-processing and QC. The raw sequencing data can be directly acquired from ArrayExpress using accessions E-MTAB-8560 and E-MTAB-8737.

2 Installation

The package can be installed from Bioconductor. Bioconductor packages can be accessed using the BiocManager package.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MouseThymusAgeing")

To use the package, load it in the typical way.

library(MouseThymusAgeing)

3 Processsing Overview

Detailed experimental protocols are available in the manuscript and analytical details are provided in the accompanying GitHub repo.

This data package contains 2 single-cell data sets from the paper. The first details the initial transcriptomic profiling of defined TEC populations using the plate-based SMART-seq2 chemistry. These cells were sorted from mice at 1, 4, 16, 32 and 52 weeks of age using the following flow cytometry phenotypes:

mTEClo: Cd45- EpCam+ Ly51- Cd80lo MHCIIlo Dsg3-
mTEChi: Cd45- EpCam+ Ly51- Cd80hi MHCIIhi Dsg3-
cTEC: Cd45- EpCam+ Ly51+ Cd80-
Dsg3+ TEC: Cd45- EpCam+ Ly51- Cd80lo MHCIIlo Dsg3+

In each case cells were sorted from 5 separate mice at each age into a 384 well plate containing lysis buffer, with cells from different ages and days block sorted into different areas of each plate to minimise the confounding between batch effects, mouse age and sorted subpopulation. The single-cell libraries were prepared according to the SMART-seq2 protocol and sequenced on an Illumina NovaSeq 6000.

The computational processing invovled the following steps:

Read trimming with trimmomatic, prior to alignment to the mm10 genome with STAR v2.5.3a, and read de-duplication using Picard Tools.
FeatureCounts was used to count reads on each gene using the Ensembl v95 annotation.
Poor quality cells were removed based on excess of ERCC92 spiked-in sequences, poor sequencing coverage and low gene-detection.
Gene counts were normalized using size factors estimated using computeSumFactors() function from scran (L. Lun, Bach, and Marioni 2016).
Highly variable genes were identified, and used to subset the log (+1 pseudocount) normalised expression prior to PCA.
Shared nearest neighbour (SNN) graph building was performed using igraph and cells were clustered using the Walktrap community detection algorithm (Pons and Latapy 2005). Clusters were manually annotated based on inspecting the expression of marker genes.

The second dataset contains cells that were profiling from TEC at 8, 20 and 36 weeks old, derived from a transgenic model system that is also able to lineage trace cells that derive from those that express the thymoproteasomal gene, \(\beta\)-5t. When this gene is expressed it drives the expression of a fluorescent reporter gene, ZsGreen (ZsG). The mouse is denoted \(\mbox{3xtg}^{\beta5t}\). Each mouse (3 replicates per age) first had their transgene induced using doxycycline, and 4 weeks later the TEC were collected by flow cytometry in separate ZsG+ and ZsG- groups. Within each of these groups cells were FAC-sorted into mTEC (Cd45+EpCam+MHCII+Ly51-UEA1+) and cTEC (Cd45+EpCam+Ly51+UEA1+) populations. For this experiment we made us of recent developments in multiplexing with hashtag oligos (HTO; cell-hashing)(Stoeckius et al. 2018). Consequently, the cells were super-loaded onto the 10X Genomics Chromium chips before library prep and sequencing on an Illumina NovaSeq 6000.

The computational processing for these data is different to above. Specifically:

Demultiplexing, read alignment, UMI deduplication and feature counting were all performed using Cellranger v3.1.0.
Non-empty droplets were called separately using the emptyDrops() from the DropletUtils (participants in the 1st Human Cell Atlas Jamboree et al. 2019).
Cells were de-multiplexed into specific samples by assigning each cell to its best-matching hashtag oligo using the approach by (Stoeckius et al. 2018) - this was also able to identify multiplet cells containing multiple HTOs.
Cells were filtered based on low sequencing coverage and high deviation of mitochondrial content from the population median.
Size factors were estimated as above using computeSumFactors() function from scran (L. Lun, Bach, and Marioni 2016), and used for normalization with a log(X + 1) transformation.
A SNN-graph was then built using igraph as above, and cell were also clustered using Walktrap community detection algorithm (Pons and Latapy 2005). These clusters were annotated with concordant labels from the above data set. The exception being that many more clusters were identified, and thus each cluster was suffixed with a number to uniquely identify them.

4 Package data format

The SMART-seq2 data is stored in subsets according to the sorting day (numbered 1-5). For the droplet data, the data can be accessed according to the specific multiplexed samples (6 in total). For the SMART-seq2 the exported object SMARTseqMetadata provides the relevant metadata information for each sorting day, the equivalent object DropletMetadata contains the relevant information for each separate sample. Specific descriptions of each column can be accessed using ?SMARTseqMetadata and ?DropletMetadata.

head(SMARTseqMetadata, n = 5)

##   sample  Age Gender ncells
## 1   day1  1wk female     10
## 2   day1  4wk female     19
## 3   day1 16wk female     10
## 4   day1 32wk female     43
## 5   day1 52wk female     52

All of the data access functions allow you to select the particular samples or sorting days that you would like to access for the relevant data set. By loading only the samples or sorting days that you are interested in for your particular analysis, you will save time when downloading and loading the data, and also reduce memory consumption on your machine.

Droplet single-cell experiments tend to be much larger owing to the ability to encapsulate and process many more cells than in either 96- or 384-well plates. The droplet scRNA-seq made use of hashtag oligonucleotides to multiplex samples, allowing for replicated experimental design without breaking the bank.

head(DropletMetadata, n = 5)

##        sample Gender       HTO ncells
## 1 ZsG_1stRun1 female  Wk1_ZsGp   5160
## 2 ZsG_1stRun1 female  Wk4_ZsGp   3902
## 3 ZsG_1stRun1 female Wk16_ZsGp   3798
## 4 ZsG_1stRun2 female  Wk1_ZsGp   3360
## 5 ZsG_1stRun2 female  Wk4_ZsGp   2355

4.1 Data access

Package data are provided as SingleCellExperiment objects, an extension of the Bioconductor SummarizedExperiment object for high-throughput omics experiment data. SingleCellExperiment object uses memory-efficient storage and sparse matrices to store the single-cell experiment data, whilst allowing the layering of additional feature- and cell-wise meta-data to facilitate single-cell analyses. This section will detail how to access and interact with these objects from the MouseThymusAgeing package.

smart.sce <- MouseSMARTseqData(samples="day2")
smart.sce

## class: SingleCellExperiment 
## dim: 48801 661 
## metadata(0):
## assays(1): counts
## rownames(48801): ERCC-00002 ERCC-00003 ... ENSMUSG00000064371
##   ENSMUSG00000064372
## rowData names(6): Geneid Chr ... Strand Length
## colnames(661): B1.B002294.2_32.4.52.1_S109 B10.B002294.2_32.4.52.1_S118
##   ... P9.B002284.2_52.1.32.1_S9 P9.B002294.2_32.4.52.1_S153
## colData names(11): CellID ClusterID ... SubType sizeFactor
## reducedDimNames(1): PCA
## mainExpName: NULL
## altExpNames(0):

The gene counts are stored in the assays(sce, "counts") slot, which can be accessed using the convenience function counts. The gene counts are stored in a memory efficient sparse matrix class from the Matrix package.

head(counts(smart.sce)[, 1:10])

##            B1.B002294.2_32.4.52.1_S109 B10.B002294.2_32.4.52.1_S118
## ERCC-00002                        2071                         2739
## ERCC-00003                           1                           70
## ERCC-00004                         179                          533
## ERCC-00009                         507                            0
## ERCC-00012                           0                            0
## ERCC-00013                           0                            0
##            B11.B002284.2_52.1.32.1_S275 B11.B002294.2_32.4.52.1_S119
## ERCC-00002                         2612                         1722
## ERCC-00003                           47                          140
## ERCC-00004                          193                          210
## ERCC-00009                          898                            0
## ERCC-00012                            0                            0
## ERCC-00013                            0                            0
##            B13.B002294.2_32.4.52.1_S121 B14.B002284.2_52.1.32.1_S278
## ERCC-00002                         1853                         1987
## ERCC-00003                          130                          111
## ERCC-00004                          417                           60
## ERCC-00009                          487                            0
## ERCC-00012                            0                            0
## ERCC-00013                            0                            0
##            B14.B002294.2_32.4.52.1_S122 B14.B002295.2_1.16.4.1_S206
## ERCC-00002                         1788                       14714
## ERCC-00003                          112                          13
## ERCC-00004                          183                        1251
## ERCC-00009                          479                           0
## ERCC-00012                            0                           0
## ERCC-00013                            0                           0
##            B16.B002284.2_52.1.32.1_S280 B17.B002283.2_4.52.16.1_S197
## ERCC-00002                         2458                         3697
## ERCC-00003                           35                          150
## ERCC-00004                           62                          387
## ERCC-00009                          236                            0
## ERCC-00012                            0                            0
## ERCC-00013                            0                            0

The normalisation factors per cell can be accessed using the sizeFactors() function.

head(sizeFactors((smart.sce)))

## [1] 1.3178659 1.7034129 1.5202621 0.4705083 2.2748982 1.8011869

These are used to normalise the data. To generate single-cell expression values on a log-normal scale, we can apply the logNormCounts from the scuttle package. This will add the logcounts entry to the assays slot in our object.

library(scuttle)
smart.sce <- logNormCounts(smart.sce)

With these normalised counts we can perform our standard down-stream analytical tasks, such as identifying highly variable genes, projecting cells into a reduced dimensional space and clustering using a nearest-neighbour graph. You can further inspect the cell-wise meta-data attached to each dataset, stored in the colData for each SingleCellExperiment object.

head(colData(smart.sce))

## DataFrame with 6 rows and 11 columns
##                                              CellID ClusterID    Position
##                                         <character> <integer> <character>
## B1.B002294.2_32.4.52.1_S109  B1.B002294.2_32.4.52..         8          B1
## B10.B002294.2_32.4.52.1_S118 B10.B002294.2_32.4.5..         2         B10
## B11.B002284.2_52.1.32.1_S275 B11.B002284.2_52.1.3..         2         B11
## B11.B002294.2_32.4.52.1_S119 B11.B002294.2_32.4.5..         2         B11
## B13.B002294.2_32.4.52.1_S121 B13.B002294.2_32.4.5..         2         B13
## B14.B002284.2_52.1.32.1_S278 B14.B002284.2_52.1.3..         2         B14
##                                  PlateID    Column         Row    SortType
##                              <character> <integer> <character> <character>
## B1.B002294.2_32.4.52.1_S109      B002294         1           B      mTEClo
## B10.B002294.2_32.4.52.1_S118     B002294        10           B      mTEClo
## B11.B002284.2_52.1.32.1_S275     B002284        11           B      mTEClo
## B11.B002294.2_32.4.52.1_S119     B002294        11           B      mTEClo
## B13.B002294.2_32.4.52.1_S121     B002294        13           B      mTEClo
## B14.B002284.2_52.1.32.1_S278     B002284        14           B      mTEClo
##                                SortDay         Age          SubType sizeFactor
##                              <integer> <character>      <character>  <numeric>
## B1.B002294.2_32.4.52.1_S109          2        32wk             nTEC   0.898811
## B10.B002294.2_32.4.52.1_S118         2         4wk Intertypical TEC   1.161762
## B11.B002284.2_52.1.32.1_S275         2         1wk Intertypical TEC   1.036849
## B11.B002294.2_32.4.52.1_S119         2         4wk Intertypical TEC   0.320896
## B13.B002294.2_32.4.52.1_S121         2         4wk Intertypical TEC   1.551526
## B14.B002284.2_52.1.32.1_S278         2         1wk Intertypical TEC   1.228445

Details of what information is stored can be found in the documentation using ?DropletMetadata and ?SMARTseqMetada. In each object we also have the pre-computed reduced dimensions that can be accessed through the reducedDim(<sce>, "PCA") slot.

5 Session Information

sessionInfo()

## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scuttle_1.8.4               MouseThymusAgeing_1.6.1    
##  [3] SingleCellExperiment_1.20.0 SummarizedExperiment_1.28.0
##  [5] Biobase_2.58.0              GenomicRanges_1.50.2       
##  [7] GenomeInfoDb_1.34.9         IRanges_2.32.0             
##  [9] S4Vectors_0.36.1            BiocGenerics_0.44.0        
## [11] MatrixGenerics_1.10.0       matrixStats_0.63.0         
## [13] BiocStyle_2.26.0           
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.4                    sass_0.4.5                   
##  [3] bit64_4.0.5                   jsonlite_1.8.4               
##  [5] AnnotationHub_3.6.0           DelayedMatrixStats_1.20.0    
##  [7] bslib_0.4.2                   shiny_1.7.4                  
##  [9] assertthat_0.2.1              interactiveDisplayBase_1.36.0
## [11] BiocManager_1.30.19           BiocFileCache_2.6.0          
## [13] blob_1.2.3                    GenomeInfoDbData_1.2.9       
## [15] yaml_2.3.7                    BiocVersion_3.16.0           
## [17] pillar_1.8.1                  RSQLite_2.2.20               
## [19] lattice_0.20-45               beachmat_2.14.0              
## [21] glue_1.6.2                    digest_0.6.31                
## [23] promises_1.2.0.1              XVector_0.38.0               
## [25] htmltools_0.5.4               httpuv_1.6.9                 
## [27] Matrix_1.5-3                  pkgconfig_2.0.3              
## [29] bookdown_0.32                 zlibbioc_1.44.0              
## [31] purrr_1.0.1                   xtable_1.8-4                 
## [33] later_1.3.0                   BiocParallel_1.32.5          
## [35] tibble_3.1.8                  KEGGREST_1.38.0              
## [37] generics_0.1.3                ellipsis_0.3.2               
## [39] withr_2.5.0                   cachem_1.0.6                 
## [41] cli_3.6.0                     crayon_1.5.2                 
## [43] magrittr_2.0.3                mime_0.12                    
## [45] memoise_2.0.1                 evaluate_0.20                
## [47] fansi_1.0.4                   tools_4.2.2                  
## [49] lifecycle_1.0.3               DelayedArray_0.24.0          
## [51] Biostrings_2.66.0             AnnotationDbi_1.60.0         
## [53] compiler_4.2.2                jquerylib_0.1.4              
## [55] rlang_1.0.6                   grid_4.2.2                   
## [57] RCurl_1.98-1.10               rappdirs_0.3.3               
## [59] bitops_1.0-7                  rmarkdown_2.20               
## [61] ExperimentHub_2.6.0           codetools_0.2-19             
## [63] DBI_1.1.3                     curl_5.0.0                   
## [65] R6_2.5.1                      knitr_1.42                   
## [67] dplyr_1.1.0                   fastmap_1.1.0                
## [69] bit_4.0.5                     utf8_1.2.3                   
## [71] filelock_1.0.2                parallel_4.2.2               
## [73] Rcpp_1.0.10                   vctrs_0.5.2                  
## [75] png_0.1-8                     sparseMatrixStats_1.10.0     
## [77] dbplyr_2.3.0                  tidyselect_1.2.0             
## [79] xfun_0.37

References

Baran-Gale, Jeanette, Michael D Morgan, Stefano Maio, Fatima Dhalla, Irene Calvo-Asensio, Mary E Deadman, Adam E Handel, et al. 2020. “Ageing Compromises Mouse Thymus Function and Remodels Epithelial Cell Differentiation.” eLife 9 (August). https://doi.org/10.7554/eLife.56221.

L. Lun, Aaron T., Karsten Bach, and John C. Marioni. 2016. “Pooling Across Cells to Normalize Single-Cell RNA Sequencing Data with Many Zero Counts.” Genome Biology 17 (1). https://doi.org/10.1186/s13059-016-0947-7.

participants in the 1st Human Cell Atlas Jamboree, Aaron T. L. Lun, Samantha Riesenfeld, Tallulah Andrews, The Phuong Dao, Tomas Gomes, and John C. Marioni. 2019. “EmptyDrops: Distinguishing Cells from Empty Droplets in Droplet-Based Single-Cell RNA Sequencing Data.” Genome Biology 20 (1). https://doi.org/10.1186/s13059-019-1662-y.

Pons, P., and M. Latapy. 2005. “Computing Communities in Large Networks Using Random Walks (Long Version).” ArXiv Physics E-Prints, December.

Stoeckius, Marlon, Shiwei Zheng, Brian Houck-Loomis, Stephanie Hao, Bertrand Z. Yeung, William M. Mauck, Peter Smibert, and Rahul Satija. 2018. “Cell Hashing with Barcoded Antibodies Enables Multiplexing and Doublet Detection for Single Cell Genomics.” Genome Biology 19 (1). https://doi.org/10.1186/s13059-018-1603-1.

Overview of the MouseThymusAgeing datasets

November 4, 2020