The MouseGastrulationData package provides convenient access to the single-cell RNA sequencing (scRNA-seq) datasets from Pijuan-Sala et al. (2019), and additional data generated in similar systems. This study focuses on mouse gastrulation and organogenesis, providing transcriptomic profiles at single-cell resolution across several stages of early development. Datasets are provided as count matrices with additional feature- and sample-level metadata after processing. Raw sequencing data can be acquired from ArrayExpress accession E-MTAB-6967.
The package may be installed from Bioconductor. Bioconductor packages can be accessed using the BiocManager package.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("MouseGastrulationData")
BiocManager also supports installation of the development version of the package from Github.
BiocManager::install("MarioniLab/MouseGastrulationData")
To use the package, load it in the typical way.
library(MouseGastrulationData)
Detailed methods are available in the methods that accompany the paper, or from the code in the corresponding Github repository. Briefly, whole embryos were dissociated at timepoints between embryonic days (E) 6.5 and 8.5 of development. Libraries were generated using the 10x Genomics Chromium platform (v1 chemistry) and sequenced on the Illumina HiSeq 2500. The computational analysis involved a number of steps:
swappedDrops()
function from DropletUtils (Griffiths et al. 2018).emptyDrops()
function from DropletUtils (Lun et al. 2019).computeSumFactors()
function from scran (Lun, Bach, and Marioni 2016).doubletCells()
function from scran.fastMNN()
from scran (Haghverdi et al. 2018).buildSNNGraph()
(from scran)
and cluster_louvain
(from igraph), and were annotated and merged into interpretable units by hand.The data accessible via this package is stored in subsets according to the different 10x samples that were generated.
For the embryo atlas, the exported object AtlasSampleMetadata
provides metadata information for each of the samples.
Descriptions of the contents of each column can be accessed using ?AtlasSampleMetadata
.
head(AtlasSampleMetadata, n = 3)
## sample stage pool_index seq_batch ncells
## 1 1 E6.5 1 1 360
## 2 2 E7.5 2 1 356
## 3 3 E7.5 3 1 458
All data access functions allow you to select the particular samples you would like to access. By loading only the samples that you are interested in for your particular analysis, you will save time when downloading and loading the data, and also reduce memory consumption on your machine.
The package provides the dataset in the form of a SingleCellExperiment
object.
This section details how you can interact with the object.
We load in only one of the samples from the atlas to reduce memory consumption when compiling this vignette.
sce <- EmbryoAtlasData(samples = 21)
sce
## class: SingleCellExperiment
## dim: 29452 4651
## metadata(0):
## assays(1): counts
## rownames(29452): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000096730 ENSMUSG00000095742
## rowData names(2): ENSEMBL SYMBOL
## colnames(4651): cell_52466 cell_52467 ... cell_57115 cell_57116
## colData names(17): cell barcode ... colour sizeFactor
## reducedDimNames(2): pca.corrected umap
## altExpNames(0):
We use the counts()
function to retrieve the count matrix.
These are stored as a sparse matrix, as implemented in the Matrix package.
counts(sce)[6:9, 1:3]
## 4 x 3 sparse Matrix of class "dgTMatrix"
## cell_52466 cell_52467 cell_52468
## ENSMUSG00000104328 . . .
## ENSMUSG00000033845 6 8 10
## ENSMUSG00000025903 . . .
## ENSMUSG00000104217 . . .
Size factors for normalisation are present in the object and are accessed with the sizeFactors()
function.
head(sizeFactors(sce))
## [1] 0.8845695 1.4688375 1.2512019 0.8287969 1.3668086 0.9247460
After running scater’s normalize
function on the SingleCellExperiment object, normalised or log-transformed counts can be accessed using normcounts
and logcounts
.
These are not demonstrated in this vignette to avoid a dependency on scater.
The MGI symbol and Ensembl gene ID for each gene is stored in the rowData
of the SingleCellExperiment
object.
head(rowData(sce))
## DataFrame with 6 rows and 2 columns
## ENSEMBL SYMBOL
## <character> <character>
## ENSMUSG00000051951 ENSMUSG00000051951 Xkr4
## ENSMUSG00000089699 ENSMUSG00000089699 Gm1992
## ENSMUSG00000102343 ENSMUSG00000102343 Gm37381
## ENSMUSG00000025900 ENSMUSG00000025900 Rp1
## ENSMUSG00000025902 ENSMUSG00000025902 Sox17
## ENSMUSG00000104328 ENSMUSG00000104328 Gm37323
The colData
contains cell-specific attributes.
The meaning of each field is detailed in the function documentation (?EmbryoAtlasData
).
head(colData(sce))
## DataFrame with 6 rows and 17 columns
## cell barcode sample pool stage
## <character> <character> <integer> <integer> <character>
## cell_52466 cell_52466 AAACATACACGGAG 21 17 mixed_gastrulation
## cell_52467 cell_52467 AAACATACCCAACA 21 17 mixed_gastrulation
## cell_52468 cell_52468 AAACATACTTGCGA 21 17 mixed_gastrulation
## cell_52469 cell_52469 AAACATTGATCGGT 21 17 mixed_gastrulation
## cell_52470 cell_52470 AAACATTGCTTATC 21 17 mixed_gastrulation
## cell_52471 cell_52471 AAACATTGGTTCGA 21 17 mixed_gastrulation
## sequencing.batch theiler doub.density doublet cluster
## <integer> <character> <numeric> <logical> <integer>
## cell_52466 2 TS9-10 0.0315539 FALSE 14
## cell_52467 2 TS9-10 0.1362419 FALSE 3
## cell_52468 2 TS9-10 0.7468976 FALSE 2
## cell_52469 2 TS9-10 0.2704532 FALSE 1
## cell_52470 2 TS9-10 0.2226039 FALSE 19
## cell_52471 2 TS9-10 0.3261519 FALSE 5
## cluster.sub cluster.stage cluster.theiler stripped
## <integer> <integer> <integer> <logical>
## cell_52466 2 5 5 FALSE
## cell_52467 6 12 12 FALSE
## cell_52468 3 3 3 FALSE
## cell_52469 3 1 1 FALSE
## cell_52470 1 5 5 FALSE
## cell_52471 1 4 4 FALSE
## celltype colour sizeFactor
## <character> <character> <numeric>
## cell_52466 Blood progenitors 2 c9a997 0.884569
## cell_52467 ExE ectoderm 989898 1.468838
## cell_52468 Epiblast 635547 1.251202
## cell_52469 Rostral neurectoderm 65A83E 0.828797
## cell_52470 Haematoendothelial progenitors FBBE92 1.366809
## cell_52471 Nascent mesoderm C594BF 0.924746
Batch-corrected PCA representations of the data are available via the reducedDim
function, in the pca.corrected
slot.
This representation contains NA
values for cells that are doublets, or cytoplasm-stripped nuclei.
A vector of celltype colours (as used in the paper) is also provided in the exported object EmbryoCelltypeColours
.
Its use is shown below.
#exclude technical artefacts
singlets <- which(!(colData(sce)$doublet | colData(sce)$stripped))
plot(
x = reducedDim(sce, "umap")[singlets, 1],
y = reducedDim(sce, "umap")[singlets, 2],
col = EmbryoCelltypeColours[colData(sce)$celltype[singlets]],
pch = 19,
xaxt = "n", yaxt = "n",
xlab = "UMAP1", ylab = "UMAP2"
)
Unfiltered count matrices are also available from MouseGastrulationData.
This refers to count matrices where swapped molecules have been removed but no cells have been called.
They can be obtained using the EmbryoAtlasData()
function and are returned as SingleCellExperiment
objects.
unfilt <- EmbryoAtlasData(type="raw", samples=c(1:2))
sapply(unfilt, dim)
## 1 2
## [1,] 29452 29452
## [2,] 117107 107802
These unfiltered matrices may be useful if you want to perform tests of cell-calling analyses, or analyses which use the ambient pool of RNA in 10x samples. Note that empty columns are excluded from these matrices.
Data from experiments involving chimeric embryos in Pijuan-Sala et al. (2019) are also available from this package. In these embryos, a population of fluorescent embryonic stem cells were injected into wild-type E3.5 mouse embryos. The embryos were then returned to a parent mouse, and allowed to develop normally until collection. The cells were flow-sorted to purify host and injected populations, libraries were generated using 10x version 2 chemistry and sequencing was performed on the HiSeq 4000.
Chimeras are especially effective for studying the effect of knockouts of essential developmental genes. We inject stem cells that possess a knockout of a particular gene, and allow the resulting chimeric embryo to develop. Both injected and host cells contribute to the different tissues in the mouse. The presence of the wild-type host cells allows the embryo to compensate and avoid gross developmental failures, while cells with the knockout are also captured, and their aberrant behaviour can be studied.
The package contains two chimeric datasets:
WTChimeraData()
function.Tal1ChimeraData()
function.The processed data for each experiment are provided as a SingleCellExperiment
, as for the previously described atlas data.
However, there are a few small differences:
colData
field tomato
.colData
field pool
.Unfiltered count matrices are also provided for each sample in these datasets.
A user might want to use these data outside of the Bioconductor framework in which it is provided from this package.
Fortunately, there are several packages available for R that facilitate this.
The LoomExperiment package will allow you to save the data as a .loom
file, which would be easily imported into python for use with e.g. scanpy.
Alternatively, you might want to consider loomR, which is available through Github.
Seurat has a function as.Seurat
to directly convert SingleCellExperiment files directly to Seurat-friendly objects.
In any case, it is likely that this package is the easiest way to access the mouse gastrulation datasets, regardless of how you wish to analyse it downstream.
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] MouseGastrulationData_1.2.0 SingleCellExperiment_1.10.1
## [3] SummarizedExperiment_1.18.1 DelayedArray_0.14.0
## [5] matrixStats_0.56.0 Biobase_2.48.0
## [7] GenomicRanges_1.40.0 GenomeInfoDb_1.24.0
## [9] IRanges_2.22.1 S4Vectors_0.26.0
## [11] BiocGenerics_0.34.0 BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4.6 lattice_0.20-41
## [3] assertthat_0.2.1 digest_0.6.25
## [5] mime_0.9 BiocFileCache_1.12.0
## [7] R6_2.4.1 RSQLite_2.2.0
## [9] evaluate_0.14 httr_1.4.1
## [11] pillar_1.4.4 zlibbioc_1.34.0
## [13] rlang_0.4.6 curl_4.3
## [15] blob_1.2.1 magick_2.3
## [17] Matrix_1.2-18 rmarkdown_2.1
## [19] AnnotationHub_2.20.0 stringr_1.4.0
## [21] RCurl_1.98-1.2 bit_1.1-15.2
## [23] shiny_1.4.0.2 compiler_4.0.0
## [25] httpuv_1.5.2 xfun_0.13
## [27] pkgconfig_2.0.3 htmltools_0.4.0
## [29] tidyselect_1.0.0 tibble_3.0.1
## [31] GenomeInfoDbData_1.2.3 interactiveDisplayBase_1.26.0
## [33] bookdown_0.18 crayon_1.3.4
## [35] dplyr_0.8.5 dbplyr_1.4.3
## [37] later_1.0.0 bitops_1.0-6
## [39] rappdirs_0.3.1 grid_4.0.0
## [41] xtable_1.8-4 lifecycle_0.2.0
## [43] DBI_1.1.0 magrittr_1.5
## [45] stringi_1.4.6 XVector_0.28.0
## [47] promises_1.1.0 ellipsis_0.3.0
## [49] vctrs_0.2.4 tools_4.0.0
## [51] bit64_0.9-7 glue_1.4.0
## [53] BiocVersion_3.11.1 purrr_0.3.4
## [55] fastmap_1.0.1 yaml_2.2.1
## [57] AnnotationDbi_1.50.0 BiocManager_1.30.10
## [59] ExperimentHub_1.14.0 memoise_1.1.0
## [61] knitr_1.28
Griffiths, J. A., A. C. Richard, K. Bach, A. T. L. Lun, and J. C. Marioni. 2018. “Detection and removal of barcode swapping in single-cell RNA-seq data.” Nat Commun 9 (1):2667.
Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnol. 36 (5):421–27.
Lun, A. T. L., S. Riesenfeld, T. Andrews, T. P. Dao, T. Gomes, and J. C. Marioni. 2019. “EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data.” Genome Biol. 20 (1):63.
Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April):75.
Pijuan-Sala, Blanca, Jonathan A. Griffiths, Carolina Guibentif, Tom W. Hiscock, Wajid Jawaid, Fernando J. Calero-Nieto, Carla Mulas, et al. 2019. “A Single-Cell Molecular Map of Mouse Gastrulation and Early Organogenesis.” Nature 566 (7745):490–95. https://doi.org/10.1038/s41586-019-0933-9.