1 Introduction

The scRNAseq package provides convenient access to several publicly available data sets in the form of SingleCellExperiment objects. The focus of this package is to capture datasets that are not easily read into R with a one-liner from, e.g., read.csv(). Instead, we do the necessary data munging so that users only need to call a single function to obtain a well-formed SingleCellExperiment. For example:

library(scRNAseq)
fluidigm <- ReprocessedFluidigmData()
fluidigm
## class: SingleCellExperiment 
## dim: 26255 130 
## metadata(3): sample_info clusters which_qc
## assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
## rownames(26255): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(130): SRR1275356 SRR1274090 ... SRR1275366 SRR1275261
## colData names(28): NREADS NALIGNED ... Cluster1 Cluster2
## reducedDimNames(0):
## altExpNames(0):

Readers are referred to the SummarizedExperiment and SingleCellExperiment documentation for further information on how to work with SingleCellExperiment objects.
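As a brief illustration of working with such an object, the sketch below pulls out one assay and a few per-cell annotation columns. It assumes the assay and column names shown in the printed output above (rsem_counts, NREADS, NALIGNED) and relies only on standard accessors from SummarizedExperiment; it is an illustrative sketch rather than a recommended workflow.

```r
library(scRNAseq)
library(SingleCellExperiment)

fluidigm <- ReprocessedFluidigmData()

# List the expression matrices stored in the object.
assayNames(fluidigm)

# Extract one assay as a matrix with genes in rows and cells in columns.
counts <- assay(fluidigm, "rsem_counts")
dim(counts)

# Inspect a couple of the per-cell annotation columns.
head(colData(fluidigm)[, c("NREADS", "NALIGNED")])
```

The same accessors apply to every dataset returned by this package, although the assay and column names differ between datasets.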

2 Available data sets

The listDatasets() function returns all available datasets in scRNAseq, along with some summary statistics and the necessary R command to load them.

out <- listDatasets()
Reference | Taxonomy | Part | Number | Call
Aztekin et al. (2019) | Xenopus | tail | 13199 | AztekinTailData()
Bach et al. (2017) | Mouse | mammary gland | 25806 | BachMammaryData()
Baron et al. (2016) | Human | pancreas | 8569 | BaronPancreasData('human')
Baron et al. (2016) | Mouse | pancreas | 1886 | BaronPancreasData('mouse')
Buettner et al. (2015) | Mouse | embryonic stem cells | 288 | BuettnerESCData()
Campbell et al. (2017) | Mouse | brain | 21086 | CampbellBrainData()
Chen et al. (2017) | Mouse | brain | 14437 | ChenBrainData()
Grun et al. (2016) | Mouse | haematopoietic stem cells | 1915 | GrunHSCData()
Grun et al. (2016) | Human | pancreas | 1728 | GrunPancreasData()
Hermann et al. (2018) | Mouse | spermatogenic cells | 2325 | HermannSpermatogenesisData()
Hu et al. (2017) | Mouse | cortex | 48000 | HuCortexData()
Kolodziejczyk et al. (2015) | Mouse | embryonic stem cells | 704 | KolodziejczykESCData()
La Manno et al. (2016) | Human | embryonic stem cells | 1715 | LaMannoBrainData('human-es')
La Manno et al. (2016) | Human | embryonic midbrain | 1977 | LaMannoBrainData('human-embryo')
La Manno et al. (2016) | Human | induced pluripotent stem cells | 337 | LaMannoBrainData('human-ips')
La Manno et al. (2016) | Mouse | adult dopaminergic neurons | 243 | LaMannoBrainData('mouse-adult')
La Manno et al. (2016) | Mouse | embryonic midbrain | 1907 | LaMannoBrainData('mouse-embryo')
Lawlor et al. (2017) | Human | pancreas | 638 | LawlorPancreasData()
Leng et al. (2015) | Human | embryonic stem cells | 460 | LengESCData()
Lun et al. (2017) | Mouse | 416B cells | 192 | LunSpikeInData('416b')
Lun et al. (2017) | Mouse | trophoblasts | 192 | LunSpikeInData('tropho')
Macosko et al. (2015) | Mouse | retina | 49300 | MacoskoRetinaData()
Mahata et al. (2014) | Mouse | T helper cells | 96 | ReprocessedTh2Data()
(???) | Human | peripheral blood mononuclear cells | 29033 | MairPBMCData()
Kotliarov et al. (2020) | Human | peripheral blood mononuclear cells | 58654 | KotliarovPBMCData()
Marques et al. (2016) | Mouse | brain | 5069 | MarquesBrainData()
Messmer et al. (2019) | Human | embryonic stem cells | 1344 | MessmerESCData()
Muraro et al. (2016) | Human | pancreas | 3072 | MuraroPancreasData()
Nestorowa et al. (2016) | Mouse | haematopoietic stem cells | 1920 | NestorowaHSCData()
Paul et al. (2015) | Mouse | haematopoietic stem cells | 10368 | PaulHSCData()
Pollen et al. (2014) | Human | cortex | 130 | ReprocessedFluidigmData()
Richard et al. (2018) | Mouse | CD8+ T cells | 572 | RichardTCellData()
Romanov et al. (2017) | Mouse | brain | 2881 | RomanovBrainData()
Segerstolpe et al. (2016) | Human | pancreas | 3514 | SegerstolpePancreasData()
Shekhar et al. (2016) | Mouse | retina | 44994 | ShekharRetinaData()
Stoeckius et al. (2018) | Mouse | peripheral blood mononuclear cells | 50000 | StoeckiusHashingData(mode='mouse')
Stoeckius et al. (2018) | Human | peripheral blood mononuclear cells | 50000 | StoeckiusHashingData(mode='human')
Stoeckius et al. (2018) | Human | HEK, THP1, K562, KG1 cells | 30000 | StoeckiusHashingData(type='mixed')
Usoskin et al. (2015) | Mouse | brain | 864 | UsoskinBrainData()
Tasic et al. (2016) | Mouse | brain | 1809 | TasicBrainData()
Tasic et al. (2016) | Mouse | visual cortex | 379 | ReprocessedAllenData()
Wu et al. (2019) | Mouse | kidney | 17542 | WuKidneyData()
Xin et al. (2016) | Human | pancreas | 1600 | XinPancreasData()
Zeisel et al. (2015) | Mouse | brain | 3005 | ZeiselBrainData()
Zilionis et al. (2019) | Human | lung | 173954 | ZilionisLungData()
Zilionis et al. (2019) | Mouse | lung | 17549 | ZilionisLungData('mouse')
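Because listDatasets() returns a table, it can be filtered like any other data frame. The following sketch assumes the column names and contents shown in the table above (Taxonomy, Part, Call); these may differ between package versions, so check colnames(out) first. Evaluating the stored Call string is only one possible way to load a match; calling the function directly works just as well.

```r
library(scRNAseq)

out <- listDatasets()

# Select the human pancreas datasets.
# which() guards against missing values in either column.
keep <- which(out$Taxonomy == "Human" & out$Part == "pancreas")
pancreas <- out[keep, ]
pancreas$Call

# Load the first match by evaluating its call string.
# This downloads the data on first use.
sce <- eval(parse(text = pancreas$Call[1]))
sce
```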

If the original dataset was not provided with Ensembl annotation, the identifiers can be mapped by setting ensembl=TRUE. Any genes without a corresponding Ensembl identifier are discarded from the dataset.

sce <- ZeiselBrainData(ensembl=TRUE)
head(rownames(sce))
## [1] "ENSMUSG00000029669" "ENSMUSG00000046982" "ENSMUSG00000039735"
## [4] "ENSMUSG00000033453" "ENSMUSG00000046798" "ENSMUSG00000034009"
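To see how many rows are lost in the conversion, one can compare the dataset with and without the mapping. This is a minimal sketch; the object names sce.symbol and sce.ens are chosen here for illustration only.

```r
library(scRNAseq)

# Load the same dataset with its original identifiers and with Ensembl identifiers.
sce.symbol <- ZeiselBrainData()
sce.ens <- ZeiselBrainData(ensembl = TRUE)

# Number of rows before and after the mapping.
nrow(sce.symbol)
nrow(sce.ens)

# Rows dropped during the conversion.
nrow(sce.symbol) - nrow(sce.ens)
```

The original gene names remain available in sce.symbol should they be needed for downstream interpretation.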

The functions also accept a location=TRUE argument that loads the genomic coordinates of each gene.

sce <- ZeiselBrainData(ensembl=TRUE, location=TRUE)
head(rowRanges(sce))
## GRanges object with 6 ranges and 2 metadata columns:
##                      seqnames              ranges strand | featureType
##                         <Rle>           <IRanges>  <Rle> | <character>
##   ENSMUSG00000029669        6   21771395-21852515      - |  endogenous
##   ENSMUSG00000046982       18   84011627-84087706      - |  endogenous
##   ENSMUSG00000039735        3 122538719-122619715      - |  endogenous
##   ENSMUSG00000033453        9   30899155-30922452      - |  endogenous
##   ENSMUSG00000046798        5     5489537-5514958      - |  endogenous
##   ENSMUSG00000034009        3   79641611-79737880      - |  endogenous
##                      originalName
##                       <character>
##   ENSMUSG00000029669      Tspan12
##   ENSMUSG00000046982        Tshz1
##   ENSMUSG00000039735       Fnbp1l
##   ENSMUSG00000033453     Adamts15
##   ENSMUSG00000046798       Cldn12
##   ENSMUSG00000034009        Rxfp1
##   -------
##   seqinfo: 118 sequences from GRCm38 genome
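Once the coordinates are loaded, they can be used to subset the dataset by genomic location. The sketch below keeps only the genes on chromosome 6. It assumes that rowRanges() returns a GRanges, as in the output above, and that the chromosome is named "6" as shown there.

```r
library(scRNAseq)
library(SingleCellExperiment)
library(GenomicRanges)

sce <- ZeiselBrainData(ensembl = TRUE, location = TRUE)

# Extract the gene coordinates.
gr <- rowRanges(sce)

# Keep the rows whose gene lies on chromosome 6.
on.chr6 <- which(as.character(seqnames(gr)) == "6")
sce.chr6 <- sce[on.chr6, ]

# Summarize the retained features by type.
table(rowRanges(sce.chr6)$featureType)
```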

3 Adding new data sets

Please contact us if you have a data set that you would like to see added to this package. The only requirement is that your data set has publicly available expression values (ideally counts) and sample annotation. The more difficult/custom the format, the better, as its inclusion in this package will provide more value for other users in the R/Bioconductor community.

If you have already written code that processes your desired data set into a SingleCellExperiment-like form, we would welcome a pull request. The process can be expedited by ensuring that you have the following files:

Potential contributors are encouraged to examine some of the existing scripts in the package to pick up the coding conventions. Remember, we’re more likely to accept a contribution if it’s indistinguishable from something we might have written ourselves!
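For orientation, the end product of any processing script is a count matrix paired with per-cell annotation. The sketch below assembles a toy SingleCellExperiment from simulated values. The gene names, cell names and the batch column are hypothetical, and the code does not reproduce the package's own script conventions, which should be taken from the existing scripts.

```r
library(SingleCellExperiment)

set.seed(1)

# Simulated counts for 5 genes across 4 cells (hypothetical data).
counts <- matrix(rpois(20, lambda = 5), nrow = 5,
    dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4)))

# Per-cell annotation, with row names matching the columns of the count matrix.
coldata <- DataFrame(batch = c("a", "a", "b", "b"),
    row.names = colnames(counts))

# Combine the counts and the annotation into a single object.
sce <- SingleCellExperiment(assays = list(counts = counts), colData = coldata)
sce
```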


