Compiled date: 2024-05-02

Last edited: 2019-05-24

License: MIT + file LICENSE

1 Introducing the HCAData package

The HCAData package provides direct access to the datasets generated by the Human Cell Atlas project for further processing in R and Bioconductor. It does so by providing the datasets as SingleCellExperiment objects, a format that is both efficient and widely adopted across existing Bioconductor workflows.

The datasets are also available in other formats, including raw data, at this link: http://preview.data.humancellatlas.org/.

2 Installing HCAData

The HCAData package can be installed in the conventional way via BiocManager.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("HCAData")

This package makes extensive use of the HDF5Array package to avoid loading the entire dataset into memory. Instead, it stores the counts on disk as an HDF5 file, and loads subsets of the data into memory upon request.
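
As a minimal sketch of this on-disk behaviour, the example below uses a small simulated matrix rather than the HCA data (the object names are illustrative only). The matrix is written to a temporary HDF5 file with writeHDF5Array(), and only a small block is read back into memory:

```r
library(HDF5Array)

# Simulated counts: 100 genes x 20 cells (toy data, not from the HCA)
set.seed(1)
toy <- matrix(rpois(2000, lambda = 2), nrow = 100, ncol = 20)

# Write the matrix to a temporary HDF5 file; the returned object
# only points to the data stored on disk
toy_h5 <- writeHDF5Array(toy)
class(toy_h5)

# Subsetting is delayed; as.matrix() loads just the requested block
block <- as.matrix(toy_h5[1:5, 1:3])
dim(block)
```

The same idea applies, at a much larger scale, to the count matrices returned by HCAData().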

3 Loading HCAData and the Human Cell Atlas data

library("HCAData")

We use the HCAData function to download the relevant files from Bioconductor’s ExperimentHub web resource. If no argument is provided, the function returns a list of the available datasets, i.e. the names that can be passed as the dataset parameter when calling HCAData.

HCAData()

The list of relevant files includes the HDF5 file containing the counts, as well as the metadata on the rows (genes) and columns (cells).

The output is a single SingleCellExperiment object from the SingleCellExperiment package.

Because the package is based on ExperimentHub, its data can be queried directly using the package name. The single components of each dataset can be retrieved via their ExperimentHub accession numbers, or the whole dataset can be assembled with the convenience function provided in this package.

suppressPackageStartupMessages({
  library("ExperimentHub")
  library("SingleCellExperiment")
})

eh <- ExperimentHub()
query(eh, "HCAData")
#> ExperimentHub with 6 records
#> # snapshotDate(): 2024-04-29
#> # $dataprovider: Human Cell Atlas
#> # $species: Homo sapiens
#> # $rdataclass: character
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["EH2047"]]' 
#> 
#>            title                                                              
#>   EH2047 | Human Cell Atlas - Census of Immune Cells, Bone marrow, 'dense m...
#>   EH2048 | Human Cell Atlas - Census of Immune Cells, Bone marrow, sample (...
#>   EH2049 | Human Cell Atlas - Census of Immune Cells, Bone marrow, gene (ro...
#>   EH2050 | Human Cell Atlas - Census of Immune Cells, Umbilical cord blood,...
#>   EH2051 | Human Cell Atlas - Census of Immune Cells, Umbilical cord blood,...
#>   EH2052 | Human Cell Atlas - Census of Immune Cells, Umbilical cord blood,...

# these three are the components to the bone marrow dataset
bonemarrow_h5densematrix <- eh[["EH2047"]]
bonemarrow_coldata <- eh[["EH2048"]]
bonemarrow_rowdata <- eh[["EH2049"]]

# and are put together when calling...
sce_bonemarrow <- HCAData("ica_bone_marrow")
sce_bonemarrow
#> class: SingleCellExperiment 
#> dim: 33694 378000 
#> metadata(0):
#> assays(1): counts
#> rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
#>   ENSG00000268674
#> rowData names(2): ID Symbol
#> colnames(378000): MantonBM1_HiSeq_1-AAACCTGAGCAGGTCA-1
#>   MantonBM1_HiSeq_1-AAACCTGCACACTGCG-1 ...
#>   MantonBM8_HiSeq_8-TTTGTCATCTGCCAGG-1
#>   MantonBM8_HiSeq_8-TTTGTCATCTTGAGAC-1
#> colData names(1): Barcode
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

# similarly, to access the umbilical cord blood dataset
sce_cordblood <- HCAData("ica_cord_blood")
sce_cordblood
#> class: SingleCellExperiment 
#> dim: 33694 384000 
#> metadata(0):
#> assays(1): counts
#> rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
#>   ENSG00000268674
#> rowData names(2): ID Symbol
#> colnames(384000): MantonCB1_HiSeq_1-AAACCTGAGGAGTTGC-1
#>   MantonCB1_HiSeq_1-AAACCTGAGGCATTGG-1 ...
#>   MantonCB8_HiSeq_8-TTTGTCATCCAGATCA-1
#>   MantonCB8_HiSeq_8-TTTGTCATCGGTCTAA-1
#> colData names(1): Barcode
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
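
To picture how a count matrix, row metadata and column metadata fit together in one object, here is a toy sketch with made-up values. It is not the package's actual implementation, and all names and values below are hypothetical stand-ins for the three components:

```r
library(SingleCellExperiment)

# Toy stand-ins for the three components (hypothetical values)
toy_counts <- matrix(
  c(0, 3, 1, 0, 2, 5, 4, 0, 0, 1, 7, 2),
  nrow = 4,
  dimnames = list(paste0("ENSG0000000000", 1:4), paste0("cell", 1:3))
)
toy_rowdata <- DataFrame(
  ID = rownames(toy_counts),
  Symbol = c("GeneA", "GeneB", "GeneC", "GeneD"),
  row.names = rownames(toy_counts)
)
toy_coldata <- DataFrame(
  Barcode = colnames(toy_counts),
  row.names = colnames(toy_counts)
)

# Combine counts, gene metadata and cell metadata into one object
toy_sce <- SingleCellExperiment(
  assays = list(counts = toy_counts),
  rowData = toy_rowdata,
  colData = toy_coldata
)
toy_sce
```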

4 Explore data with iSEE

The datasets are provided in the form of a SingleCellExperiment object. A natural companion to this data structure is the iSEE package, which can be used for interactive and reproducible data exploration.

Any analysis steps should be performed before calling iSEE. Since these datasets can be quite large, such operations can be time-consuming and/or require a considerable amount of resources.

4.1 Processing a subset of the HCA bone marrow data

For the scope of this vignette, we subset some cells of the bone marrow dataset to reduce the runtime, and apply some of the steps one would typically follow when analysing droplet-based datasets. For more information on how to properly process such datasets, please refer to the amazing set of resources available in the simpleSingleCell workflow package.

In brief: we start by loading the required libraries for preprocessing, and take a random subset of 1000 cells from the bone marrow dataset.

library("scran")
library("BiocSingular")
library("scater")
library("scuttle")

set.seed(42)
sce <- sce_bonemarrow[, sample(seq_len(ncol(sce_bonemarrow)), 1000, replace = FALSE)]

First, we relabel the rows with the gene symbols for easier reading, using the uniquifyFeatureNames() function from scuttle. Then we compute some QC metrics and add them to the sce object, using addPerCellQC().

rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
head(rownames(sce))
#> [1] "RP11-34P13.3"  "FAM138A"       "OR4F5"         "RP11-34P13.7" 
#> [5] "RP11-34P13.8"  "RP11-34P13.14"

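# anchor the pattern so that only symbols starting with "MT-" are flagged
is.mito <- grep("^MT-", rownames(sce))

# realize the counts of the 1000-cell subset as an ordinary in-memory matrix
counts(sce) <- as.matrix(counts(sce))
sce <- scuttle::addPerCellQC(sce, subsets=list(Mito=is.mito))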
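
For intuition, the subsets_Mito_percent value that addPerCellQC() stores for each cell is the share of that cell's counts falling in the flagged genes. The toy sketch below recomputes it by hand in base R. The matrix is invented, and "XMT-1" is a made-up symbol included only to show why a pattern anchored with ^ is the safer choice:

```r
# Toy count matrix: 4 genes x 2 cells (invented values)
genes <- c("MT-CO1", "MT-ND1", "ACTB", "XMT-1")
toy <- matrix(c(10, 10, 70, 10,
                 5,  0, 40,  5),
              nrow = 4, dimnames = list(genes, c("c1", "c2")))

# Unanchored pattern: also matches the made-up "XMT-1"
grep("MT-", genes)

# Anchored pattern: only symbols that start with "MT-"
is_mito <- grep("^MT-", genes)

# Percentage of each cell's counts assigned to the flagged genes
mito_pct <- 100 * colSums(toy[is_mito, , drop = FALSE]) / colSums(toy)
mito_pct
#> c1 c2 
#> 20 10
```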

We proceed with normalization. For reference, we first compute the library size factors; we then apply the deconvolution method implemented in scran. A pre-clustering step with quickCluster() is performed first, to avoid pooling cells that differ strongly from each other.

lib.sf.bonemarrow <- librarySizeFactors(sce)
summary(lib.sf.bonemarrow)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.0157  0.3407  0.6291  1.0000  0.9087 18.0304

set.seed(42)
clusters <- quickCluster(sce)
table(clusters)
#> clusters
#>   1   2   3   4   5 
#> 137 232 207 239 185
sce <- computeSumFactors(sce, min.mean=0.1, cluster=clusters)
sce <- logNormCounts(sce)
assayNames(sce)
#> [1] "counts"    "logcounts"
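
As a rough illustration of what a size factor does, the sketch below computes library size factors by hand on an invented matrix (each cell's total count divided by the mean total, so that the factors average to 1) and then applies a log2 transformation with a pseudo-count of 1. This is a simplification for intuition only: the deconvolution factors from computeSumFactors() are estimated differently.

```r
# Toy count matrix: 3 genes x 2 cells (invented values)
toy <- matrix(c(60,  30, 10,
                90, 150, 60),
              nrow = 3, dimnames = list(c("g1", "g2", "g3"), c("c1", "c2")))

# Library sizes are the per-cell totals
lib_sizes <- colSums(toy)

# Scale so that the size factors have mean 1
size_factors <- lib_sizes / mean(lib_sizes)
size_factors
#>  c1  c2 
#> 0.5 1.5

# Divide each cell by its size factor, then log2-transform with pseudo-count 1
log_norm <- log2(sweep(toy, 2, size_factors, "/") + 1)
```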

In the following lines of code, we model the mean-variance trend, assuming Poisson-distributed technical noise, to extract the biological component of the variance for each gene.

dec.bonemarrow <- modelGeneVarByPoisson(sce)
top.dec <- dec.bonemarrow[order(dec.bonemarrow$bio, decreasing=TRUE),] 
head(top.dec)
#> DataFrame with 6 rows and 6 columns
#>             mean     total      tech       bio   p.value       FDR
#>        <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> S100A8   1.09003   5.19285   1.03231   4.16054         0         0
#> S100A9   1.08007   4.96772   1.02675   3.94097         0         0
#> MALAT1   7.23554   4.50786   0.69933   3.80853         0         0
#> LYZ      1.12696   4.79231   1.05247   3.73984         0         0
#> HBB      1.07098   4.35318   1.02161   3.33157         0         0
#> RPS27    4.96094   3.76764   1.22724   2.54039         0         0

fit.bonemarrow <- metadata(dec.bonemarrow)
plot(fit.bonemarrow$mean, fit.bonemarrow$var, xlab="Mean of log-expression",
    ylab="Variance of log-expression", pch = 16)
curve(fit.bonemarrow$trend(x), col="dodgerblue", add=TRUE, lwd=2)

plotExpression(sce, features=rownames(top.dec)[1:10])
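
The columns of the head(top.dec) table are linked by a simple relation: the biological component is what remains of the total variance once the fitted technical component is subtracted (bio = total - tech). As a quick check, the sketch below recomputes it from the first three rows printed above (values copied from that output):

```r
# total and tech values copied from the head(top.dec) output
top <- data.frame(
  total = c(5.19285, 4.96772, 4.50786),
  tech  = c(1.03231, 1.02675, 0.69933),
  row.names = c("S100A8", "S100A9", "MALAT1")
)

# biological component = total variance - technical component
top$bio <- top$total - top$tech
top
```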

With the denoisePCA function from scran, we can compute a reduced dimension view of the data, accounting for the Poisson technical trend.

Once the PCA has been computed, we provide it as input to runTSNE (via the dimred argument) and obtain the t-SNE results, which are stored in the appropriate slot of our sce object.

hvg.bonemarrow <- getTopHVGs(dec.bonemarrow, prop = 0.1)

set.seed(42)
sce <- denoisePCA(sce, 
                  technical=dec.bonemarrow, 
                  subset.row = hvg.bonemarrow, 
                  BSPARAM=IrlbaParam())
ncol(reducedDim(sce, "PCA"))
#> [1] 5
plot(attr(reducedDim(sce), "percentVar"), xlab="PC",
     ylab="Proportion of variance explained")
abline(v=ncol(reducedDim(sce, "PCA")), lty=2, col="red")

plotPCA(sce, ncomponents=3, colour_by="subsets_Mito_percent")


set.seed(42)
sce <- runTSNE(sce, dimred="PCA", perplexity=30)
plotTSNE(sce, colour_by="subsets_Mito_percent")

After this, we can compute a shared nearest neighbour graph to identify clusters in our dataset. Once the cluster memberships are defined, we assign this vector to a corresponding colData slot, and plot the t-SNE for our subset, coloured accordingly.

snn.gr <- buildSNNGraph(sce, use.dimred="PCA")
clusters <- igraph::cluster_walktrap(snn.gr)
sce$Cluster <- factor(clusters$membership)
table(sce$Cluster)
#> 
#>   1   2   3   4   5   6   7   8   9  10  11  12  13 
#>  62 218 109  96 121  56  41  62  55  82  30  34  34
plotTSNE(sce, colour_by="Cluster")
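
To see the graph-based clustering step in isolation, here is a small sketch on an artificial graph, not derived from the data: two fully connected groups of five nodes joined by a single edge. The walktrap algorithm is run on it in the same way as on the shared nearest neighbour graph, and the memberships are converted to a factor.

```r
# Artificial graph: two 5-node cliques connected by one bridging edge
g <- igraph::disjoint_union(igraph::make_full_graph(5),
                            igraph::make_full_graph(5))
g <- igraph::add_edges(g, c(5, 6))

# Community detection, as done on the SNN graph
wt <- igraph::cluster_walktrap(g)

# One cluster label per node, stored as a factor
toy_cluster <- factor(wt$membership)
table(toy_cluster)
```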

Even though we only took a subset of the full data, we can observe that the different clusters are nicely separated. Subsequent steps would involve the identification of marker genes and more advanced downstream techniques, which are beyond the scope of this vignette. To read more on what can be done, the vignettes of the simpleSingleCell workflow package are an excellent place to start.

4.2 Exploring the dataset with iSEE

Once the processing steps above are done, we can call iSEE with the subsampled SingleCellExperiment object.

if (require(iSEE)) {
  iSEE(sce)
}

4.3 Saving the processed object

You can save the sce object to a serialized R object with

destination <- "where/to/store/the/processed/data.rds"
saveRDS(sce, file = destination)

The object can be read into a new R session with readRDS(destination), provided the HDF5 file remains in its original location (conveniently stored in the default location of ExperimentHub).

Session info

sessionInfo()
#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] scater_1.32.0               ggplot2_3.5.1              
#>  [3] BiocSingular_1.20.0         scran_1.32.0               
#>  [5] scuttle_1.14.0              rhdf5_2.48.0               
#>  [7] ExperimentHub_2.12.0        AnnotationHub_3.12.0       
#>  [9] BiocFileCache_2.12.0        dbplyr_2.5.0               
#> [11] HCAData_1.20.0              SingleCellExperiment_1.26.0
#> [13] SummarizedExperiment_1.34.0 Biobase_2.64.0             
#> [15] GenomicRanges_1.56.0        GenomeInfoDb_1.40.0        
#> [17] IRanges_2.38.0              S4Vectors_0.42.0           
#> [19] BiocGenerics_0.50.0         MatrixGenerics_1.16.0      
#> [21] matrixStats_1.3.0           BiocStyle_2.32.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.2                 gridExtra_2.3            
#>   [3] rlang_1.1.3               magrittr_2.0.3           
#>   [5] compiler_4.4.0            RSQLite_2.3.6            
#>   [7] DelayedMatrixStats_1.26.0 png_0.1-8                
#>   [9] vctrs_0.6.5               pkgconfig_2.0.3          
#>  [11] crayon_1.5.2              fastmap_1.1.1            
#>  [13] magick_2.8.3              XVector_0.44.0           
#>  [15] labeling_0.4.3            utf8_1.2.4               
#>  [17] rmarkdown_2.26            ggbeeswarm_0.7.2         
#>  [19] UCSC.utils_1.0.0          tinytex_0.50             
#>  [21] purrr_1.0.2               bit_4.0.5                
#>  [23] xfun_0.43                 bluster_1.14.0           
#>  [25] zlibbioc_1.50.0           cachem_1.0.8             
#>  [27] beachmat_2.20.0           jsonlite_1.8.8           
#>  [29] blob_1.2.4                highr_0.10               
#>  [31] rhdf5filters_1.16.0       DelayedArray_0.30.0      
#>  [33] Rhdf5lib_1.26.0           BiocParallel_1.38.0      
#>  [35] irlba_2.3.5.1             parallel_4.4.0           
#>  [37] cluster_2.1.6             R6_2.5.1                 
#>  [39] bslib_0.7.0               limma_3.60.0             
#>  [41] jquerylib_0.1.4           Rcpp_1.0.12              
#>  [43] bookdown_0.39             knitr_1.46               
#>  [45] Matrix_1.7-0              igraph_2.0.3             
#>  [47] tidyselect_1.2.1          viridis_0.6.5            
#>  [49] abind_1.4-5               yaml_2.3.8               
#>  [51] codetools_0.2-20          curl_5.2.1               
#>  [53] lattice_0.22-6            tibble_3.2.1             
#>  [55] withr_3.0.0               KEGGREST_1.44.0          
#>  [57] Rtsne_0.17                evaluate_0.23            
#>  [59] Biostrings_2.72.0         pillar_1.9.0             
#>  [61] BiocManager_1.30.22       filelock_1.0.3           
#>  [63] generics_0.1.3            BiocVersion_3.19.1       
#>  [65] munsell_0.5.1             scales_1.3.0             
#>  [67] sparseMatrixStats_1.16.0  glue_1.7.0               
#>  [69] metapod_1.12.0            tools_4.4.0              
#>  [71] BiocNeighbors_1.22.0      ScaledMatrix_1.12.0      
#>  [73] locfit_1.5-9.9            cowplot_1.1.3            
#>  [75] grid_4.4.0                colorspace_2.1-0         
#>  [77] AnnotationDbi_1.66.0      edgeR_4.2.0              
#>  [79] GenomeInfoDbData_1.2.12   beeswarm_0.4.0           
#>  [81] HDF5Array_1.32.0          vipor_0.4.7              
#>  [83] cli_3.6.2                 rsvd_1.0.5               
#>  [85] rappdirs_0.3.3            fansi_1.0.6              
#>  [87] viridisLite_0.4.2         S4Arrays_1.4.0           
#>  [89] dplyr_1.1.4               gtable_0.3.5             
#>  [91] sass_0.4.9                digest_0.6.35            
#>  [93] ggrepel_0.9.5             SparseArray_1.4.0        
#>  [95] dqrng_0.3.2               farver_2.1.1             
#>  [97] memoise_2.0.1             htmltools_0.5.8.1        
#>  [99] lifecycle_1.0.4           httr_1.4.7               
#> [101] statmod_1.5.0             mime_0.12                
#> [103] bit64_4.0.5